Basic Diagnostics for Correctable/Uncorrectable ECC Memory Errors with Intel® Server Boards

Documentation

Troubleshooting

000024007

12/21/2023

Note: For help with the troubleshooting steps described in this article, refer to the Technical Product Specifications for your server platform.

What am I seeing?

Correctable and/or uncorrectable Error Correcting Code (ECC) events are reported for memory modules. For example:

Mmry ECC Sensor SMI Handler Warning Memory CPU: 1, DIMM: D0 DIMM Rank: 1. - Correctable ECC / other correctable memory error - Asserted.
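When many such events need to be reviewed, it can help to extract the CPU, DIMM, and rank fields programmatically. The following is a minimal sketch that assumes log lines follow the layout of the single example above; the pattern and the `parse_ecc_event` helper are illustrative assumptions, not an Intel tool, and event text may differ on other platforms.

```python
import re

# Pattern modelled only on the example line above; other platforms may differ.
SEL_PATTERN = re.compile(
    r"CPU:\s*(?P<cpu>\d+),\s*DIMM:\s*(?P<dimm>\w+)\s+DIMM Rank:\s*(?P<rank>\d+)\.?"
    r"\s*-\s*(?P<kind>Correctable|Uncorrectable) ECC"
)


def parse_ecc_event(line):
    """Return CPU, DIMM, rank and error kind from an ECC event line, or None."""
    match = SEL_PATTERN.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "dimm": match.group("dimm"),
        "rank": int(match.group("rank")),
        "kind": match.group("kind").lower(),
    }


example = (
    "Mmry ECC Sensor SMI Handler Warning Memory CPU: 1, DIMM: D0 "
    "DIMM Rank: 1. - Correctable ECC / other correctable memory error - Asserted."
)
print(parse_ecc_event(example))
```

Lines that do not match the assumed layout return `None`, so unrecognized events can be set aside for manual review rather than silently miscounted.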

What is a memory Error Correction Code (ECC) correctable error event?

An ECC correctable error event represents a threshold overflow for a given Dual In-line Memory Module (DIMM) within a given timeframe.
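The idea of a threshold overflow within a timeframe can be illustrated with a small sliding-window counter. This is a conceptual sketch only: the `ThresholdCounter` class, the threshold of 3, and the 60-second window are hypothetical values chosen for the example, and the actual firmware mechanism and limits are platform-specific and not described in this article.

```python
from collections import deque


class ThresholdCounter:
    """Report an event only when more than `threshold` errors fall within `window_s` seconds."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold
        self.window_s = window_s
        self.timestamps = deque()

    def record(self, t):
        """Record one corrected error at time t (seconds); return True on overflow."""
        self.timestamps.append(t)
        # Drop errors that have aged out of the window.
        while self.timestamps and t - self.timestamps[0] >= self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) > self.threshold:
            self.timestamps.clear()  # start counting again after reporting
            return True
        return False


counter = ThresholdCounter(threshold=3, window_s=60)  # hypothetical values
results = [counter.record(t) for t in (0, 10, 20, 30, 200)]
print(results)
```

In this sketch the first three corrected errors are absorbed silently, the fourth one inside the window triggers the event, and a later isolated error does not.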


How to fix it:

Memory data errors are logged as correctable or uncorrectable. Refer to the instructions below, based on the error type you encounter:

error types

Notes
  • If there is no catastrophic issue (Purple Screen of Death (PSOD) or unexpected restart) and the number of correctable ECC errors, including Adaptive Double Device Data Correction (ADDDC) errors, is fewer than 10 events every 24 hours for each DIMM location (that is, within the threshold limit), monitor the server for any recurrence of the ECC error at each DIMM location that triggered the event.
     
  • If there is a catastrophic issue (Purple Screen of Death (PSOD) or unexpected restart) and the number of correctable ECC errors, including Adaptive Double Device Data Correction (ADDDC) errors, is fewer than 10 events every 24 hours for each DIMM location, re-seat the DIMM at each affected location by following the steps below:
    1. Power OFF the system and remove the AC power cable.
    2. Identify the DIMM location to re-seat. Refer to the Technical Product Specifications for your server platform to identify the DIMM location.
    3. Re-seat the identified DIMM.
    4. Reconnect the AC power cable and power the system back ON.
    5. Observe the system for 24 hours for any recurrence of the ECC error.
    6. If the ECC error persists at the same DIMM location that was re-seated, generate the System Event Log (SEL) and Debug logs from the BMC Web Console and send both to Intel Customer Support.
  • The Advanced Memory Test (AMT) features were introduced in the BIOS and firmware stack starting with BIOS revision 02.01.0014 for the Intel® Server Systems S2600BP, S2600WF, and S2600ST, and starting with BIOS revision 22.01.0097 for the Intel® Server System S9200WK. For these products, Intel recommends enabling the AMT and Post Package Repair (PPR) features through the BIOS setup utility to perform a full check of memory health. Refer to Chapter 5 in Memory Replacement Guideline and Advanced Memory Test for Intel® Server Products Based on Intel® 62X Chipset – White Paper for detailed steps.
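The monitoring and re-seat notes above can be sketched as a small triage helper. This is an illustration under stated assumptions, not an Intel tool: the `triage` function, the event tuples, and the DIMM location labels are hypothetical, and only the limit of 10 events per 24 hours per DIMM location comes from the notes. Counts at or above that limit are reported as outside the scope of these notes rather than mapped to an invented action.

```python
from collections import Counter
from datetime import datetime, timedelta

THRESHOLD = 10                # events per DIMM location, from the notes above
WINDOW = timedelta(hours=24)  # observation window, from the notes above


def triage(events, now, catastrophic):
    """Map each DIMM location to a recommendation.

    events: iterable of (timestamp, dimm_location) for correctable ECC events.
    catastrophic: True if a PSOD or unexpected restart occurred.
    """
    # Count only events that fall inside the last 24 hours.
    recent = Counter(
        dimm for ts, dimm in events if timedelta(0) <= now - ts < WINDOW
    )
    advice = {}
    for dimm, count in recent.items():
        if count >= THRESHOLD:
            advice[dimm] = "at or above threshold: not covered by these notes"
        elif catastrophic:
            advice[dimm] = "re-seat DIMM, then observe for 24 hours"
        else:
            advice[dimm] = "monitor for recurrence"
    return advice


now = datetime(2023, 12, 21, 12, 0)
# Hypothetical location labels; use the names from your platform's documentation.
events = [(now - timedelta(hours=h), "CPU1_D0") for h in range(3)]
events.append((now - timedelta(hours=30), "CPU2_A1"))  # outside the window
print(triage(events, now, catastrophic=False))
```

Counting per DIMM location rather than per system mirrors the wording of the notes, which apply the limit to each location separately.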

Notes

Correctable Error Correction Code (ECC) errors are self-correcting. Depending on the Reliability, Availability, and Serviceability (RAS) configuration of the memory, the Integrated Memory Controller (IMC) may take the affected DIMM offline.

Event definitions differ between Intel server platforms. Refer to the System Event Log Troubleshooting Guide for your server platform.

Intel recommends downloading and updating the system BIOS to the latest available version for your server platform.

If the system is an Intel® Data Center Block for Nutanix* Enterprise Cloud, visit the Nutanix* Life Cycle Manager page instead. For a hardware and firmware compatibility list, visit the Nutanix* Hardware and Firmware compatibility page.

 

Related topics
Memory Replacement Guideline and Advanced Memory Test for Intel® Server Products Based on Intel® 62X Chipset – White Paper
The Role of ECC Memory
How to Recover from an IERR for Intel® Server Boards
My Server Crashes and Shows this Error: Processor CPU Machine Chk
For firmware updates and troubleshooting tips
What is Memory Error Correction Code (ECC) Correctable Error Event?
SDLA Tool How to count ECC Errors