Intel® In-Field Scan for 5th Gen Intel® Xeon® Processors Improves Server Fleet Management
Intel® In-Field Scan Overview
5th Gen Intel® Xeon® Scalable processors, formerly codenamed Emerald Rapids, introduced a new Reliability, Availability, and Serviceability (RAS) capability called Intel® In-Field Scan. This is a family of tools designed to help system administrators quickly and easily find processors that have failed over time. Intel® In-Field Scan has a roadmap of capabilities that will be included on current and future processors. Scan-at-Field (SAF) and Array Built In Self Test (BIST) are the first two features within the In-Field Scan family, and both are available on 5th Gen Intel® Xeon® processors.
Intel® In-Field Scan is minimally intrusive and designed to quickly test one core, while all the other cores in the node continue to run customer workloads.
Scan At Field Summary
Scan is an industry-standard method for detecting faults in semiconductor devices. Until now, scan has been used by specialized test equipment in chip manufacturing factories. Intel uses scan to test processors during High-Volume Manufacturing (HVM).
Scan-At-Field enables customers to run a subset of Intel’s manufacturing scan tests to check individual processing cores for faults. Using Intel-supplied test patterns (called Scan Test Images), each core within the processor package can be independently tested to confirm proper operation.
Array Built In Self Test (BIST)
Array BIST checks the L1 (Level 1) and L2 (Level 2) caches and many of the register files and data arrays in each core. Because Array BIST is a built-in self-test, there are no test images to load; a dedicated test module in each core coordinates all of the testing.
More Information
A high-level technical overview of SAF and Array BIST is provided in the Finding Faulty Components in a Live Fleet Environment technical paper. Details on the system requirements and how to run In-Field Scan are provided in the Intel® In-Field Scan for 5th Gen Intel® Xeon® Processors Enabling Guide.
Intel® In-Field Scan is an important step forward in the domain of reliability and availability services, as it enables customers to use industry test capabilities to rapidly identify defective units in their fleet.
System Requirements
Enabling Intel® In-Field Scan on a platform requires the following hardware and software:
- An Intel® Xeon® processor that supports Intel® In-Field Scan
- Scan Test Images (scan test patterns for the cores)
- The Intel® In-Field Scan Linux device driver
- The Intel® In-Field Scan Application
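As a rough illustration of how the device driver listed above might be exercised from user space, the following Python sketch triggers a test on one core and reads back the result. It assumes the sysfs layout described in the upstream Linux kernel documentation for the `intel_ifs` driver (`run_test`, `status`, and `details` files under `/sys/devices/virtual/misc/intel_ifs_0`). This article does not specify that interface, so treat the paths as assumptions and defer to the Enabling Guide. The helper names `run_core_test` and `is_healthy` are hypothetical.

```python
from pathlib import Path

# Assumed location, based on the upstream Linux intel_ifs sysfs ABI.
# Verify against the Enabling Guide for your platform and kernel.
IFS_SYSFS_ROOT = Path("/sys/devices/virtual/misc")


def run_core_test(cpu: int, instance: str = "intel_ifs_0",
                  root: Path = IFS_SYSFS_ROOT) -> dict:
    """Request a test of the core that owns logical CPU `cpu` and return the result."""
    dev = Path(root) / instance
    # Writing a logical CPU number asks the driver to test that CPU's core.
    (dev / "run_test").write_text(f"{cpu}\n")
    return {
        "cpu": cpu,
        # Assumed values: "pass", "fail", or "untested".
        "status": (dev / "status").read_text().strip(),
        # Assumed to be a hex status word with extra detail on the last test.
        "details": (dev / "details").read_text().strip(),
    }


def is_healthy(result: dict) -> bool:
    """Treat only an explicit "pass" as healthy; "untested" is not a pass."""
    return result["status"] == "pass"
```

On a real system this would typically need root privileges and, for Scan-At-Field, a Scan Test Image that has already been loaded. The `root` parameter exists so the logic can be tried against an ordinary directory before pointing it at sysfs.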
Testing and Test Results
Intel® In-Field Scan is designed for system administrators who test the fleet periodically to confirm that the processors are operating correctly. It provides a very fast processor test that can be run on live nodes (nodes that are online and running user applications) without interrupting the operation of the entire node. Here, "very fast" means roughly 200 ms or less.
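To put the ~200 ms figure in perspective, the short Python sketch below estimates how long it would take to test every core in a node one at a time, and what fraction of a core's time periodic testing would consume. The core count and the test interval are hypothetical inputs, and the calculation treats 200 ms as an upper bound on test time while ignoring any scheduling or driver overhead.

```python
def sweep_seconds(core_count: int, per_core_ms: float = 200.0) -> float:
    """Time to test every core sequentially, assuming per_core_ms per core."""
    return core_count * per_core_ms / 1000.0


def core_time_fraction(interval_hours: float, per_core_ms: float = 200.0) -> float:
    """Fraction of one core's time spent under test when tested once per interval."""
    return (per_core_ms / 1000.0) / (interval_hours * 3600.0)


# Hypothetical node with 128 cores, each core tested once per day.
print(f"Full sweep: {sweep_seconds(128):.1f} s")
print(f"Per-core time under test: {core_time_fraction(24):.2e}")
```

Under these assumptions a full sweep takes about 25.6 seconds of cumulative test time, and a daily test occupies only a few millionths of each core's time, which is consistent with the description of the test as minimally intrusive.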
Periodic testing of the fleet is recommended to find components that have failed over time. How often to test the fleet and how extensive a test to run is a complex question. Many variables come into play, for example: how long the processor has been running; the predicted Failure in Time (FIT) rate of the processor; the customer's tolerance for Silent Data Errors (SDE); and the amount of time the system administrator is willing to devote to proactive system maintenance.
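The FIT rate can be turned into a rough planning number. One FIT is one failure per billion (10^9) device-hours, so the expected number of failures in a fleet scales with fleet size, FIT rate, and operating hours. The Python sketch below works through that arithmetic; the fleet size and FIT value are hypothetical placeholders, not Intel figures, and the model assumes a constant failure rate.

```python
FIT_DEVICE_HOURS = 1e9  # One FIT = one failure per 10^9 device-hours.


def expected_failures(fleet_size: int, fit_rate: float, hours: float) -> float:
    """Expected failures across the fleet over `hours` of operation."""
    return fleet_size * fit_rate * hours / FIT_DEVICE_HOURS


def mean_hours_between_failures(fleet_size: int, fit_rate: float) -> float:
    """Average time between failures somewhere in the fleet."""
    return FIT_DEVICE_HOURS / (fleet_size * fit_rate)


# Hypothetical fleet: 100,000 processors at an assumed FIT rate of 100.
fleet, fit = 100_000, 100.0
print(f"Expected failures per year: {expected_failures(fleet, fit, 8760):.1f}")
print(f"Mean hours between failures: {mean_hours_between_failures(fleet, fit):.0f}")
```

With these placeholder inputs the fleet would see roughly 88 failures per year, or about one every 100 hours. An estimate of this kind is one input when deciding how frequently to test, alongside the tolerance for Silent Data Errors and the maintenance time available.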
The Finding Faulty Components in a Live Fleet Environment technical paper provides considerations for and an example of how often In-Field Scan can be run.
The Intel® In-Field Scan for 5th Gen Intel® Xeon® Processors Enabling Guide has detailed information on how to run a test and understand the results.
The Intel® In-Field Scan Scan Test Images for 5th Generation Intel Xeon processors and instructions for checking the version or loading a new image are posted (NDA account required - How to Apply for an Intel® Resource and Documentation Center).
The Intel® In-Field Scan Application is posted (NDA account required - How to Apply for an Intel® Resource and Documentation Center).
Conclusion
In a fleet with hundreds of thousands, or millions, of processors, failures may occur on a regular basis. Finding these defects as quickly as possible is key to minimizing the interruptions to customer operations.
Intel is leading the industry by providing multiple tools, and a roadmap of features, to test processors for correct operation. Intel® In-Field Scan expands on these testing capabilities to improve fleet management by system administrators.
Intel also offers the Intel® Data Center Diagnostic Tool (Intel® DCDiag). Intel® DCDiag is a suite of tests that methodically checks most of the SoC functionality, including that of each individual microprocessor core. By verifying that every Intel® DCDiag computation is correct, and not just confirming that the test completed execution properly, Intel® DCDiag is able to detect many types of faults, including those that manifest as Silent Data Errors. For more information on Intel® DCDiag, go to this link.
Intel® In-Field Scan and Intel® DCDiag are complementary test tools. Intel® In-Field Scan is minimally intrusive and designed to quickly test one core, while all the other cores in the node continue to run customer workloads. Intel® DCDiag is a comprehensive processor test suite and is most effective when the entire processing node is dedicated to testing. Because the tools run different test content, Intel has found that each tool identifies different failures across the processors tested.
Note: Not all SKUs of the 5th Gen Intel® Xeon® Processors support Intel® In-Field Scan. Check Product Specification details.