This paper discusses the new features and enhancements available in the next generation Intel Xeon processor Scalable family, previously code-named Ice Lake, and how developers can take advantage of them. The 10nm processor provides core microarchitecture changes and new technologies, including the second generation of Intel® Optane™ DC persistent memory, enhanced I/O, hardware-enhanced security, vector bit manipulation instructions, next-generation PCIe, and new Intel® Speed Select Technology capabilities. Data center workloads will benefit from the increased core count, higher instructions per cycle, faster Intel® UPI, larger processor caches, faster memory, and more PCI Express lanes.
The table below compares the second and third generations of the Intel Xeon processor Scalable family. The third generation builds on features found in the second generation; capabilities that are new or enhanced relative to the previous generation appear in the New Features row.
Table 1. The Next-Generation Intel Xeon processor Scalable Family Microarchitecture Overview
| | Second Generation Intel Xeon processor Scalable family with Intel® C620 series chipset | Third Generation Intel Xeon processor Scalable family with Intel® C620A series chipset |
|---|---|---|
| Socket Count | 1, 2, 4, 8 | 1 and 2 |
| Process Technology | 14nm | 10nm |
| Processor Core Count | Up to 28 cores (56 threads with Intel® Hyper-Threading Technology (Intel® HT Technology)) per socket | Up to 40 cores (80 threads with Intel® HT Technology) per socket |
| Cache | First level: 32 KB instruction + 32 KB data; mid-level: 1 MB private per core; last level: 1.375 MB per core | First level: 32 KB instruction + 48 KB data; mid-level: 1.25 MB private per core; last level: 1.5 MB per core |
| TDP | Up to 205W | Up to 270W |
| New Features | Intel® Resource Director Technology (Intel® RDT), Intel® Volume Management Device (Intel® VMD) 1.0, Intel® VROC 6.0, Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® Deep Learning Boost (Intel® DL Boost), Physical Addressing Bits: 46, Virtual Addressing Bits: 48, Intel® SST-BF and Intel® Speed Select Technology - Performance Profile (Intel® SST-PP) on select models | Intel® RDT, Intel® VMD 2.0, Intel® VROC 7.5, Intel® AVX-512, Intel® DL Boost, VBMI, Physical Addressing Bits: 52, Virtual Addressing Bits: 57, Intel® Speed Select Technology - Core Power (Intel® SST-CP), Intel® Speed Select Technology - Turbo Frequency (Intel® SST-TF), Intel® SST-BF and enhanced Intel® SST-PP on select models, Intel® PFR, Converged Boot Guard and Intel® TXT, Crypto Enhancements (2xAES, SHA Extensions, VPMADD52), Intel® Software Guard Extensions (Intel® SGX), Intel® Total Memory Encryption (Intel® TME), PECI 4.0, enhanced power management features |
| Socket Type | Socket P | Socket P+ |
| Memory Controllers / Sub-NUMA Clusters | 2 / 2 | 4 / 2 |
| Memory DDR4 | Up to 6 channels DDR4 per CPU, up to 256GB DIMM capacity, up to 12 DIMMs per socket, up to 2666 MT/s at 2DPC, up to 2933 MT/s at 1DPC | Up to 8 channels DDR4 per CPU, up to 256GB DIMM capacity, up to 16 DIMMs per socket, up to 3200 MT/s at 2DPC |
| Number of Intel® Ultra Path Interconnect (Intel® UPI) Links | Up to 3 links per CPU | Up to 3 links per CPU |
| Intel® UPI Interconnect Speed | Up to 10.4 GT/s | Up to 11.2 GT/s |
| PCIe | PCIe Gen 3: up to 48 lanes per CPU (bifurcation support: x16, x8, x4) | PCIe Gen 4: up to 64 lanes per CPU (bifurcation support: x16, x8, x4) |
| Chipset Features | Intel® QuickAssist Technology (Intel® QAT) | Up to 14 SATA 3, up to 14 USB 2.0, up to 10 USB 3.0 |
Figure 1 – Intel® UPI topology comparison between second and third generation Intel Xeon processor Scalable family on a two socket platform
Enhanced - Asynchronous DRAM Refresh (eADR)
In the first generation of persistent memory, a feature known as ADR (Asynchronous DRAM Refresh) triggers a hardware mechanism on the memory controller that flushes the protected write buffers. This helps ensure that any data that reaches the write pending queue (WPQ) on the memory controller will make it into persistent memory, protecting that data in the event of a power failure. ADR protects data only at the memory subsystem level; it does not flush the processor caches. Cache flushes must be performed by applications using the CLWB, CLFLUSH, CLFLUSHOPT, non-temporal store, or WBINVD machine instructions. This functionality still exists with the second generation of persistent memory, but a new feature has been added: eADR.
eADR extends this protection from the memory subsystem to the processor caches in the event of a power failure. An NMI routine is initiated to flush the processor caches, which can then be followed by an ADR event. Applications using the Persistent Memory Development Kit (PMDK) will detect whether eADR is present and, if so, do not need to perform flush operations. An SFENCE operation is still required to maintain persistence ordering for globally visible stores. eADR does require that the OEM provide additional stored energy, such as a backup battery, specifically to support this functionality.
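The decision an application faces can be sketched in a few lines. This is an illustrative Python model, not PMDK code; the `has_eadr` flag stands in for PMDK's platform detection, and the strings name the machine-level operations discussed above:

```python
def persist_ops(has_eadr: bool) -> list[str]:
    """Operations an application must issue to make a store to
    persistent memory durable (illustrative model only)."""
    ops = []
    if not has_eadr:
        # Without eADR the CPU caches are outside the power-fail
        # protected domain, so each dirty cache line holding the data
        # must be written back explicitly (CLWB / CLFLUSHOPT).
        ops.append("CLWB")
    # With or without eADR, an SFENCE is still required so the stores
    # are globally visible before they are considered persistent.
    ops.append("SFENCE")
    return ops

print(persist_ops(has_eadr=False))  # ['CLWB', 'SFENCE']
print(persist_ops(has_eadr=True))   # ['SFENCE']
```

In practice, PMDK makes this choice for you at runtime when you call its persist functions.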
Intel® Optane™ persistent memory management tools and resources
Linux Management Tools for Intel® Optane™ DC Persistent Memory
ndctl - Manage the Linux LIBNVDIMM kernel subsystem.
pmemcheck - Perform a dynamic runtime analysis with an enhanced version of Valgrind.
FIO - Run benchmarks with FIO.
pmembench - Build and run PMDK benchmarks.
Windows Server manageability is provided by Microsoft through PowerShell.
Persistent Memory Development Zone
Intel® Speed Select Technology (Intel® SST)
[embed]6246300771001[/embed]
Video 1 - Intel Speed Select Technology Overview
Cloud service providers and enterprises often purchase multiple servers to handle diverse workloads and usages. This can increase total cost of ownership due to power consumption, management complexity, the difficulty of keeping all systems fully utilized, and so on. Intel Speed Select Technology provides a collection of features that help improve this situation by providing a way to prioritize processor attributes.
Four variations of the technology can be found on the latest generation of Intel® Xeon® Scalable processors: Intel® Speed Select Technology - Turbo Frequency (Intel® SST-TF), Intel® Speed Select Technology - Core Power (Intel® SST-CP), Intel® Speed Select Technology - Performance Profile (Intel® SST-PP), and Intel® Speed Select Technology - Base Frequency (Intel® SST-BF). Their availability is limited to the network-focused processor models and to Gold 5300 and higher models. The features are discoverable using software tools enabled for Linux.
Intel® Speed Select Technology-Turbo Frequency (Intel® SST-TF)
Intel SST-TF is built upon the foundation of Intel SST-CP, so the two features are closely related. Intel SST-TF allows some of the cores to be tagged as high priority, providing them with a turbo frequency that exceeds the nominal turbo frequency limits when all cores are active. The overall frequency envelope for the processor socket stays the same, but how much frequency a specific core receives can deviate from the specified all-core turbo frequency values. All the cores on the processor socket remain active, but the envelope shifts: high-priority cores receive higher frequencies, while low-priority cores drop proportionally in frequency to compensate. The feature can be turned on, turned off, or adjusted on a per-core basis at runtime.
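As a toy model of that envelope shift (illustrative numbers only, not actual SKU frequencies): hold the socket's total frequency budget constant, boost the tagged cores, and drop the rest proportionally:

```python
def shift_envelope(n_cores, all_core_turbo_mhz, high_priority, boost_mhz):
    """Toy model of Intel SST-TF: high-priority cores each gain
    `boost_mhz`; the remaining cores drop evenly so that the total
    socket frequency envelope stays constant."""
    budget = n_cores * all_core_turbo_mhz
    n_high = len(high_priority)
    n_low = n_cores - n_high
    drop = (n_high * boost_mhz) / n_low  # low-priority cores compensate
    freqs = []
    for core in range(n_cores):
        if core in high_priority:
            freqs.append(all_core_turbo_mhz + boost_mhz)
        else:
            freqs.append(all_core_turbo_mhz - drop)
    assert abs(sum(freqs) - budget) < 1e-6  # envelope unchanged
    return freqs

# 8 cores at a 3000 MHz all-core turbo; cores 0 and 1 tagged +400 MHz
print(shift_envelope(8, 3000.0, {0, 1}, 400.0))
```

The two tagged cores run at 3400 MHz while the other six drop by about 133 MHz each, leaving the total unchanged.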
A potential use case is to combine this feature with Global Extensible Open Power Management (GEOPM). GEOPM is an open source power management runtime and framework focused on improving the performance and efficiency of HPC workloads. Typically, when an HPC workload schedules a job, it does so in a serialized fashion. Intel SST can raise the turbo frequency of that single thread beyond its normal limit to maximize performance, even when C-states are disabled on the platform.
Figure 2. Illustration of single high priority core receiving additional turbo frequency when a surplus of turbo frequency is available.
Intel SST and GEOPM can also provide a benefit when the workload shifts into parallel hardware threads during the execution phase. Invariably, some of the parallel threads will run slower than others; these threads can be given additional turbo frequency as needed.
Beyond GEOPM, Intel SST-TF can be applied in any asymmetric or heterogeneous workload scenario. It can also help in traditional cloud environments, where virtual machines requiring higher service-level agreements can be scheduled on Intel SST-TF enabled high-priority cores while general-purpose virtual machines run on the remaining cores.
Intel® Speed Select Technology - Core Power (Intel® SST-CP)
Figure 3. Illustration showing an overview of Intel SST-CP in operation
Intel SST-CP allows cores to be configured into priority groups, each identified by a specific class of service characterized by a priority weight and minimum and maximum frequencies. When a surplus of processor power/frequency is available, the power control unit (PCU) distributes it among the cores based on the weights that have been assigned to them. In Figure 3 you can see that the higher-priority core receives the surplus frequency first. When Intel SST-CP is disabled, the PCU simply distributes the surplus across the cores without any concern for the priority of the work being done by a given core.
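The distribution policy can be illustrated with a toy model. This is not how the PCU is actually implemented; it simply shows surplus frequency being handed out in priority order while respecting each core's class-of-service maximum, as described above:

```python
def distribute_surplus(cores, surplus_mhz):
    """Toy model of Intel SST-CP surplus distribution: hand out
    surplus frequency to cores in descending priority-weight order,
    clamping each core at its class-of-service maximum."""
    for core in sorted(cores, key=lambda c: -c["weight"]):
        grant = min(surplus_mhz, core["max"] - core["freq"])
        core["freq"] += grant
        surplus_mhz -= grant
        if surplus_mhz <= 0:
            break
    return cores

high = {"freq": 2000.0, "max": 3600.0, "weight": 10}
low = {"freq": 2000.0, "max": 3600.0, "weight": 1}
distribute_surplus([high, low], 2000.0)
print(high["freq"], low["freq"])  # 3600.0 2400.0
```

The high-weight core is satisfied up to its maximum first; only the leftover surplus reaches the low-weight core.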
To get the best results with Intel SST-CP, the placement of prioritized cores needs consideration. For example, spreading the physical location of high-priority cores across NUMA nodes on a processor socket helps minimize thermal issues. This feature requires workload-based tuning.
Communications workloads are moving toward virtual network functions. In such use cases, some of the higher-performing virtual functions can end up forming bottlenecks on just a few cores. Intel SST-CP can help alleviate pressure on those stressed cores and enhance the usage of the virtual network functions.
Intel® Speed Select Technology - Performance Profile (Intel® SST-PP)
Fig 4. Intel SST-PP graph representing three different profiles and associated vectors. Frequency, core count and TDP for illustration only
Intel SST-PP allows the processor socket to be divided into three different profiles, each with a different set of processor attributes. On the latest generation of Intel Xeon Scalable processors, the profile structure includes: number of active cores, core mask, Streaming SIMD Extensions (SSE) base frequency, Intel® Advanced Vector Extensions 2 (Intel® AVX2) frequency, uncore frequency, memory frequency, Thermal Design Power (TDP), and temperature. Users can set up the profiles at boot time in the BIOS or define them at runtime.
This feature allows one processor model to provide multiple variations to meet the requirements of multiple SLAs, or to optimize the hardware for multiple workloads. A specific use case might be a media studio that uses servers for Virtual Desktop Infrastructure (VDI) during the morning and afternoon while rendering at night. VDI deployments typically require fewer cores operating at higher performance, whereas rendering applications typically perform better with many cores running in parallel. Intel SST-PP allows the server to be reconfigured into a more optimized hardware configuration throughout the day. The feature may also reduce qualification costs by allowing validation testing on fewer systems.
Intel® Speed Select Technology – Base Frequency (Intel® SST-BF)
Figure 5. Intel SST-BF example where base frequency is divided into two different priority groups and how this alteration relates to the standard base frequency without Intel SST-BF. Core count and frequency for illustration only.
Intel SST-BF lets you control and direct the base SSE frequency of the processor cores into two different levels of priority. You can apply a higher base SSE frequency to a set of cores that you deem higher priority, and these cores can serve your most critical workloads, while a secondary group of processor cores is given a lower priority and a lower base SSE frequency. This gives you finer control over your platform to deal more dynamically with your usage model when you need it. Network function virtualization is a possible application for this feature: you might assign a load balancer to the higher-priority cores and a router to the lower-priority cores. In the latest generation of Intel Xeon Scalable processors, this feature is fully runtime capable.
Enabling Intel® Speed Select Technology
[embed]6246300766001[/embed]
Video 2 - Intel Speed Select Technology Configuration
Configuration can be done through third-party orchestration software using the Intel Speed Select Technology tool and driver that are part of the latest Linux kernel tools repository, or by directly interfacing with the hardware-exposed register interfaces. Python scripts available on GitHub also ease provisioning with Intel SST-BF.
Intel Speed Select Technology tools and resources
Intel® Speed Select Technology – Base Frequency Priority CPU Management for Open vSwitch (OVS)
Intel® Speed Select Technology – Base Frequency - Enhancing Performance
Intel® Speed Select Technology - Base Frequency (Intel® SST-BF) with Kubernetes - deployment guide
Intel® Speed Select Technology - Base Frequency (Intel® SST-BF) – source code / API for splitting CPU cores into shared or dedicated sets on Openstack
Intel® Speed Select Technology - Base Frequency (Intel® SST-BF) – enabling guide (Bios and OS)
Intel® Virtual RAID on CPU (Intel® VROC)
Figure 6. Intel® VROC replaces RAID add-on cards
Intel VROC is a software solution that integrates with a hardware technology called Intel® Volume Management Device (Intel® VMD) to provide a compelling hybrid RAID solution for NVMe (Non-Volatile Memory Express) solid-state drives (SSDs). The CPU has onboard capabilities that work closely with the chipset to provide quick access to directly attached NVMe SSDs on the platform's PCIe lanes. Because Intel VROC is an integrated RAID solution leveraging technologies within the platform hardware, features like hot insert and bootable RAID are available even if the OS doesn't provide them. This robust NVMe ecosystem with RAID and SSD management capabilities is a compelling alternative to RAID HBAs, helping reduce platform BOM costs and better preparing users to move to NVMe SSDs.
Intel VMD is a technology designed primarily to improve the management of high-speed SSDs. Previously, SSDs were attached via SATA or other interface types, and managing them through software or a discrete HBA was acceptable. As the industry moved toward faster NVMe SSDs over a PCIe interface to improve bandwidth, the discrete HBA added delays and bottlenecks, while a pure software solution was incomplete for most enterprise users. Intel VMD with Intel VROC uses hardware and software together to mitigate these issues.
New enhancements include support for bootable RAID 1 on PCH-attached NVMe SSDs, support for self-encrypting drives (UEFI only), optimized RAID stripe sizes for QLC SSDs with large indirection units, and integrated caching optimized for Intel® Optane™ SSDs (Linux only), powered by Open CAS, an open source caching driver.
Intel VMD and Intel VROC have been around since the first generation of Intel Xeon Scalable processors. Additional information can be found on the Intel VROC website or the Intel VROC support site.
Intel® Platform Firmware Resilience (Intel® PFR)
Intel Platform Firmware Resilience is designed to protect against, detect, and correct security threats such as permanent denial of service (PDoS) attacks. In a PDoS attack, the hardware is attacked with the intent of rendering the system permanently inoperable, for example by corrupting the system firmware in a manner that is not recoverable. This is a growing threat against critical infrastructure systems such as those associated with the power grid, banks, and other utilities.
Intel PFR uses a built-in Intel® MAX® 10 FPGA to improve protection against security threats. The FPGA, along with soft IP, is used as the primary root of trust. The soft IP enables visibility and flexibility in the design, allowing for optional customizations to deal with changes in hardware, firmware, or customer needs. This flexibility is valuable, for example, when switching to a different BIOS chip manufacturer. The FPGA helps protect the firmware by attesting that it is safe prior to executing the code. It also engages in boot and runtime monitoring to assure the server is running known good firmware for various aspects of the system, such as the BIOS, BMC, Intel ME, SPI descriptor, and the firmware on the power supply. One of the more interesting aspects is that the FPGA can provide automated recovery if corrupted firmware is detected, which previously required manual intervention.
Intel PFR meets the NIST 800-193 specification for firmware resiliency. Support for the feature is included in the Intel® Security Libraries for Data Center (Intel® SecL - DC).
Intel® Security Libraries for Data Center (Intel® SecL - DC)
Intel SecL - DC consists of software components providing end-to-end cloud security solutions with integrated libraries. Users have the flexibility to either develop their customized security solutions with the provided libraries or deploy the software components in their existing infrastructure.
Intel SecL - DC supports several new security features. Through the Platform Configuration Registers (PCRs) within the Trusted Platform Module (TPM), the remote attestation functionality can be extended to include files and folders on a Linux host system, which are then included in determining the host's overall trust status. In addition, virtual machine and Docker container images can be encrypted at rest, with key access tied to platform integrity attestation. Because security attributes contained in the platform integrity attestation report are used to control access to the decryption keys, this feature protects at-rest data, IP, and code in Docker container and virtual machine images.
Memory Encryption
Intel® Total Memory Encryption (Intel® TME) is a new security enhancement available with the third generation Intel Xeon processor Scalable family. This feature protects the DDR4 platform memory against hardware attacks such as cold boot, freeze spray, or DIMM removal. A single CPU-generated key allows all of the system memory to be encrypted. It is a turn-key security solution enabled directly in the system BIOS and does not require any software enabling.
Intel® Total Memory Encryption – Multi-Tenant (Intel® TME-MT) is an additional new enhancement to help virtual machine managers separate and encrypt individual virtual machines or containers. Each virtual machine is able to cryptographically isolate itself from other virtual machines using AES 128-XTS. Up to 64 virtual machines can be protected in this manner. This feature is enabled in the BIOS and requires OS/VMM support.
Intel® Software Guard Extensions (Intel® SGX)
Fig 7. Block diagram of Intel SGX providing an encrypted memory space to help protect data.
Intel SGX is a set of instructions that increases the security of application code and data, giving them more protection from unauthorized disclosure or modification. Developers can partition sensitive information into enclaves, which are areas of execution in encrypted memory with more security protection.
Fig 8. Block diagram showing encrypted memory spaces (enclaves) residing within applications that are protected from unauthorized access by various levels of software
Use cases for Intel SGX include encrypted database operations, running unmodified applications within an enclave, protection of keys on the local files system, and enabling multi-party joint computation on sensitive data in a privacy preserving manner.
From a programming perspective, there are two ways to enable Intel SGX in an application. The first is to create the enclave inside the application, which requires modifying the application to utilize the enclave. The second is to place the application, its sensitive data, and any optional operating system components into the enclave. The second option is better suited to ease of adoption because the application does not have to be modified. Intel SGX is a well-researched and tested solution that has been available for some time and has developed tools and libraries to make adoption easier.
Intel SGX has been enhanced for the third generation Intel Xeon processor Scalable family; this includes an enclave size of up to 512GB per socket to help support data center workloads. Developers can control which enclaves can be launched and have full control over the attestation stack. Lastly, it is compatible with applications written for other variants of Intel SGX and with Intel solutions such as Intel SecL - DC and Intel® Datacenter Attestation Primitives. Note that Intel SGX on the third generation Intel Xeon processor Scalable family requires that Intel TME and Intel TME-MT be enabled in the BIOS.
Memory Capacity Expansion
The physical and virtual address limits have increased on the next generation Intel Xeon processor Scalable family. This change allows much larger memory addressability for both physical addresses and virtual machine guest addresses. The previous generation of the architecture supported only 46-bit physical addresses and 48-bit guest virtual addresses via 4-level extended page tables (EPT). The EPT is now capable of 5 levels of paging, and the address width has increased to 52 bits for physical addressing and 57 bits for virtual addressing. From a software perspective, this enhancement requires support from the operating system or the virtual machine monitor, while applications typically benefit from the feature without any modification unless they are coded with a 48-bit pointer limit. The 5-Level Paging and 5-Level EPT whitepaper provides additional insight into this addressability change.
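The gain in addressability from those extra bits is easy to quantify:

```python
# Addressable bytes implied by the address widths discussed above.
TiB, PiB = 1 << 40, 1 << 50  # binary terabyte / petabyte

print((1 << 46) // TiB, "TiB physical (previous, 46-bit)")  # 64
print((1 << 52) // PiB, "PiB physical (new, 52-bit)")       # 4
print((1 << 48) // TiB, "TiB virtual (previous, 48-bit)")   # 256
print((1 << 57) // PiB, "PiB virtual (new, 57-bit)")        # 128
```

So the physical limit grows from 64 TiB to 4 PiB, and the virtual limit from 256 TiB to 128 PiB.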
Vector Bit Manipulation Instructions (VBMI)
In-memory databases typically need to minimize the amount of memory space they occupy, so they use data compression to achieve the smallest footprint. The downside to this approach is that compression and decompression must occur when accessing the data in the database. VBMI are new instructions supported by the latest generation of Intel Xeon Scalable processors as part of Intel® Advanced Vector Extensions 512 (Intel® AVX-512). These instructions were introduced to help improve the decompression and compression aspects of in-memory databases and lossless file compression. Data sizes used by these applications are typically 8, 16, 32, or 64 bits. VBMI can repartition data into power-of-two-sized chunks, providing inline compression while feeding small data chunks to algorithms for immediate operation. There are no additional enabling requirements beyond existing Intel® AVX-512 Foundation (AVX-512F) support. For ease of adoption, general compression and decompression routines are available in the Intel® Integrated Performance Primitives (Intel® IPP) library.
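A scalar sketch of the kind of bit-unpacking such databases perform during decompression, which VBMI byte-granularity permutes can accelerate a full 512-bit register at a time. This is illustrative Python only; real deployments use intrinsics or Intel IPP:

```python
def unpack_bits(packed: bytes, width: int) -> list[int]:
    """Unpack a stream of `width`-bit integers from packed bytes,
    the scalar equivalent of what a VBMI-accelerated decompressor
    does a whole vector register at a time."""
    total = (len(packed) * 8) // width
    bits = int.from_bytes(packed, "little")
    mask = (1 << width) - 1
    return [(bits >> (i * width)) & mask for i in range(total)]

# Four 6-bit values packed into three bytes
values = [5, 17, 42, 63]
bits = sum(v << (i * 6) for i, v in enumerate(values))
packed = bits.to_bytes(3, "little")
print(unpack_bits(packed, 6))  # [5, 17, 42, 63]
```

The vector instructions replace this per-element loop with a single permute over 64 bytes, which is where the decompression speedup comes from.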
Cryptographic Acceleration
Intel has a longstanding practice of enhancing cryptographic cipher performance through the introduction of new instructions, improvements to processor microarchitecture, and innovative algorithm optimizations. The latest Intel Xeon Scalable processors are no exception, providing performance improvements for a variety of widely adopted public key ciphers, symmetric ciphers, and cryptographic hash algorithms.
Public key cryptography is a category of cryptographic ciphers widely used for authentication and key exchange when establishing a secure TLS connection between two systems. Based on large-integer math, these ciphers require compute-intensive multiplication and squaring primitives to support the cryptographic algorithm. The new AVX-512 Integer Fused Multiply Add (IFMA) VPMADD52 instructions support efficient large-number multiplication with a four-fold increase in parallelism over previous architectures. RSA, ECDSA, and ECDHE public key cipher performance can be improved by incorporating these instructions into your specific algorithm's computation primitives. Used in combination with cryptographic multi-buffer processing techniques, these instructions can yield significant performance improvements.
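Why a 52-bit multiply-add helps big-number math can be sketched in scalar Python: the large integer is split into radix-2^52 limbs so that every limb product fits in the 104-bit result the VPMADD52 low/high instruction pair accumulates. This is an illustration only; production code uses AVX-512 IFMA intrinsics, typically across multiple buffers in parallel:

```python
RADIX_BITS = 52
MASK = (1 << RADIX_BITS) - 1

def to_limbs(x: int, n: int) -> list[int]:
    """Split a big integer into n radix-2^52 limbs, low limb first."""
    return [(x >> (i * RADIX_BITS)) & MASK for i in range(n)]

def from_limbs(limbs: list[int]) -> int:
    """Recombine (possibly un-normalized) limbs into one integer."""
    return sum(l << (i * RADIX_BITS) for i, l in enumerate(limbs))

def limb_mul(a: int, b: int, n: int) -> int:
    """Schoolbook multiply over 52-bit limbs. Each partial product
    a_i * b_j fits in 104 bits, matching the low/high halves that
    VPMADD52LUQ / VPMADD52HUQ accumulate in hardware."""
    al, bl = to_limbs(a, n), to_limbs(b, n)
    acc = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            acc[i + j] += al[i] * bl[j]
    return from_limbs(acc)

a, b = 2**200 + 12345, 2**190 + 67890
assert limb_mul(a, b, 4) == a * b  # matches arbitrary-precision multiply
```

The four-fold parallelism claim comes from packing eight such 52-bit limb products into one 512-bit register operation instead of handling 64-bit limbs serially.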
Advanced Encryption Standard (AES) symmetric ciphers can be optimized to take advantage of Vectorized AES-NI. When used with 512-bit wide registers, one can process up to four AES 128-bit blocks per instruction delivering a significant improvement in bulk encryption throughput in a variety of modes (Ex. AES-GCM).
Cryptographic hash algorithms benefit from the introduction of vectorized carry-less multiply (CLMUL) and the SHA Extensions added to the architecture. Vectorized CLMUL provides significant throughput gains for processing Galois Hash (GHASH), and with specific instructions added to support SHA-256, performance is significantly improved over previous Intel Xeon Scalable processor architectures. These new instructions are compatible with the Data Plane Development Kit (DPDK), the Intel OpenSSL Engine, the Intel Intelligent Storage Acceleration Library (ISA-L), the IPSec Multi-Buffer Library, and the IPP Multi-Buffer Library.
Intel® Resource Director Technology (Intel® RDT)
Intel® Resource Director Technology (Intel® RDT) is a set of technologies designed to help monitor and manage shared resources. Intel RDT already includes several features from previous processor generations, such as Cache Monitoring Technology (CMT), Cache Allocation Technology (CAT), Memory Bandwidth Monitoring (MBM), and Code and Data Prioritization (CDP).
In the third generation Intel Xeon processor Scalable family, Memory Bandwidth Monitoring (MBM) has been enhanced. MBM can examine memory bandwidth activity on a per-thread, per-application, or per-VM basis. This helps to profile the memory bandwidth usage of the software running on the platform and to determine which components demand more memory bandwidth than others. In the previous iteration of this feature, the counter resolution was limited to 24 bits, while the latest generation of Intel Xeon Scalable processors supports up to 32 bits. The 32-bit counter takes much longer to overflow, so the counters can be read less often without the risk of losing data. With the 24-bit counter, reads normally had to occur less than a minute apart, while the 32-bit counter allows reads to happen every few hours.
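The practical effect of the wider counter is simple arithmetic:

```python
# A 32-bit counter wraps 2^(32-24) = 256x more slowly than a 24-bit
# counter at the same memory bandwidth. Taking the text's ~1 minute
# read interval for the 24-bit counter as the baseline:
ratio = 2 ** (32 - 24)
print(ratio)                 # 256
print(round(ratio / 60, 1))  # 4.3 hours between reads for 32 bits
```

That 256x stretch is why a once-a-minute polling requirement relaxes to one read every few hours.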
Intel RDT is compatible with Intel® VTune™ Profiler and Intel-CMT-CAT.
The Author: David Mulnix is a software engineer who has been with Intel Corporation for over 20 years. His areas of focus include software automation, server power, and performance analysis, and he has contributed to the development and support of the Server Efficiency Rating Tool™.
Contributors: Vasudevan Srinivasan, Dan Zimmerman, Kartik Ananth, Andy Rudoff, John Mechalas and Khawar Abbasi
Notices/Disclaimers
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.