Introducing Intel® Advanced Performance Extensions (Intel® APX)

ID 784404
Updated 10/31/2024

Authors

Sebastian Winkel

Jason Agron

Overview

Intel® architecture powers data centers and personal computers around the world. Since its introduction by Intel in 1978, the architecture has continuously evolved to take advantage of emerging workloads and the relentless pace of Moore’s law (the idea that the number of transistors in an integrated circuit doubles every two years). The original instruction set defined only eight 16-bit general-purpose registers, which doubled in number and quadrupled in size over time. A large set of vector registers was added, and most recently Intel® Advanced Matrix Extensions (Intel® AMX) introduced two-dimensional matrix registers, providing a big jump in AI performance.[1]

Today, we introduce the next major step in the evolution of Intel architecture. Intel® Advanced Performance Extensions (Intel® APX) expands the entire x86 instruction set with access to more registers and adds new features that improve general-purpose performance. The extensions provide efficient performance gains across a variety of workloads without significantly increasing the silicon area or power consumption of the core.

Features

Intel APX doubles the number of general-purpose registers (GPRs) from 16 to 32. This allows the compiler to keep more values in registers. As a result, code compiled with Intel APX contains 10% fewer loads and more than 20% fewer stores than the same code compiled for an Intel® 64 baseline.[2] Register accesses are not only faster, but they also consume significantly less dynamic power than complex load and store operations.

Compiler enabling is straightforward: A new REX2 prefix provides uniform access to the new registers across the legacy integer instruction set. Intel® Advanced Vector Extensions (Intel® AVX) instructions gain access via new bits defined in the existing EVEX prefix. In addition, legacy integer instructions can now also use EVEX to encode a dedicated destination register operand, turning them into three-operand instructions and reducing the need for extra register move instructions. While the new prefixes increase average instruction length, there are 10% fewer instructions in code compiled with Intel APX,[2] resulting in code density similar to before.
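
To make the three-operand point concrete, here is a minimal C sketch. The function name is hypothetical, and the instruction sequences in the comments are illustrative assumptions rather than actual compiler output. Both inputs stay live after the subtraction, so a two-operand encoding needs a register copy first, whereas a form with a dedicated destination does not.

```c
#include <stdint.h>

/* Hypothetical example: a and b are both still needed after d is computed. */
static uint64_t diff_xor_common(uint64_t a, uint64_t b)
{
    /* Two-operand form: SUB overwrites its first source, so a copy of a
     * is needed beforehand (roughly: mov tmp, a ; sub tmp, b).
     * Three-operand form: a single instruction could write d directly
     * (roughly: sub d, a, b), leaving a and b intact. */
    uint64_t d = a - b;

    /* a and b are read again here, so neither could have been clobbered. */
    return d ^ (a & b);
}
```

For example, `diff_xor_common(10, 3)` computes `7 ^ 2`. The saved register move is the kind of instruction that the lower instruction count refers to.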

The new GPRs are XSAVE-enabled, which means that they can be automatically saved and restored by XSAVE/XRSTOR sequences during context switches. They do not change the size and layout of the XSAVE area as they take up the space left behind by the deprecated Intel® Memory Protection Extensions (Intel® MPX) registers.
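
The space argument can be checked with simple arithmetic. The sketch below is illustrative only: the two Intel MPX sizes are assumptions taken from the publicly documented XSAVE state components (64 bytes each for the bound registers and for the configuration/status component), not figures stated in this article, and all identifiers are hypothetical.

```c
#include <stddef.h>

/* Sizes in bytes. The MPX figures are assumptions, see the text above. */
enum {
    NEW_GPR_COUNT     = 16, /* GPRs added by Intel APX (16 -> 32) */
    GPR_SIZE_BYTES    = 8,  /* each GPR is 64 bits wide */
    MPX_BNDREGS_BYTES = 64, /* assumed: four 128-bit bound registers */
    MPX_BNDCSR_BYTES  = 64  /* assumed: space reserved for MPX config/status */
};

/* Bytes needed to save the 16 additional GPRs. */
static size_t apx_gpr_state_bytes(void)
{
    return (size_t)NEW_GPR_COUNT * GPR_SIZE_BYTES;
}

/* Bytes previously occupied by the two MPX state components. */
static size_t mpx_state_bytes(void)
{
    return (size_t)MPX_BNDREGS_BYTES + MPX_BNDCSR_BYTES;
}

/* Nonzero if the new GPR state fits in the space MPX used to occupy. */
static int apx_fits_in_mpx_space(void)
{
    return apx_gpr_state_bytes() <= mpx_state_bytes();
}
```

Under these assumptions both totals come to 128 bytes, which is consistent with the statement that the size and layout of the XSAVE area do not change.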

We propose to define the new GPRs as caller-saved (volatile) states in application binary interfaces (ABIs), facilitating interoperability with legacy binaries. Optimized calling conventions can be introduced where legacy compatibility requirements are relaxed. Generally, more register states will need to be managed at function boundaries. To reduce the associated overhead, we are adding PUSH2/POP2 instructions that transfer two register values within a single memory operation. The processor tracks these new instructions internally and fast-forwards register data between matching PUSH2 and POP2 instructions without going through memory.

Impact

The performance features introduced so far will have a limited impact on workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates the performance of such workloads. Branch predictor improvements can mitigate this only to a limited extent as data-dependent branches are fundamentally hard to predict.

To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for the broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).
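
As a minimal illustration of if-conversion, the sketch below shows the same loop written with a data-dependent branch and in a branch-free form. The function names are hypothetical, and whether a given compiler emits a branch, a CMOV, or a SET for either version depends on the compiler and its options.

```c
#include <stddef.h>

/* Branchy form: one conditional branch per element, which is hard to
 * predict when the data is effectively random. */
static size_t count_above_branchy(const int *v, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold)
            count++;
    }
    return count;
}

/* If-converted form: the comparison result (0 or 1) is consumed as data,
 * so the loop body contains no data-dependent branch. */
static size_t count_above_ifconverted(const int *v, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] > threshold);
    return count;
}
```

Both functions return the same count for any input. This simple case is already within reach of SET-style instructions; the limitation discussed above shows up once the conditional region contains memory accesses or more than a single value selection.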

Intel APX adds conditional forms of load, store, and compare/test instructions and adds an option for the compiler to suppress the status flag writes of common instructions. These enhancements expand the applicability of if-conversion to much larger code regions, cutting down on the number of branches that may incur misprediction penalties. All of these conditional instruction set architecture (ISA) improvements are implemented through EVEX prefix extensions of existing legacy instructions.
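
The following sketch suggests why conditional loads and stores matter for if-conversion. It assumes, for illustration, that a conditional load or store performs no memory access when its condition is false. Without such instructions, a guarded memory access cannot simply be replaced by an unconditional access plus a value select, because the access itself may be invalid. The function names are hypothetical, and the address-selection workaround shown is a common software technique, not a description of how Intel APX is implemented.

```c
#include <stddef.h>

/* Branchy original: the load happens only when p is non-null. */
static int load_or_default_branchy(const int *p, int fallback)
{
    if (p != NULL)
        return *p;
    return fallback;
}

/* Branch-free workaround without a conditional load: select a safe
 * address first, then load unconditionally. Loading *p first and then
 * selecting the value would dereference NULL. */
static int load_or_default_select(const int *p, int fallback)
{
    const int *src = (p != NULL) ? p : &fallback;
    return *src;
}

/* Guarded store: an unconditional write to dst would be incorrect when
 * cond is false, so the write is redirected to a local dummy slot. */
static void store_if(int cond, int *dst, int value)
{
    int dummy;
    int *target = cond ? dst : &dummy;
    *target = value;
}
```

A conditional load or store instruction would let the compiler express the guarded access directly, without the extra address selection and the dummy location.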

Application developers can take advantage of Intel APX through simple recompilation; no source code changes are expected to be needed. Workloads written in dynamic languages will automatically benefit as soon as the underlying runtime system has been enabled.

Intel APX demonstrates the advantage of the variable-length instruction encodings of x86: New features enhancing the entire instruction set can be defined with only incremental changes to the instruction-decode hardware. This flexibility has allowed Intel architecture to adapt and flourish over four decades of rapid advances in computing, and it enables the innovations that will keep it thriving into the future.

References

Footnotes

  1. Intel® Advanced Matrix Extensions Overview
  2. This projection is based on a prototype simulation of the SPEC CPU® 2017 Integer benchmark suite. SPEC®, SPECrate®, and SPEC CPU® are registered trademarks of the Standard Performance Evaluation Corporation.