Story at a Glance
- On January 12, 2023, LLVM* 15.0.7 was released, continuing Intel’s long history of working with the LLVM community to contribute innovative and performance-enhancing optimizations to the LLVM open source project.
- The LLVM optimizations target the newly launched 4th Gen Intel® Xeon® Scalable Processors and Intel® Xeon® CPU Max Series (formerly code-named Sapphire Rapids) and consist of:
- Instruction Set Architecture (ISA) support for Intel® Advanced Matrix Extensions (Intel® AMX), Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with FP16, Intel® Advanced Vector Extensions (Intel® AVX) with Vector Neural Network Instructions (VNNI), User Interrupts (UINTR), and more
- The new _Float16 type, extended to all x86 targets
- Enhanced tile automatic configuration and register allocation for Intel AMX intrinsics, stabilizing the programming model
- Introduction of a new attribute, general-regs-only, with UINTR to improve performance in interrupt handling
- Improved dot-product performance for byte and short vectors
- Applications that take advantage of the extended new type, interface and intrinsics, and enhanced vectorizations can realize performance gains for workloads in 5G, deep learning, system software, and many more.
Intel® AVX-512 with an FP16 Instruction Set
Intel AVX-512 with FP16 is a comprehensive floating-point instruction set extension for the FP16 data type, comparable to the existing FP32 and FP64 support. It supports the complete set of arithmetic operations on the IEEE 754 binary16 floating-point type.
Benefit: Compared to FP32 and FP64 floating-point formats, FP16 gives increased execution throughput and reduced storage by reducing the data range and precision. The programmer needs to decide whether the FP16 type is suitable for their applications.
Use Cases:
Users can use the new _Float16 type like other floating-point types such as float and double when the option -march=sapphirerapids is specified. Users can take advantage of vector instructions through either compiler auto-vectorization or hundreds of newly added intrinsics.
An Example of a Scalar Arithmetic Operation
Compiled with the command clang -S -march=sapphirerapids -O2
The example demonstrates how _Float16 can be used like a traditional float or double. AI workloads can benefit from the usability of the new type.
An Example of Compiler Auto-Vectorization
Compiled with the command clang -S -march=sapphirerapids -O2 -mprefer-vector-width=512
Auto-vectorization supports the _Float16 type, too. Because _Float16 is half the size of float, vector instructions of the same width process twice as many elements, improving throughput.
An Example of Using New Intrinsics
Compiled with the command clang -S -march=sapphirerapids -O2
The newly added intrinsics are much like the existing ones in both naming convention and arguments. Three new vector types __m128h, __m256h, and __m512h were introduced for these new intrinsics. To learn more about intrinsics, see the Intel® Intrinsics Guide.
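A minimal sketch using one of the new intrinsics (the wrapper function name is illustrative; building requires -march=sapphirerapids or -mavx512fp16):

```c
#include <immintrin.h>

// Minimal sketch: the new FP16 intrinsics follow the familiar _mm512_*
// naming conventions and operate on the new __m512h vector type.
__m512h add_vectors(__m512h a, __m512h b) {
    // 32 half-precision additions in one instruction (vaddph)
    return _mm512_add_ph(a, b);
}
```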
More information about the type and its ISA usage can be found in the Technology Guide.
_Float16 Support for Targets without the Intel® AVX-512 with FP16 Feature
We expanded LLVM compiler support of the _Float16 type to all modern x86 targets that support Intel® Streaming SIMD Extensions 2 (Intel® SSE2) through software emulation.
Benefit: Users can now develop and run applications that use the _Float16 type across a variety of Intel architecture systems, even those without Intel AVX-512 with FP16 support. Recompiling for a 4th gen Intel Xeon Scalable processor then unlocks additional performance.
Use Case: On systems without features in Intel AVX-512 with FP16, the compiler relies on new libgcc (version 12 and above) or compiler-rt (version 14 and above) for type conversion between _Float16 and float. Alternatively, users may use their own libraries. The linker reports a failure if none of these libraries are available.
To accelerate the emulation on previous-generation target systems, we provide vectorization support for Intel-based systems that support F16C. Note that the Intel AVX-512 with FP16 intrinsics are not supported on these previous-generation platforms.
An Example of a Scalar Arithmetic Operation
Compiled with the command clang -S -msse2 -O2
With the -msse2 option, the compiler generates a function call to library routine __extendhfsf2 to extend the _Float16 type to the float type, emulates the addition with the float type, and then truncates it back to the _Float16 type through a call to __truncsfhf2.
An Example of Compiler Auto-Vectorization
Compiled with the command clang -S -mf16c -O2
With the -mf16c option, the vectorization is able to take advantage of F16C instructions for type conversion and generates assembly with better performance.
Intel® Advanced Matrix Extensions
Intel AMX is a new 64-bit programming paradigm that consists of two components:
- A set of two-dimensional registers (tiles) that represent subarrays from a larger two-dimensional memory image
- An accelerator able to operate on tiles
The first implementation is called the tile matrix multiply unit (TMUL). Details of the Intel AMX ISA can be found in ISA Extensions. Intel AMX helps accelerate matrix multiply computations, which are widely used in deep learning workloads. Using these new Intel AMX instructions can provide additional performance gains.
In LLVM v13, we added support for the Intel AMX programming model, which helps developers accelerate matrix multiply operations.
In LLVM v14, we enhanced the back end to better support the Intel AMX programming model in C/C++ and SYCL*. This enabled SYCL and the multilevel intermediate representation (MLIR), which has been adopted by popular deep learning frameworks (such as TensorFlow* and PyTorch*), to extend their languages on top of this infrastructure. (A more detailed look at the Intel AMX support in SYCL and its use can be found in LLVM GitHub* from Intel and in an IEEE article that covers implementation details.)
LLVM v15 enhanced the tile autoconfiguration and register allocation, and stabilized the programming model.
Benefit: By using the new Intel AMX programming paradigm, the performance of matrix math operations is greatly accelerated on the CPU for applications such as AI and machine learning.
Use Case: At the core of HPC and AI and machine learning applications is matrix math. The extension is designed for operating on matrices with the goal of accelerating the most prominent use case for the CPU in AI and machine learning, inference, with more capabilities for training.
The following code is an example of Intel AMX use. It produces a dot product of a row of matrix A and a column of matrix B and accumulates the result into a tile of matrix C.
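A hypothetical sketch of such a kernel, using the compiler-managed __tile1024i type (tile dimensions and function name are illustrative; building requires -march=sapphirerapids):

```c
#include <immintrin.h>
#include <stdint.h>

// Minimal sketch: each __tile1024i is a two-dimensional tile register.
// __tile_dpbssd accumulates c += a * b over signed int8 data; the
// compiler handles tile configuration and register allocation.
void tile_dot(const int8_t *A, const int8_t *B, int32_t *C) {
    __tile1024i a = {16, 64};  // 16 rows x 64 bytes of int8 from matrix A
    __tile1024i b = {16, 64};  // matrix B, pre-packed in VNNI layout
    __tile1024i c = {16, 64};  // 16 rows x 16 int32 accumulators

    __tile_loadd(&a, A, 64);   // load tiles; the stride is in bytes
    __tile_loadd(&b, B, 64);
    __tile_loadd(&c, C, 64);
    __tile_dpbssd(&c, a, b);   // c += a * b (int8 dot products)
    __tile_stored(C, 64, c);   // store the accumulated tile back to C
}
```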
With the command clang -S -march=sapphirerapids -O2, it is compiled to the following assembly.
The compiler automatically configures the Intel AMX physical registers, and ldtilecfg is hoisted out of the loop to reduce the tile configuration overhead. At the end of the function, the compiler generates a tile-release instruction to release Intel AMX; this reduces thread context switch overhead.
The Intel AMX feature was first supported in Linux* kernel v5.16-rc1. On Linux, we need to invoke a syscall to request Intel AMX permission from the kernel. We also need to enlarge the signal stack, since the signal context requires an extra 8 KB to save the Intel AMX registers. For details, see Intel AMX in Linux.
The following example shows a common way to initialize Intel AMX in code.
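A common initialization sequence, following the Linux kernel documentation (the constants mirror the kernel uapi headers; the function name is illustrative):

```c
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_GET_XCOMP_PERM 0x1022
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

// Request permission from the Linux kernel (v5.16+) to use Intel AMX
// tile data. Returns 1 if permission was granted, 0 otherwise.
static int init_amx(void) {
    unsigned long bitmask = 0;
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
        return 0;  // kernel too old, or Intel AMX not available
    if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask))
        return 0;
    return (bitmask & (1UL << XFEATURE_XTILEDATA)) != 0;
}
```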
Intel® AVX and Intel AVX-512 with VNNI
Intel AVX with VNNI instructions were added in this generation to complement the earlier Intel AVX-512 with VNNI instructions and to accelerate convolutional neural network-based algorithms. We also enhanced the LLVM compiler to automatically generate Intel AVX and Intel AVX-512 with VNNI instructions.
Benefit: By taking advantage of Intel AVX and Intel AVX-512 with VNNI instructions, dot-product performance for char and short-int vectors is greatly improved.
Use Case: Users need to pay attention to the signedness of the types in a dot product. Thanks to native instruction support, a dot product of a signed char with an unsigned char achieves the best performance. For dot products of two signed chars or two unsigned chars, extra sign or zero extensions to short int are generated before the dot product. Similarly, only signed short-int dot products can be accelerated on 4th gen Intel Xeon Scalable processors.
In the following example, the compiler can automatically generate an Intel AVX for VNNI instruction (vpdpbusd) from a multiply and sum reduction, with the command clang -S -march=sapphirerapids -O3
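A minimal sketch of such a reduction (the function name is illustrative):

```c
// Minimal sketch: a multiply-and-sum reduction over an unsigned char
// vector and a signed char vector. Compiled with
// clang -S -march=sapphirerapids -O3, the loop vectorizes to vpdpbusd.
int dot_product(const unsigned char *a, const signed char *b) {
    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += a[i] * b[i];
    return sum;
}
```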
The example shows a dot product between a 32-element unsigned char vector and a 32-element signed char vector. Performance is significantly improved compared to performing 32 unsigned multiplies (MUL) in a loop.
User Interrupts (UINTR)
UINTR provides a low-latency event delivery and interprocess communication (IPC) mechanism. These events can be delivered directly to user space without a transition to the kernel.
Benefit: System software developers can use this efficient interprocess communication to improve the performance of their workloads. For more information, see the benchmarks for IPC and UINTR.
Use Case: Starting with LLVM v14, the compiler supports the UINTR assembly, the intrinsics, and the compiler flag -muintr. The -march=sapphirerapids option also enables the UINTR feature. The interrupt attribute can be used to compile a function as a user-interrupt handler. In conjunction with the -muintr flag, the compiler:
- Generates the entry and exit sequences for the UINTR handler
- Handles the saving and restoring of registers
- Calls uiret to return from a user-interrupt handler
UINTR-related compiler intrinsics are declared in <x86gprintrin.h>:
- _clui() - Clears the user interrupt flag (UIF).
- _stui() - Sets the UIF.
- unsigned char _testui() - Stores the current UIF in an unsigned 8-bit integer dst.
- _senduipi(unsigned __int64 __a) - Sends the user interprocessor interrupt specified in the unsigned 64-bit integer __a.
The following is example code for a UINTR handler.
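A hypothetical sketch of such a handler (handler and variable names are illustrative; building requires -march=sapphirerapids or -muintr):

```c
#include <x86gprintrin.h>

// Minimal sketch: with the interrupt and general-regs-only attributes,
// the compiler emits the handler entry/exit sequence, saves and restores
// the general-purpose registers, and returns with uiret.
volatile unsigned long uintr_received;

struct __uintr_frame;  // interrupt frame type provided by the ABI

__attribute__((target("general-regs-only"), interrupt))
void uintr_handler(struct __uintr_frame *frame, unsigned long long uvec) {
    uintr_received = 1;  // record that the user interrupt arrived
}
```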
With the command clang -S -march=sapphirerapids -O2, it is compiled to the following assembly.
The general-regs-only attribute should be specified because an interrupt handler should not clobber SSE registers. Alternatively, we can pass the -mgeneral-regs-only option when building this file. The uiret instruction is generated by the compiler for the interrupt handler.
LLVM for Today’s and Tomorrow’s Development
The latest optimizations in LLVM v15.0.7 have considerably expanded the open source project and the benefits Intel compilers offer developers. These and future enhancements will continue to streamline and simplify development and deployment of deep learning applications on current and future Intel architecture. Applications that take advantage of the extended new type, interface and intrinsics, and enhanced vectorizations can realize performance gains for workloads in 5G, deep learning, system software, and many more. The latest LLVM release (15.0.7) and previous releases can be downloaded from LLVM.
Meanwhile, we offer LLVM-based Intel compilers that are able to achieve better performance on Intel hardware through more advanced optimizations. You can experiment with the latest ISA optimizations on the free Intel® Tiber™ AI Cloud, which has the latest Intel hardware and software. Additionally, you can download the latest LLVM-based compilers from Intel at Intel® toolkits.
Explore More
- LLVM
- Get to Know LLVM
- Download the latest LLVM-based compilers from Intel: Intel® oneAPI Base Toolkit, Intel® HPC Toolkit, Intel® oneAPI IoT Toolkit, and LLVM on GitHub.
- Software for New CPUs: Explore workload types, Intel® tools, and other resources to help you take full advantage of instruction sets in these Intel products.
- Benchmarks
- 4th gen Intel Xeon Scalable processors
- Intel Xeon CPU Max Series
Intel® oneAPI Base Toolkit
Develop high-performance, data-centric applications for CPUs, GPUs, and FPGAs with this core set of tools, libraries, and frameworks including LLVM-based compilers.