Story at a Glance
- On January 12, 2023, LLVM* 15.0.7 was released, continuing Intel’s long history of working with the LLVM community to contribute innovative and performance-enhancing optimizations to the LLVM open source project.
- The LLVM optimizations targeted the newly launched 4th Gen Intel® Xeon® Scalable Processors and Intel® Xeon CPU Max Series (formerly code named Sapphire Rapids) and consist of:
  - Instruction Set Architecture (ISA) support for Intel® Advanced Matrix Extensions (Intel® AMX), Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with FP16, Intel® Advanced Vector Extensions (Intel® AVX) with Vector Neural Network Instructions (VNNI), User Interrupts (UINTR), and more
  - The new type _Float16, which was extended to all x86 targets
  - Enhanced tile automatic configuration and register allocation for Intel AMX intrinsics, stabilizing the programming model
  - Introduction of a new attribute, general-regs-only, with UINTR to improve performance in interrupt handling
  - Improved performance of byte and short vector dot products
- Applications that take advantage of the extended new type, interface and intrinsics, and enhanced vectorizations can realize performance gains for workloads in 5G, deep learning, system software, and many more.
Intel® AVX-512 with an FP16 Instruction Set
Intel AVX-512 with FP16 is a comprehensive floating-point instruction set extension for the FP16 data type, comparable to the existing FP32 and FP64 support. It supports the complete set of arithmetic operations on the IEEE 754 binary16 floating-point type.
Benefit: Compared to the FP32 and FP64 floating-point formats, FP16 increases execution throughput and reduces storage by narrowing the data range and precision. The programmer needs to decide whether the FP16 type is suitable for their application.
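For instance, because the binary16 format has only a 10-bit mantissa, consecutive integers are exactly representable only up to 2048. A minimal sketch of this precision limit (assuming a target where _Float16 is available, and the default round-to-nearest-even mode):
#include <stdio.h>
int main(void) {
  _Float16 a = 2048.0f; /* exactly representable in binary16 */
  _Float16 b = 2049.0f; /* rounds to 2048; the spacing between representable values is 2 in this range */
  printf("%g %g\n", (double)a, (double)b); /* prints: 2048 2048 */
  return 0;
}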
Use Cases:
Users can use the new _Float16 type like other floating-point types such as float and double when the option -march=sapphirerapids is specified. Users can take advantage of vector instructions through either compiler auto-vectorization or hundreds of newly added intrinsics.
An Example of a Scalar Arithmetic Operation
_Float16 foo(_Float16 a, _Float16 b) {
  return a + b;
}
Compiled with the command clang -S -march=sapphirerapids -O2
foo: # @foo
vaddsh xmm0, xmm0, xmm1
ret
The example demonstrates how _Float16 can be used like a traditional float or double. AI workloads can benefit from the usability of the new type.
An Example of Compiler Auto-Vectorization
void foo(_Float16 *a, _Float16 *b, _Float16 *restrict c) {
  for (int i = 0; i < 32; ++i)
    c[i] = a[i] + b[i];
}
Compiled with the command clang -S -march=sapphirerapids -O2 -mprefer-vector-width=512
foo: # @foo
vmovups zmm0, zmmword ptr [rdi]
vaddph zmm0, zmm0, zmmword ptr [rsi]
vmovups zmmword ptr [rdx], zmm0
vzeroupper
ret
Auto-vectorization supports the _Float16 type too. Because _Float16 is half the size of float, a vector instruction of the same width processes twice as many elements, improving throughput.
An Example of Using New Intrinsics
#include <immintrin.h>
__m512h foo(__m512h a, __m512h b) {
  return _mm512_add_round_ph(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}
Compiled with the command clang -S -march=sapphirerapids -O2
foo: # @foo
vaddph zmm0, zmm0, zmm1, {rz-sae}
ret
The newly added intrinsics are much like the existing ones in both naming convention and arguments. Three new vector types __m128h, __m256h, and __m512h were introduced for these new intrinsics. To learn more about intrinsics, see the Intel® Intrinsics Guide.
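As an illustration of the naming convention, the 128-bit and 256-bit variants follow the same pattern. A minimal sketch using __m256h (the 128-bit and 256-bit forms assume Intel AVX-512 with FP16 plus Intel AVX-512VL, both enabled by -march=sapphirerapids):
#include <immintrin.h>
/* Scale-and-add on 16 half-precision elements with 256-bit FP16 intrinsics. */
__m256h scale_add(__m256h v, _Float16 scale, __m256h bias) {
  __m256h s = _mm256_set1_ph(scale);  /* broadcast the half-precision scalar */
  return _mm256_fmadd_ph(v, s, bias); /* v * s + bias, fused multiply-add */
}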
More information about the type and its ISA use can be found in the Technology Guide.
_Float16 Support for Targets without the Intel® AVX-512 with FP16 Feature
We expanded LLVM compiler support of the _Float16 type to all modern x86 targets that support Intel® Streaming SIMD Extensions 2 (Intel® SSE2) through software emulation.
Benefit: Users can now develop and run applications that use the _Float16 type across various Intel architecture systems, even when those systems do not support the features in Intel AVX-512 with FP16. They can then gain additional application performance simply by recompiling when deploying to a 4th gen Intel Xeon Scalable processor.
Use Case: On systems without the features in Intel AVX-512 with FP16, the compiler relies on a recent libgcc (version 12 and above) or compiler-rt (version 14 and above) for type conversion between _Float16 and float. Alternatively, users may use their own libraries. The linker reports a failure if none of these libraries is available.
To accelerate the emulation on previous-generation target systems, we provide vectorization support for Intel-based systems that support F16C. Note that the Intel AVX-512 with FP16 intrinsics are not supported on these previous-generation platforms.
An Example of a Scalar Arithmetic Operation
_Float16 foo(_Float16 a, _Float16 b) {
  return a + b;
}
Compiled with the command clang -S -msse2 -O2
foo: # @foo
push rax
movss dword ptr [rsp + 4], xmm0 # 4-byte Spill
movaps xmm0, xmm1
call __extendhfsf2@PLT
movss dword ptr [rsp], xmm0 # 4-byte Spill
movss xmm0, dword ptr [rsp + 4] # 4-byte Reload
call __extendhfsf2@PLT
addss xmm0, dword ptr [rsp] # 4-byte Folded Reload
call __truncsfhf2@PLT
pop rax
ret
With the -msse2 option, the compiler generates a call to the library routine __extendhfsf2 to extend the _Float16 type to the float type, emulates the addition in float, and then truncates the result back to the _Float16 type through a call to __truncsfhf2.
An Example of Compiler Auto-Vectorization
void foo(_Float16 *a, _Float16 *b, _Float16 *restrict c) {
  for (int i = 0; i < 8; ++i)
    c[i] = a[i] + b[i];
}
Compiled with the command clang -S -mf16c -O2
foo: # @foo
vcvtph2ps ymm0, xmmword ptr [rsi]
vcvtph2ps ymm1, xmmword ptr [rdi]
vaddps ymm0, ymm1, ymm0
vcvtps2ph xmmword ptr [rdx], ymm0, 4
vzeroupper
ret
With the -mf16c option, the vectorization is able to take advantage of F16C instructions for type conversion and generates assembly with better performance.
Intel® Advanced Matrix Extensions
Intel AMX is a new 64-bit programming paradigm that consists of two components:
- A set of two-dimensional registers (tiles) that represent subarrays from a larger two-dimensional memory image
- An accelerator able to operate on tiles
The first implementation is called tile matrix multiply unit (TMUL). The details of the Intel AMX ISA can be found in ISA Extensions. Intel AMX helps accelerate the matrix multiply computation, which is widely used in deep learning workloads. Using these new Intel AMX instructions can provide additional performance gains.
In LLVM v13, we added support for the Intel AMX programming model, which helps developers accelerate matrix multiply operations.
In LLVM v14, we enhanced the back end to better support the Intel AMX programming model in C/C++ and SYCL*. This enabled SYCL and the multi-level intermediate representation (MLIR), which has been adopted by popular deep learning frameworks (such as TensorFlow* and PyTorch*), to extend their languages on top of this infrastructure. (A more detailed look at the Intel AMX support in SYCL and its use can be found in LLVM GitHub* from Intel and an IEEE article that goes into implementation details.)
LLVM v15 enhanced the tile autoconfiguration and register allocation, and stabilized the programming model.
Benefit: By using the new Intel AMX programming paradigm, the performance of matrix math operations is greatly accelerated on the CPU for applications such as AI and machine learning.
Use Case: At the core of HPC and AI and machine learning applications is matrix math. The extension is designed for operating on matrices with the goal of accelerating the most prominent use case for the CPU in AI and machine learning, inference, with more capabilities for training.
The following code is an example of Intel AMX use. It produces the dot product of a row of tiles from matrix A and a column of tiles from matrix B; the result is accumulated into a tile of matrix C.
TC = {TA1, TA2, TA3} x transpose { TB1, TB2, TB3 }
#include <immintrin.h>
void amx_dp(char *bufa, char *bufb, int *bufc, int tile_nr) {
  __tile1024i a = {16, 64};
  __tile1024i b = {16, 64};
  __tile1024i c = {16, 64};
  __tile_zero(&c);
  #pragma nounroll
  for (int i = 0; i < tile_nr; i++) {
    __tile_loadd(&a, bufa + 64*i, 64*tile_nr);
    __tile_loadd(&b, bufb + 1024*i, 64);
    __tile_dpbssd(&c, a, b);
  }
  __tile_stored(bufc, 64, c);
}
With the command clang -S -march=sapphirerapids -O2, it is compiled to the following assembly.
amx_dp: # @amx_dp
push rbx
vxorps zmm0, zmm0, zmm0
vmovups zmmword ptr [rsp - 64], zmm0
mov byte ptr [rsp - 64], 1
mov byte ptr [rsp - 16], 16
mov word ptr [rsp - 48], 64
mov byte ptr [rsp - 15], 16
mov word ptr [rsp - 46], 64
mov byte ptr [rsp - 14], 16
mov word ptr [rsp - 44], 64
ldtilecfg [rsp - 64]
mov r8w, 64
mov ax, 16
tilezero tmm0
test ecx, ecx
jle .LBB0_3
mov r9d, ecx
shl ecx, 6
movsxd r10, ecx
xor ecx, ecx
mov r11d, 64
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov ebx, ecx
shl ebx, 6
add rbx, rdi
tileloadd tmm1, [rbx + r10]
mov ebx, ecx
shl ebx, 10
add rbx, rsi
tileloadd tmm2, [rbx + r11]
tdpbssd tmm0, tmm1, tmm2
inc rcx
cmp rcx, r9
jne .LBB0_2
.LBB0_3:
mov ecx, 64
tilestored [rdx + rcx], tmm0
pop rbx
tilerelease
vzeroupper
ret
The compiler automatically configures the Intel AMX physical registers, and the ldtilecfg instruction is hoisted out of the loop so that the tile configuration overhead is reduced. At the end of the function, the compiler generates the tilerelease instruction to release the Intel AMX state; this reduces the thread context switch overhead.
The Intel AMX feature was first supported in Linux* kernel v5.16-rc1. On Linux, we need to invoke a syscall to request Intel AMX from the kernel. We also need to enlarge the signal stack size, since an extra 8 KB is required in the signal context to save the Intel AMX registers. For details, see Intel AMX in Linux.
The following example shows a common way to initialize Intel AMX in code.
#include <err.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/auxv.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/signal.h>
#define fatal_error(msg, ...) err(1, "[FAIL]\t" msg, ##__VA_ARGS__)
#ifndef AT_MINSIGSTKSZ
# define AT_MINSIGSTKSZ 51
#endif
#define XFEATURE_XTILECFG 17
#define XFEATURE_XTILEDATA 18
#define XFEATURE_MASK_XTILECFG (1 << XFEATURE_XTILECFG)
#define XFEATURE_MASK_XTILEDATA (1 << XFEATURE_XTILEDATA)
#define XFEATURE_MASK_XTILE (XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)
#define ARCH_GET_XCOMP_PERM 0x1022
#define ARCH_REQ_XCOMP_PERM 0x1023
static void request_perm_xtile_data() {
  unsigned long bitmask;
  long rc;

  rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
  if (rc)
    fatal_error("XTILE_DATA request failed: %ld", rc);

  rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask);
  if (rc)
    fatal_error("prctl(ARCH_GET_XCOMP_PERM) error: %ld", rc);

  if (bitmask & XFEATURE_MASK_XTILE)
    printf("ARCH_REQ_XCOMP_PERM XTILE_DATA successful.\n");
}

static void setup_sigaltstack() {
  unsigned long minsigstksz, new_size;
  void *altstack;
  stack_t ss;
  int rc;

  minsigstksz = getauxval(AT_MINSIGSTKSZ);
  printf("AT_MINSIGSTKSZ = %lu\n", minsigstksz);
  /*
   * getauxval() itself can return 0 for failure or
   * success. But, in this case, AT_MINSIGSTKSZ
   * will always return a >=0 value if implemented.
   * Just check for 0.
   */
  if (minsigstksz == 0)
    fatal_error("no support for AT_MINSIGSTKSZ");

  new_size = minsigstksz * 2;
  altstack = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (altstack == MAP_FAILED)
    fatal_error("mmap() for altstack");

  memset(&ss, 0, sizeof(ss));
  ss.ss_size = new_size;
  ss.ss_sp = altstack;
  rc = sigaltstack(&ss, NULL);
  if (rc)
    fatal_error("sigaltstack failed: %d", rc);
}

void initialize_amx() {
  setup_sigaltstack();
  request_perm_xtile_data();
}
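Putting the pieces together, a host program would call initialize_amx() once before executing any tile intrinsics. A minimal sketch that reuses the amx_dp() routine shown earlier (the buffer sizes follow the 16-row, 64-byte tile shapes and tile_nr strips used in that example; it assumes a 4th gen Intel Xeon Scalable processor running Linux kernel v5.16 or newer):
#include <stdlib.h>

void initialize_amx(void); /* defined above */
void amx_dp(char *bufa, char *bufb, int *bufc, int tile_nr); /* defined above */

int main(void) {
  enum { TILE_NR = 2 };
  /* One 16 x (64 * TILE_NR) byte strip of A, TILE_NR 16x64-byte tiles of B,
     and one 16x16 tile of 32-bit accumulators for C. */
  char *bufa = calloc(16 * 64 * TILE_NR, 1);
  char *bufb = calloc(16 * 64 * TILE_NR, 1);
  int *bufc = calloc(16 * 16, sizeof(int));
  if (!bufa || !bufb || !bufc)
    return 1;

  initialize_amx(); /* request XTILEDATA permission and enlarge the signal stack */
  amx_dp(bufa, bufb, bufc, TILE_NR);
  return 0;
}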
Intel® AVX and Intel AVX-512 with VNNI
Intel AVX with VNNI instructions were added in this generation to complement the Intel AVX-512 with VNNI instructions of previous generations and to accelerate convolutional neural network-based algorithms. We also enhanced the LLVM compiler to automatically generate Intel AVX and Intel AVX-512 with VNNI instructions.
Benefit: By taking advantage of Intel AVX and Intel AVX-512 with VNNI instructions, the performance of dot products on char or short int vectors is greatly improved.
Use Case: Users need to pay attention to the signedness of the types in a dot product. Thanks to native instruction support, a dot product of unsigned char and signed char achieves the best performance. For dot products of two signed char vectors or two unsigned char vectors, extra sign or zero extensions to short int are generated before the dot product. Similarly, only signed short int dot products can be accelerated on 4th gen Intel Xeon Scalable processors.
In the following example, the compiler automatically generates an Intel AVX VNNI instruction (vpdpbusd) from a multiply and sum reduction, with the command clang -S -march=sapphirerapids -O3.
int usdot_prod_qi(unsigned char *restrict a, char *restrict b, int c) {
  int i;
  for (i = 0; i < 32; i++) {
    c += a[i] * b[i];
  }
  return c;
}
usdot_prod_qi: # @usdot_prod_qi
vmovdqu ymm0, ymmword ptr [rdi]
vpxor xmm1, xmm1, xmm1
{vex} vpdpbusd ymm1, ymm0, ymmword ptr [rsi]
vextracti128 xmm0, ymm1, 1
vpaddd xmm0, xmm1, xmm0
vpshufd xmm1, xmm0, 238 # xmm1 = xmm0[2,3,2,3]
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
add eax, edx
vzeroupper
ret
The example shows a dot product between a 32-element unsigned char vector and a 32-element signed char vector. The performance is significantly improved compared to performing 32 scalar multiplies (MUL) and additions in a loop.
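For the signed short int case mentioned above, a similar reduction loop can be accelerated with the vpdpwssd instruction. A minimal sketch (exact code generation may vary with compiler version and options):
int sdot_prod_hi(short *restrict a, short *restrict b, int c) {
  /* Signed 16-bit dot product; with clang -S -march=sapphirerapids -O3 the loop
     may be vectorized using the VNNI instruction vpdpwssd. */
  for (int i = 0; i < 32; i++) {
    c += a[i] * b[i];
  }
  return c;
}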
User Interrupts (UINTR)
UINTR provides a low-latency event delivery and interprocess communication (IPC) mechanism. These events can be delivered directly to user space without a transition to the kernel.
Benefit: System software developers can benefit from the efficiency of this interprocess communication to improve the performance of their workloads. For more information, see the benchmarks for IPC and UINTR.
Use Case: Starting with LLVM v14, the compiler supports UINTR assembly, intrinsics, and the compiler flag -muintr. The -march=sapphirerapids option also enables the UINTR feature. The “(interrupt)” attribute can be used to compile a function as a user-interrupt handler. In conjunction with the -muintr flag, the compiler:
- Generates the entry and exit sequences for the UINTR handler
- Handles the saving and restoring of registers
- Calls uiret to return from a user-interrupt handler
UINTR-related compiler intrinsics are declared in <x86gprintrin.h> (see the brief example after this list):
- _clui() - Clears the user interrupt flag (UIF).
- _stui() - Sets the UIF.
- unsigned char _testui() - Stores the current UIF in an unsigned 8-bit integer dst.
- _senduipi(unsigned __int64 __a) - Sends a user interprocessor interrupt specified by the unsigned 64-bit integer __a.
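For example, the flag intrinsics can be used to keep user interrupts from being delivered inside a short critical section. A minimal sketch (compiled with -muintr or -march=sapphirerapids; it assumes a CPU and kernel with UINTR support and a handler already registered):
#include <x86gprintrin.h>

void update_shared_state(void) {
  unsigned char was_enabled = _testui(); /* remember the current UIF */
  _clui();                               /* block user-interrupt delivery */
  /* ... update data that the UINTR handler also touches ... */
  if (was_enabled)
    _stui();                             /* re-enable delivery only if it was enabled before */
}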
The following is example code for a UINTR handler.
#include <unistd.h>
#include <x86gprintrin.h>
unsigned int uintr_received;
void
__attribute__((interrupt))
__attribute__((target("general-regs-only")))
uintr_handler(struct __uintr_frame *ui_frame,
              unsigned long long vector) {
  static const char print[] = "Received User Interrupt handler\n";
  write(STDOUT_FILENO, print, sizeof(print) - 1);
  uintr_received = 1;
}
With the command clang -S -march=sapphirerapids -O2, it is compiled to the following assembly.
uintr_handler(__uintr_frame*, unsigned long long):
push rax
push r11
push r10
push r9
push r8
push rdi
push rsi
push rdx
push rcx
push rax
push rax
cld
lea rsi, [rip + uintr_handler(__uintr_frame*, unsigned long long)::print]
mov edx, 30
mov edi, 1
call write@PLT
mov dword ptr [rip + uintr_received], 1
add rsp, 8
pop rax
pop rcx
pop rdx
pop rsi
pop rdi
pop r8
pop r9
pop r10
pop r11
add rsp, 16
uiret
uintr_received:
.long 0 # 0x0
The attribute general-regs-only should be specified because an interrupt handler should not clobber SSE registers. Alternatively, we can specify the option -mgeneral-regs-only when building this file. The uiret instruction is generated by the compiler to return from the interrupt handler.
LLVM for Today’s and Tomorrow’s Development
The latest optimizations in LLVM v15.0.7 considerably expand the open source project and the benefits Intel compilers offer developers. These and future enhancements will continue to streamline and simplify the development and deployment of deep learning applications on current and future Intel architecture. Applications that take advantage of the extended new type, interface and intrinsics, and enhanced vectorizations can realize performance gains for workloads in 5G, deep learning, system software, and many more. The latest LLVM release (15.0.7) and previous releases can be downloaded from LLVM.
Meanwhile, we offer LLVM-based Intel compilers that are able to achieve better performance on Intel hardware through more advanced optimizations. You can experiment with the latest ISA optimizations on the free Intel® Developer Cloud, which has the latest Intel hardware and software. Additionally, you can download the latest LLVM-based compilers from Intel at Intel® toolkits.
Explore More
- LLVM
- Get to Know LLVM
- Download the latest LLVM-based compilers from Intel: Intel® oneAPI Base Toolkit, Intel® HPC Toolkit, Intel® oneAPI IoT Toolkit, and LLVM on GitHub.
- Software for New CPUs: Explore workload types, Intel® tools, and other resources to help you take full advantage of instruction sets in these Intel products.
- Benchmarks
- 4th gen Intel Xeon Scalable processors
- Intel Xeon CPU Max Series