By: Fabian Boemer
At Intel, our commitment and responsibility is to deliver technologies and products that enable data to be computed securely and privately, from the edge to the cloud. Throughout the compute stack, Intel technologies such as Intel® Software Guard Extensions (Intel® SGX) and the optimized cryptography libraries on our recently announced 3rd generation Intel® Xeon® Scalable processor family help preserve workload confidentiality and data security.
Continued improvements in algorithmic and computational technologies have enabled additional approaches to data security, particularly in the last decade. While the idea of Homomorphic Encryption (HE) has been around since the 1970s, the substantial runtime overhead of performing computation on encrypted data has prevented HE from being a performant solution in practice until recently.
Intel has been active in the HE community for several years. We recently released the Intel® Homomorphic Encryption Toolkit (Intel® HE Toolkit), which enables optimized HE performance on the latest Intel platforms. Intel is collaborating with NASDAQ* to accelerate their HE research and development. We’re also partnering with Microsoft* in the Defense Advanced Research Projects Agency (DARPA) Data Protection in Virtual Environments (DPRIVE) program to develop an accelerator for fully homomorphic encryption.
Today, we’d like to highlight the Intel® Homomorphic Encryption Acceleration Library (Intel® HEXL). Intel HEXL is an open source software library that accelerates the key cryptographic primitives in HE workloads using optimizations in the latest Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. Since the initial release of Intel HEXL in April 2021, we’ve worked with Microsoft to integrate Intel HEXL into the Microsoft SEAL HE library and with Duality Technologies* to integrate Intel HEXL into the PALISADE HE library. For optimized performance, Intel HEXL utilizes the Intel AVX-512 Integer Fused Multiply Add instructions (Intel® AVX-512-IFMA) available in the latest 3rd generation Intel® Xeon® Scalable processor family. This article provides a quick look at how Intel HEXL takes advantage of Intel AVX-512-IFMA to accelerate HE. For more details, see our technical paper.
Homomorphic Encryption Workloads
Before diving into Intel HEXL and Intel AVX-512-IFMA, we’ll first give a brief background on HE workloads. Many HE schemes are based on computation in the finite polynomial quotient ring R_q = Z_q[X] / (X^N + 1), where q is an integer known as the coefficient modulus and N is known as the polynomial modulus degree. To understand what this means, let’s parse this expression one step at a time.
- Given an integer q, known as the coefficient modulus, Z_q is the set of integers {0, 1, …, q − 1}, where addition and multiplication are performed modulo q (when q is prime, Z_q is a finite field). For instance, in Z_4 = {0, 1, 2, 3}, 2 + 3 mod 4 = 1.
- Now, Z_q[X] is simply the set of polynomials with coefficients in Z_q. So the polynomial 1 + 2x + 3x^2 is in Z_4[X], but the polynomial 4x is not, since the coefficient 4 is not in Z_4.
- Now, when we add two polynomials of a given degree n in Z_q[X], the result is a polynomial of degree at most n. However, when we multiply two polynomials of degree n, the degree of the product is 2n.
- The denominator (X^N + 1) means that whenever the product of two polynomials in Z_q[X] reaches degree N or higher, we divide the product by the polynomial X^N + 1 and take the result to be the remainder. Equivalently, since X^N ≡ −1 (mod X^N + 1), any term of degree N + k wraps around to a term of degree k with its sign flipped, as in the sketch below.
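To make the reduction concrete, here is a minimal C++ sketch of schoolbook multiplication in Z_q[X] / (X^N + 1). This is our own illustrative code, not taken from Intel HEXL; the function name and the assumption that q fits in 32 bits are ours.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch (illustrative, not from Intel HEXL): schoolbook
// multiplication in Z_q[X] / (X^N + 1). Since X^N = -1 (mod X^N + 1),
// any product term of degree i + j >= N wraps around to degree
// i + j - N with a sign flip. Assumes q < 2^32 so that f[i] * g[j]
// fits in 64 bits without overflow.
std::vector<uint64_t> PolyMulNegacyclic(const std::vector<uint64_t>& f,
                                        const std::vector<uint64_t>& g,
                                        uint64_t q) {
  const size_t N = f.size();
  std::vector<uint64_t> result(N, 0);
  for (size_t i = 0; i < N; ++i) {
    for (size_t j = 0; j < N; ++j) {
      uint64_t term = (f[i] * g[j]) % q;
      if (i + j < N) {
        // Term of degree i + j stays in place.
        result[i + j] = (result[i + j] + term) % q;
      } else {
        // X^(i+j) = -X^(i+j-N), so subtract the term modulo q.
        result[i + j - N] = (result[i + j - N] + q - term) % q;
      }
    }
  }
  return result;
}
```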
There are a few aspects of computation in R_q that differentiate HE from other workloads:
- The underlying math behind HE computation is integer-based. So any improvements to floating-point arithmetic, such as the bfloat16 floating-point format, are not directly useful for accelerating HE workloads.
- The polynomials in R_q are very large; the degree is typically a power of two ranging from N = 1024 to N = 32768, and each coefficient can be up to hundreds of bits.
- Modular computation is not natively supported on typical hardware architectures, and requires a distinct set of techniques to accelerate.
To summarize, performing HE workloads requires an enormous amount of modular integer computation. This is where Intel AVX-512-IFMA shines.
Intel AVX-512-IFMA
Intel HEXL utilizes two instructions in Intel AVX-512-IFMA: vpmadd52luq and vpmadd52huq. Intel AVX-512 is a Single Instruction Multiple Data (SIMD) instruction set which enables simultaneous computation on 512 bits of data at a time (which can represent, as in Intel HEXL, eight 64-bit integers). For instance, we might have an Intel AVX-512 integer variable x, which stores the vector of elements [1, 2, 3, 4, 5, 6, 7, 8], and another variable y, which stores the vector of elements [9, 10, 11, 12, 13, 14, 15, 16]. The vpaddq instruction will simply add these two vectors element-wise, so vpaddq(x, y) yields the vector of elements [10, 12, 14, 16, 18, 20, 22, 24]. Simple enough.
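In C++, the same element-wise addition is available through the _mm512_add_epi64 compiler intrinsic, which compiles to vpaddq. A minimal sketch, using the vector values from the example above:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
  // Eight 64-bit lanes per 512-bit register. Note _mm512_set_epi64
  // takes its arguments highest-lane first.
  __m512i x = _mm512_set_epi64(8, 7, 6, 5, 4, 3, 2, 1);
  __m512i y = _mm512_set_epi64(16, 15, 14, 13, 12, 11, 10, 9);

  // vpaddq: element-wise 64-bit addition.
  __m512i sum = _mm512_add_epi64(x, y);

  alignas(64) uint64_t out[8];
  _mm512_store_si512(reinterpret_cast<__m512i*>(out), sum);
  for (int i = 0; i < 8; ++i) {
    printf("%llu ", (unsigned long long)out[i]);
  }
  // Prints: 10 12 14 16 18 20 22 24
  return 0;
}
```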
But if we want to multiply integers in a SIMD fashion, we run into a problem: Given two 64-bit integers, the product is up to 128 bits. To prevent data expansion, we need to condense the result to at most 64 bits. Two natural choices would be to either take the low 64 bits of the result, or take the high 64 bits of the result. In fact, the vpmullq instruction implements the first choice, returning the low 64 bits of the result in each of the eight 64-bit elements. There is no instruction in Intel AVX-512 that returns the high 64 bits of the result.
vpmadd52luq and vpmadd52huq take a different approach. They assume the input 64-bit elements are actually at most 52 bits. The product of two 52-bit numbers yields a 104-bit number, which is still too large to store in a 64-bit result. So vpmadd52luq returns the low 52 bits of the product, and vpmadd52huq returns the high 52 bits. Both instructions also add a 64-bit accumulator to the result; for our purposes, we can ignore the add step by setting the accumulator to zero. So what makes vpmadd52luq and vpmadd52huq useful for HE?
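The two instructions are exposed in C++ as the _mm512_madd52lo_epu64 and _mm512_madd52hi_epu64 intrinsics. Here is a brief sketch (the helper function is our own) of the "ignore the add step" trick, passing a zero accumulator to turn the fused multiply-add into a plain multiply:

```cpp
#include <immintrin.h>

// Sketch: element-wise low and high 52-bit halves of the 104-bit
// product of two vectors of (at most) 52-bit integers. Requires a CPU
// with AVX512IFMA support. Passing a zero accumulator reduces the
// fused multiply-add to a plain multiply.
void Mul52(__m512i a, __m512i b, __m512i* lo, __m512i* hi) {
  __m512i zero = _mm512_setzero_si512();
  // vpmadd52luq: zero + low 52 bits of a * b, per 64-bit lane.
  *lo = _mm512_madd52lo_epu64(zero, a, b);
  // vpmadd52huq: zero + high 52 bits of a * b, per 64-bit lane.
  *hi = _mm512_madd52hi_epu64(zero, a, b);
}
```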
Number-theoretic Transform (NTT)
One of the key kernels in HE workloads is the number-theoretic transform (NTT). The NTT is the finite field equivalent of the fast Fourier transform (FFT), a common algorithm used in signal processing and other fields: if you took all the additions and multiplications in an integer FFT and replaced them with modular additions and modular multiplications, you’d end up with the NTT.

The NTT is widely used to multiply two polynomials in R_q efficiently. While regular schoolbook multiplication is extremely slow due to the large polynomial degrees and the O(N^2) runtime, the NTT reduces the complexity to O(N log N) for a much faster runtime. Note, the multi-word modulus q is often chosen to be a product of co-prime moduli, which enables use of the residue number system (RNS) to decompose the multi-word arithmetic into arithmetic on each word-sized prime; hence, we focus on word-sized moduli q. Multiplication in R_q is much more computationally expensive than addition, so the NTT becomes a major bottleneck for many HE workloads. Given two polynomials f(x), g(x) in R_q, we can multiply them by computing InvNTT(FwdNTT(f(x)) * FwdNTT(g(x))), where “*” is element-wise modular multiplication, FwdNTT is the forward transform, and InvNTT is the inverse transform (see the sketch below). The NTT has several recursive formulations, which lead to a variety of implementations and flavors. The base case of the recursion is known as the “butterfly,” which typically performs the NTT on a small number of elements, such as two in Intel HEXL.
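To make this concrete, here is a sketch of NTT-based polynomial multiplication using the Intel HEXL public API. The calls follow the HEXL 1.x headers, and the degree and modulus below are illustrative choices (a 36-bit NTT-friendly prime); check the headers of your HEXL version for exact signatures.

```cpp
#include "hexl/hexl.hpp"
#include <cstdint>
#include <vector>

int main() {
  uint64_t N = 1024;         // polynomial modulus degree (power of two)
  uint64_t q = 0xffffee001;  // word-sized prime with q = 1 mod 2N

  std::vector<uint64_t> f(N, 1);  // example operands, coefficients < q
  std::vector<uint64_t> g(N, 2);

  // Forward NTT of both operands (computed in place here).
  intel::hexl::NTT ntt(N, q);
  ntt.ComputeForward(f.data(), f.data(), 1, 1);
  ntt.ComputeForward(g.data(), g.data(), 1, 1);

  // Element-wise modular multiplication in the NTT domain.
  std::vector<uint64_t> prod(N);
  intel::hexl::EltwiseMultMod(prod.data(), f.data(), g.data(), N, q, 1);

  // Inverse NTT yields the product f * g in R_q.
  ntt.ComputeInverse(prod.data(), prod.data(), 1, 1);
  return 0;
}
```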
Algorithm 3 and Algorithm 4 from David Harvey’s Faster Arithmetic for Number-Theoretic Transforms show the butterflies for a particular variant of the NTT used in homomorphic encryption.
Here, the butterfly performs an NTT on two elements, X and Y. The finite field modulus is p, which is assumed to be word-sized as well. W is a precomputed power of a root of unity modulo p, and W’ is a precomputed quotient derived from W, p, and the machine word size.
In Harvey’s work, the machine word size β is typically 2^32 or 2^64.
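To make the butterfly concrete, here is a scalar C++ sketch of Harvey’s forward butterfly with β = 2^64. This is our own illustrative code (Intel HEXL implements vectorized variants), and it uses the unsigned __int128 compiler extension for the 128-bit intermediate product.

```cpp
#include <cstdint>

// Scalar sketch of Harvey's forward butterfly with beta = 2^64.
// Preconditions (following Harvey): p < beta/4 = 2^62, X, Y < 4p,
// 0 < W < p, and the precomputed factor Wprecon = floor(W * 2^64 / p).
// Outputs remain in [0, 4p); this "lazy" reduction defers the final
// mod-p step to the end of the transform.
void FwdButterfly(uint64_t* X, uint64_t* Y, uint64_t W, uint64_t Wprecon,
                  uint64_t p) {
  uint64_t x = *X;
  if (x >= 2 * p) x -= 2 * p;  // keep x < 2p

  // Q = floor(Wprecon * Y / beta), i.e., the high 64 bits of the
  // product. This is the step that maps to vpmadd52huq when
  // beta = 2^52 instead of 2^64.
  uint64_t Q = static_cast<uint64_t>(
      (static_cast<unsigned __int128>(Wprecon) * (*Y)) >> 64);

  // T = (W * Y - Q * p) mod 2^64; the wrapping arithmetic is intended.
  uint64_t T = W * (*Y) - Q * p;

  *X = x + T;          // X' < 4p
  *Y = x - T + 2 * p;  // Y' < 4p
}
```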
How does Intel HEXL use Intel AVX-512-IFMA?
Now we get to the unique innovation of Intel HEXL: take β = 2^52. Then the computation of Q directly maps to a vpmadd52huq instruction (with the accumulation argument set to zero). Without Intel AVX-512-IFMA, the natural choice is to take β = 2^64. Since there is no Intel AVX-512 instruction returning the element-wise high 64 bits of a 64-bit by 64-bit product, a sequence of up to 14 Intel AVX-512 instructions could be used to compute Q. Intel AVX-512-IFMA reduces these 14 instructions to a single call to vpmadd52huq!
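In intrinsics, the Q computation then becomes a single call. A minimal sketch (our own helper name), assuming each 64-bit lane holds a value of at most 52 bits:

```cpp
#include <immintrin.h>

// Sketch: with beta = 2^52, Q = floor(Wprecon * Y / beta) is exactly
// the high 52 bits of the product, i.e., one vpmadd52huq with a zero
// accumulator. One instruction computes Q for eight butterflies at once.
__m512i ComputeQ(__m512i Wprecon, __m512i Y) {
  return _mm512_madd52hi_epu64(_mm512_setzero_si512(), Wprecon, Y);
}
```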
There is one caveat to this approach: the restriction p < β/4 means this optimization only applies when the finite field modulus p is less than 2^50 (since β = 2^52), rather than p < 2^62 (as is the case when β = 2^64). In practice, the choice of modulus roughly corresponds to the bits of precision in the HE computation. So if 50-bit computation suffices for your workload, this optimization may help! Refer to Table 1 and Table 2 in the technical paper to see how the overall NTT performance compares against a native C++ baseline and an implementation using only Intel AVX-512-DQ.
With that, we’ll conclude our deep dive into why Intel AVX-512-IFMA provides a unique speedup for HE applications. We hope we’ve given you a glimpse into the unique innovations that Intel HEXL provides, and why the Intel AVX-512-IFMA instructions are particularly powerful for HE workloads such as the NTT. We invite you to try out Intel HEXL at https://github.com/intel/hexl; we welcome discussions and bug reports. We also suggest exploring the Microsoft SEAL and Duality Technologies PALISADE HE libraries, which have adopted Intel HEXL.