Accelerating x265 with Intel® Advanced Vector Extensions 512 (Intel®...

Introduction

Motivation

Vector units in CPUs have become the de facto standard for accelerating media and other kernels that exhibit parallelism according to the single instruction, multiple data (SIMD) paradigm.¹ These units enable a single register file to be treated as a combination of multiple registers, whose cumulative width equals that of the vector register file. A single instruction can therefore operate in parallel on all data in this vector register, resulting in significant speedups for applications whose data access patterns fit this model. Starting from a 64-bit vector register file that may be treated as eight 8-bit registers in the architecture extended with MMX™ technology, SIMD on Intel® architecture processors has evolved to enable 256-bit register files that allow for 32 parallel 8-bit operations in the Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) generations.

Kernels in media workloads fit this pattern of execution naturally, because the same operation (filtering for example) is uniformly applied across several pixels of a frame. Consequently, several popular open source projects leverage SIMD instructions for code acceleration. The x264 project for Advanced Video Coding (AVC) encoding² and the x265 project for High Efficiency Video Coding (HEVC) encoding³ are the two widely used media libraries that extensively use multiple generations of SIMD instructions on Intel architecture processors, from MMX technology all the way up to Intel AVX2. As shown in Figure 1, x264 and x265 achieve two times and five times speedup respectively over their corresponding baselines that do not use any SIMD code. The x265 encoder gains more performance from Intel AVX2 when compared to x264, because the quantum of work done per frame is substantially larger for HEVC than for AVC.⁴

Figure 1. Performance benefit for x264 and x265 from Intel® Advanced Vector Extensions 2 for 1080p encoding with main profile using an Intel® Core™ i7-4500U Processor.

Focus of this whitepaper

The recently released Intel® Xeon® Scalable processors, part of the platform formerly code-named Purley, have introduced the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set.⁵ Intel AVX-512 instructions are capable of performing two times the number of operations in the same number of cycles as the previous generation Intel AVX2 instruction set. To accommodate this increased throughput, a larger fraction of the die is utilized, resulting in increased power being consumed, when compared to the previous-generation SIMD units. Therefore, certain Intel AVX-512 instructions are expected to cause a higher degradation to CPU clock frequency than others.⁶ While this reduction in frequency is offset by the increased throughput for the Intel AVX-512 instructions, media kernels continue to rely significantly on SIMD instructions in older generations (because not all kernels benefit from the increased width) and on straight-line C code that is not amenable to SIMD conversion, which may see reduced performance.

This whitepaper presents a case study based on our experience using the Intel AVX-512 SIMD instructions to accelerate the compute intensive kernels of x265. We describe how we offset the reduction in CPU frequency to ensure that the overall encoder achieves positive performance benefits. Through this process, we present recommendations of when we think Intel AVX-512 should be enabled with x265 for HEVC encoding. We also share our experience on when to choose Intel AVX-512 as a vehicle for accelerating media kernels.

Key takeaways

Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:

When choosing specific kernels to accelerate with Intel AVX-512, consider each kernel's compute-to-memory ratio. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64 bytes in order to avoid loads that cross cache-line boundaries.
For desktop and workstation SKUs (like the Intel® Core™ i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations, because the reduction in CPU clock frequency is rather low.
For server SKUs (like the Intel® Xeon® Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.

Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.

While the results and recommendations presented in this paper are subject to the limitations of our evaluations and experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.

The rest of the paper is organized as follows: The "Background" section presents the background relevant to the technical material presented in the paper. "Acceleration of x265 Kernels with Intel Advanced Vector Extensions 512" discusses the choices we made to accelerate specific kernels of x265 and discusses results for the main and main10 profiles. "Accelerating x265 Encoding with Intel Advanced Vector Extensions 512" presents the results for the overall encoder for the main and main10 profiles. Finally, Section 5 provides detailed recommendations for when Intel AVX-512 should be enabled when using x265 and generic recommendations for when Intel AVX-512 should be chosen when accelerating specific kernels. This section also describes future work.

Background

This section presents the relevant background of the concepts presented in this paper. Specifically, section "HEVC Video Encoding" provides the background on HEVC. "x265, an Open Source HEVC Encoder" discusses x265 with specific focus on the existing methods of performance optimizations that it employs. Section "Introduction to the Intel® Xeon® Scalable Processor Platform" presents the relevant background on Intel Xeon Scalable processors, and Section "SIMD Vectorization Using Intel Advanced Vector Extensions 512" discusses in more detail the Intel AVX-512 architecture.

HEVC video encoding

HEVC was ratified as an encoding standard by the JCT-VC (Joint Collaborative Team on Video Coding) in 2013 as a successor to the vastly popular AVC standard.⁴ The video encoding and decoding processes in HEVC revolve around identifying three units: a coding unit (CU) that represents each block in the picture, a prediction unit (PU) that represents the mode decision, including motion compensated prediction of the CU, and a transform unit (TU) that represents the way in which the generated residual error between the predicted and the actual block is coded.

Initially, a frame is divided into a sequence of its largest non-overlapping coding units, each called a coding tree unit (CTU). A CTU can then be split into multiple CUs with variable sizes of 64x64, 32x32, 16x16, and 8x8 to form a quad-tree. Each CU is then predicted from a set of candidate blocks, which may be in either the same frame or different frames. If the block used for the prediction is in the same frame, the block is said to be intra-predicted, while if it is in a different frame, it is said to be inter-predicted.
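The quad-tree partitioning described above can be sketched in a few lines of Python. This is a minimal illustration and not x265 code: the function name split_ctu and the should_split callback are hypothetical stand-ins for the encoder's actual split decision, which in practice is driven by rate-distortion cost.

```python
def split_ctu(x, y, size, should_split, min_size=8):
    """Recursively partition a square block into CUs (quad-tree).

    should_split(x, y, size) is a hypothetical callback standing in for
    the encoder's decision; it returns True to split the block in four.
    Returns a list of (x, y, size) leaf CUs.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):
            for dx in (0, half):
                cus.extend(split_ctu(x + dx, y + dy, half, should_split, min_size))
        return cus
    return [(x, y, size)]


# Example decision: split everything down to 16x16, then split only the
# top-left 16x16 block once more into 8x8 CUs.
def example_decision(x, y, size):
    return size > 16 or (x, y) == (0, 0)


leaves = split_ctu(0, 0, 64, example_decision)
print(len(leaves), "CUs cover", sum(s * s for _, _, s in leaves), "pixels")
```

Whatever decision function is used, the leaf CUs always tile the 64x64 CTU exactly, which is a convenient sanity check for any partitioning code.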

Intra-predicted blocks are represented by a combination of the prediction block and a mode that denotes the angle of the prediction. The allowed modes for intra-prediction are labeled DC, planar, and angular modes representing various angles from the predicted block. Inter-predicted blocks are represented by a combination of the block used for prediction (the reference block) and the motion vector (MV) that represents the delta between the current and the reference block. Blocks that have zero MV are said to use the merge mode, while others use the AMVP (Advanced Motion Vector Prediction) mode. The skip mode is a special case of the merge mode in which the predicted block is identical to the source, that is, there is no residual. The AMVP modes may use PUs that are the same size as the CU (denoted as 2Nx2N PUs) or may further partition them (denoted as rectangular and asymmetric PUs) to compute the MVs. The residual generated as the difference between the original and the predicted picture is then quantized and coded using TUs that may vary from 32x32 down to 4x4 blocks, depending on the prediction mode.

The entire process of inter, intra, CU, PU, and TU selection is called Rate-Distortion Optimization (RDO). The goal of RDO is to ensure that distortion is minimized at the target bitrate, or that the bitrate is minimized at the target quality level as represented by distortion. Throughout the process of RDO, an encoder attempts various combinations of CUs, PUs, and TUs, for which it employs several kernels. In this paper, we chose to vectorize these specific kernels by converting them to use Intel AVX-512 instructions.

HEVC encoding also supports multiple profiles for encoding a video, with each profile representing a different number of bits used to represent each pixel. The main and main10 profiles are popular profiles of HEVC (their AVC counterparts are called main and high profiles respectively). Each component of a pixel is represented with a minimum of 8 bits in the main profile, resulting in values ranging from 0–255. The main10 profile uses 10 bits per component, allowing for a higher range of 0–1023 and enabling the representation of more detail in the encoded video.

x265, an Open Source HEVC Encoder

The x265 encoder is an open-source HEVC encoder that compresses video in compliance with the HEVC standard.⁷ This encoder has been integrated into several open-source frameworks including VLC*, HandBrake*,⁸ and FFmpeg⁹ and is the de facto open-source video encoder for HEVC. The x265 encoder has assembly optimizations for several platforms, including Intel architecture, ARM*, and PowerPC*.

The x265 encoder employs techniques for inter-frame and intra-frame parallelism to deal with the increased complexity of HEVC encoding.¹⁰ For inter-frame parallelism, x265 encodes multiple frames in parallel by using system-level software threads. For intra-frame parallelism, x265 relies on the Wavefront Parallel Processing (WPP) tool exposed by the HEVC standard. This feature enables encoding rows of CTUs of a given frame in parallel, while ensuring that the blocks required for intra-prediction from the previous row are completed before the given block starts to encode; as per the standard, this translates to ensuring that the next CTU on the previous row completes before starting the encode of a CTU on the current row. The combination of these features gives a tremendous boost in speed with no loss in efficiency compared to the publicly available reference encoder, HM.

Introduction to the Intel® Xeon® processor Scalable family platform

The Intel® Xeon® processor Scalable family, part of the Intel® platform formerly code-named Purley, is designed to deliver new levels of consistent and breakthrough performance. The platform is based on cutting-edge technology and provides compelling benefits across a broad variety of usage models including big data, artificial intelligence, high-performance computing, enterprise-class IT, cloud, storage, communication, and Internet of Things. Top enhancements include performance for a wide range of workloads with one and a half times the memory bandwidth, integrated network/fabric, and optional integrated accelerators. Our results in x265 indicate a significant gen-over-gen speedup of 50–67 percent for offline encodes when compared to the previous-generation Intel® Xeon® processor E5-2600.¹⁰ This boost comes primarily from the improved microarchitecture features available on Intel Xeon Scalable processors.

SIMD vectorization using Intel® AVX-512

The Intel AVX-512 vector blocks present a 512-bit register file, allowing twice as many parallel data operations per cycle as Intel AVX2. Though the benefits of vectorizing kernels to use the Intel AVX-512 architecture seem obvious, several key questions must be answered specifically for media workloads before embarking on this task. First, is there sufficient parallelism inherently present in media kernels for them to leverage this increased width? Second, is the fraction of the execution that exploits this parallelism sufficiently large that we can expect meaningful average speedups as per Amdahl's law? Third, does enabling such vectorization have some effect on the execution of the serial and non-vector code?
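To make the second question concrete, the following minimal sketch evaluates Amdahl's law for a few vectorized fractions. The function name and the fractions are illustrative assumptions and not measurements from x265; the 2x figure is the ideal doubling of width over Intel AVX2 and ignores any change in clock frequency.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only part of the run time is accelerated.

    vector_fraction: share of the original run time spent in code that
                     benefits from the wider vectors (0.0 to 1.0)
    vector_speedup:  speedup of that share alone (2.0 = twice as fast)
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0:
        raise ValueError("vector_speedup must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


# Hypothetical fractions, chosen only to show the trend.
for fraction in (0.25, 0.5, 0.9):
    overall = amdahl_speedup(fraction, 2.0)
    print(f"{fraction:.0%} vectorized -> {overall:.2f}x overall")
```

Even an ideal 2x kernel speedup yields well under 2x overall unless almost all of the run time is spent in vectorized code, which is why the size of the vectorizable fraction matters.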

Acceleration of x265 Kernels with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

As a first step in acceleration, we selected the kernels from x265 to be accelerated and rewrote them using handwritten Intel AVX-512 assembly. While automated tools that generate vectorized SIMD code are available, we found that handwritten assembly outperforms auto-vectorizing tools, which convinced us to use this technique. This section details how this was done and the gains in cycle count that we observed from these kernels for sample runs in the main and main10 profiles.

Selecting the kernels to accelerate

We selected over 1,000 kernels from the core compute of x265 to optimize with Intel AVX-512 instructions for the main and main10 profiles. These kernels were chosen based on their resource requirements. Some kernels require frequent memory access, like the various block-copy and block-fill kernels, while others involve intense computation, like the DCT, iDCT, and quantization kernels. There is also a third class of kernels that involve a combination of both in varying proportions. We found that ensuring that the buffers accessed by the assembly routines were 64-byte aligned reduces cache misses and in general helps Intel AVX-512 kernels. A complete list of the kernels optimized with Intel AVX-512 instructions for the main and main10 profiles is given in Appendix A1 and A2 respectively.
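The effect of alignment can be illustrated with simple address arithmetic. The sketch below assumes a 64-byte cache line and a 64-byte (512-bit) load; the helper names are hypothetical and the addresses are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # cache-line size assumed in this sketch
ZMM_BYTES = 64         # one 512-bit register holds 64 bytes


def is_aligned(address, alignment=CACHE_LINE_BYTES):
    """True if address is a multiple of alignment."""
    return address % alignment == 0


def crosses_cache_line(address, size=ZMM_BYTES, line=CACHE_LINE_BYTES):
    """True if the bytes [address, address + size) span more than one line."""
    first_line = address // line
    last_line = (address + size - 1) // line
    return first_line != last_line


def align_up(address, alignment=CACHE_LINE_BYTES):
    """Round address up to the next multiple of alignment."""
    return (address + alignment - 1) // alignment * alignment


# Made-up addresses: an aligned buffer and one offset by 16 bytes.
for addr in (0x1000, 0x1010):
    print(hex(addr), "aligned:", is_aligned(addr),
          "512-bit load splits:", crosses_cache_line(addr))
```

A full-width 64-byte load touches a single cache line only when its address is a multiple of 64, whereas a 32-byte load has more slack. This is one reason why alignment matters more as the vector width grows.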

Framework to evaluate cycle-count improvements

The x265 encoder implements a sample test bench as a correctness and performance measurement tool for assembly kernels. It accepts valid arguments for a given kernel, invokes the C primitive and the corresponding assembly kernel, and compares both output buffers. It exercises the corner cases for the given input type by using a randomly distributed set of values. Each assembly kernel is called 100 times and checked against the output of its C primitive to ensure correctness. To measure performance improvement, the test bench measures the difference in clock ticks (as reported by the rdtsc instruction) between the assembly kernel and the C kernel over 1,000 runs and reports the average.
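The same verify-then-time pattern can be sketched outside x265. The example below is not the x265 test bench: it is a hypothetical Python analogue in which a plain loop plays the role of the C primitive, a second implementation plays the role of the assembly kernel, and time.perf_counter_ns stands in for rdtsc.

```python
import random
import time


def sad_reference(a, b):
    """Plain-loop sum of absolute differences (role of the C primitive)."""
    total = 0
    for i in range(len(a)):
        total += abs(a[i] - b[i])
    return total


def sad_optimized(a, b):
    """Alternative implementation under test (role of the assembly kernel)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def random_block(rng, n, max_value=255):
    """A block of n random 8-bit sample values."""
    return [rng.randint(0, max_value) for _ in range(n)]


def check_correctness(ref, opt, rng, n=256, trials=100):
    """Run both implementations on random inputs and compare the outputs."""
    for _ in range(trials):
        a, b = random_block(rng, n), random_block(rng, n)
        if ref(a, b) != opt(a, b):
            return False
    return True


def average_ns(kernel, a, b, runs=1000):
    """Average elapsed nanoseconds per call over the given number of runs."""
    start = time.perf_counter_ns()
    for _ in range(runs):
        kernel(a, b)
    return (time.perf_counter_ns() - start) / runs


rng = random.Random(0)  # fixed seed so a failure can be reproduced
ok = check_correctness(sad_reference, sad_optimized, rng)
a, b = random_block(rng, 256), random_block(rng, 256)
ref_ns = average_ns(sad_reference, a, b)
opt_ns = average_ns(sad_optimized, a, b)
print(f"correct={ok} reference={ref_ns:.0f} ns optimized={opt_ns:.0f} ns")
```

Checking correctness before timing keeps a fast but wrong kernel from being reported as a gain, and averaging over many runs smooths out timer noise.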

Cycle-count improvements for kernels in the main and main10 profiles

Figure 2 shows the cycle-count improvements for each of the 500 kernels in the main profile and the 600+ kernels in the main10 profile that were accelerated with Intel AVX-512. In each curve, the kernels are sorted in increasing order of their cycle-count gains over the corresponding Intel AVX2 implementation. Appendix A details the per-kernel gains over Intel AVX2 in cycle counts.

On average, we saw a 33 percent and a 40 percent gain in cycle count over the Intel AVX2 kernels for kernels in the main and main10 profiles respectively. The reason for the higher gains in main10 is as follows. In the main10 profile, x265 uses 16 bits to represent each pixel, as opposed to the main profile, which uses 8 bits; although main10 technically only needs 10 bits, using 16 bits simplifies all data structures in the software. Therefore, the amount of work that has to be done for the same number of pixels is doubled. Due to the higher quantum of compute, kernels in the main10 profile gain more from Intel AVX-512 over Intel AVX2 than kernels in the main profile do. These cycle-count results indicate that at the kernel level, there is much benefit in using Intel AVX-512 to accelerate x265. However, they do not account for the reduction in clock frequency incurred when using Intel AVX-512 instructions compared to using Intel AVX2 instructions. In the next section, we look at the effect on overall encoding time, which also accounts for this effect.
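The interplay between cycle-count gains and clock frequency can be captured in a simple model. The sketch below rests on two assumptions that are ours and not the paper's: a "gain" of g means the accelerated kernels need (1 - g) times the cycles they needed with Intel AVX2, and the whole encoder runs at the reduced frequency while Intel AVX-512 is enabled. The 50 percent kernel share is a hypothetical value.

```python
def relative_encode_time(kernel_fraction, cycle_gain, freq_ratio):
    """Run time with AVX-512 relative to AVX2 (below 1.0 means faster).

    kernel_fraction: share of AVX2 cycles spent in the accelerated kernels
    cycle_gain:      fraction of cycles those kernels save (0.40 = 40 percent)
    freq_ratio:      AVX-512 clock frequency divided by AVX2 clock frequency
    """
    relative_cycles = (1.0 - kernel_fraction) + kernel_fraction * (1.0 - cycle_gain)
    return relative_cycles / freq_ratio


def break_even_freq_ratio(kernel_fraction, cycle_gain):
    """Lowest frequency ratio at which AVX-512 is not slower than AVX2."""
    return 1.0 - kernel_fraction * cycle_gain


# Hypothetical encoder that spends half of its cycles in accelerated kernels,
# combined with the average kernel gains reported for main and main10.
for profile, gain in (("main", 0.33), ("main10", 0.40)):
    ratio = break_even_freq_ratio(0.5, gain)
    print(f"{profile}: break-even frequency ratio {ratio:.3f}")
```

Under this model, the larger the share of cycles spent in accelerated kernels and the larger their gain, the more frequency reduction the encoder can absorb before Intel AVX-512 becomes a net loss.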

Accelerating x265 Encoding with Intel Advanced Vector Extensions 512

In this section, we look at the impact of using Intel AVX-512 kernels for real encoding use cases with x265. Section "Test Setup" describes our test setup including the videos chosen, the x265 presets used, and the system configurations of the test machines. Section "Encoding on Intel® Core™ Processors" presents results on a workstation machine with an Intel Core i9-7900X processor, while section "Encoding on Intel Xeon Scalable Processors" presents results on a typical high-end server that has two Intel Xeon Platinum 8180 processors.

Test setup

Our tests mainly focused on encoding 1080p videos with the main profile and 4K videos with the main10 profile. We used four typical 1080p clips (crowdrun, ducks_take_off, park_joy, and old_town_cross), and three 4K clips (Netflix_Boat, Netflix_FoodMarket, and Netflix_Tango) for our tests.¹⁰ Appendix B gives a little more detail, along with screenshots of the videos used. We encode the 1080p clips in the main profile at the following bitrates (in Kbps): 1000, 3000, 5000, and 7000. For the 4K clips, the main10 encodes target the following bitrates (in Kbps): 8000, 10000, 12000, and 14000.

We encode these videos with a version of x265 that has all the kernels described in Section 3; these kernels were contributed as part of the default branch of x265. The kernels are disabled by default and may be enabled with the --asm avx512 option in the x265 command-line interface.

Figure 2. Cycle-count gains of the main and main10 profile Intel® Advanced Vector Extensions 512 kernels over the corresponding Intel® Advanced Vector Extensions 2 kernels.

We focused our experiments on four presets of x265 to represent the wide set of use cases that x265 supports: ultrafast, veryfast, medium, and veryslow. These presets represent a wide variety of trade-offs between encode efficiency and frames per second (FPS). The veryslow preset generates the most efficient encode but is the slowest; this preset is also the preferred choice for offline encoding use cases such as OTT. The ultrafast preset is the quickest setting of x265 but generates the encode with the lowest efficiency. The veryfast and medium presets represent intermediate points in the trade-off between performance and encoder efficiency. Typically, the more efficient presets employ more tools of HEVC, resulting in more compute per pixel than the less efficient presets. This is important to call out because Intel AVX-512 kernels tend to give better speedup when the compute per pixel is higher, as shown by the results in the previous section.

Encoding on Intel® Core™ Processors

Figure 3 shows the performance of encoding 1080p and 4K video in main and main10 profile with Intel AVX-512 kernels relative to using Intel AVX2 kernels on a workstation-like configuration with an Intel Core i9-7900X processor using a single instance of x265. The full details of the system configuration are described in Appendix C. The single instance results in high utilization of the CPU across all configurations, representing a typical use case for this system when performing HEVC encoding.

Intel® Core™ i9-7900X Processor
Figure 3. Encoder performance from using Intel® Advanced Vector Extensions 512 kernels on a single instance of x265, as measured on a workstation-like system with an Intel® Core™ i9-7900X processor.

From the results, we see that for all profiles and presets, enabling Intel AVX-512 kernels results in positive performance gains. On the Intel Core i9-7900X processor system, our measurements did not indicate any significant reduction in clock frequency. The cycle-count improvements from the kernels therefore translate directly into increased encoder performance. When we examined the relative encoder performance for each encode, we found no command lines that demonstrated lower performance with Intel AVX-512 than with Intel AVX2.

We therefore recommend that for the Intel Core i9-7900X processor, and similar systems where the frequency reduction is minimal, Intel AVX-512 kernels be enabled for all encoding profiles and resolutions when using x265.

Encoding on Intel Xeon Scalable Processors

In this section, we present results from using x265 accelerated by Intel AVX-512 on a high-end server configuration with two Intel Xeon Platinum 8180 processors arranged in a dual-socket configuration with 28 hyperthreaded cores per CPU. For full details of the system configuration, refer to Appendix C.

x265 single instance performance using 8 threads and 16 threads

Figure 4 shows the performance of a single instance of x265 with kernels that use Intel AVX-512 for encoding 1080p videos in the main profile and 4K videos in the main10 profile relative to using kernels that only use Intel AVX2 instructions. Two configurations, one with 8 threads per instance and another with 16 threads per instance, are shown in the graph to understand the impact of increasing the number of active cores on the CPU; limiting the number of threads for each instance is done using the --pools option of the x265 library.
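A single-instance run of the kind compared here could be launched with a command line like the one built below. This is a sketch under stated assumptions: the helper name and file names are hypothetical, the input is assumed to be a Y4M file (so resolution and frame rate need not be passed), and main10 encoding is assumed to be supported by the installed x265 build. The function only builds the argument list and does not run the encoder.

```python
def x265_command(input_path, output_path, preset, profile, bitrate_kbps,
                 use_avx512=False, pool_threads=None):
    """Build an x265 command line as a list of arguments (nothing is run)."""
    cmd = ["x265", "--input", input_path,
           "--preset", preset,
           "--profile", profile,
           "--bitrate", str(bitrate_kbps)]
    if use_avx512:
        # AVX-512 kernels are off by default and must be requested.
        cmd += ["--asm", "avx512"]
    if pool_threads is not None:
        # Limit the worker thread pool of this instance.
        cmd += ["--pools", str(pool_threads)]
    cmd += ["--output", output_path]
    return cmd


# Hypothetical file names; 8 threads with AVX-512 kernels enabled.
cmd = x265_command("Netflix_Boat.y4m", "boat_avx512.hevc",
                   preset="veryslow", profile="main10", bitrate_kbps=8000,
                   use_avx512=True, pool_threads=8)
print(" ".join(cmd))
```

Running the same command without the --asm avx512 arguments gives the baseline to compare against, since the Intel AVX-512 kernels are disabled by default.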

The figure shows that for a given thread configuration, the gains when encoding 4K content in the main10 profile are higher than for the 1080p content in the main profile. Also, for a given resolution and profile, the gains that we see from the presets that have more work per pixel (the more efficient presets, like veryslow) are higher than those from the faster presets; in fact, for 1080p content in the main profile, we see an average performance loss. These gains are consistent with previously observed results that demonstrate that the more work per pixel a specific configuration has, the more it benefits from Intel AVX-512. Additionally, when we investigated the S-curves of these profiles (not shown here for brevity), we saw that several encoder command lines outside the 4K main10 veryslow setting lost performance over Intel AVX2.

We therefore recommend using Intel AVX-512-enabled kernels only when doing 4K encodes in the main10 profile with the veryslow preset. For other presets and encoder settings, the amount of work per pixel is insufficient for the cycle-count gains to offset the reduction in clock frequency.

One additional observation we can make from Figure 4 is that the performance gains are in general higher across the board when using 8 threads for the single instance of x265 than when using 16 threads. Upon further analysis, we observe that when more cores are activated with Intel AVX-512 instructions in the Intel Xeon Platinum 8180 processor, the frequency reduces further, resulting in lower gains from using Intel AVX-512 instructions. In a typical server, however, encoder vendors attempt to use all available CPU cores to get the maximum throughput out of the given server. This use case is explored in Section 4.3.2, where we attempt to saturate the server with 4K main10 encodes to see whether the lower frequency when more cores are activated mutes the gains.

Intel® Xeon® Platinum 8180 Processor
Figure 4. Relative performance of a single instance of x265 when using Intel® Advanced Vector Extensions 512 kernels with 8 or 16 threads over Intel® Advanced Vector Extensions 2 kernels on a server configuration with two Intel® Xeon® Platinum 8180 processors.

Saturating Intel® Xeon® Platinum 8180 processors using multiple instances of x265

To study whether activating more cores results in performance loss for 4K encodes in the main10 profile, we saturated one and both CPUs of a dual-socket Intel Xeon Platinum 8180 processor-based server with four and eight instances of x265, respectively, with each instance using 16 threads. We measured the total FPS achieved by all x265 instances to encode the same clip at different bitrates when using kernels that use Intel AVX-512 and reported the number relative to when the Intel AVX2-enabled kernels were used. Figure 5 shows these results.
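The relative number reported for a saturated server can be computed as in the sketch below. The helper name is hypothetical, and the per-instance FPS values are placeholders chosen for easy arithmetic, not measured results.

```python
def relative_throughput(fps_avx512, fps_avx2):
    """Total FPS of all AVX-512 instances over total FPS of all AVX2 instances."""
    if len(fps_avx512) != len(fps_avx2):
        raise ValueError("both runs must use the same number of instances")
    return sum(fps_avx512) / sum(fps_avx2)


# Placeholder values for four concurrent instances; not measured results.
example = relative_throughput([5.5, 5.5, 5.5, 5.5], [5.0, 5.0, 5.0, 5.0])
print(f"relative throughput: {example:.2f}")
```

Summing across instances before dividing compares whole-server throughput, which is the quantity that matters when all cores are kept busy.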

Intel® Xeon® Platinum 8180 processor - Single and Dual Socket Saturation
Figure 5. Single-socket and dual-socket saturation of the Intel® Xeon® Platinum 8180 processor with x265 instances.

Figure 5 shows that even when saturating one or both CPUs, encoding 4K videos with main10 shows positive performance gains over using the Intel AVX2 counterparts. However, the gains are lower than the corresponding gains achieved with a single instance of x265 that uses fewer cores. Additionally, we observe that for lower efficiency presets such as veryfast and medium, the gains are muted due to the higher frequency drop with more active cores.

These results reiterate our recommendation that Intel AVX-512 kernels should only be enabled when encoding 4K content for the main10 profile for the veryslow preset. For other presets that have lower compute per pixel, enabling Intel AVX-512 kernels may result in a performance loss over using Intel AVX2 kernels.
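The recommendations from this section can be written down as a small decision helper. This is a hypothetical sketch: the function name and the platform labels are ours, and the rule only encodes the configurations measured here, so it should not be read as a general guarantee for other processors or encoder settings.

```python
def recommend_avx512(platform, resolution, profile, preset):
    """Return True when the results above suggest enabling AVX-512 kernels.

    platform: "workstation" (Intel Core i9-7900X class, little frequency
              reduction) or "server" (Intel Xeon Platinum 8180 class)
    """
    if platform == "workstation":
        # No command line lost performance on the workstation system.
        return True
    if platform == "server":
        # Only the highest compute-per-pixel configuration gained reliably.
        return (resolution, profile, preset) == ("4K", "main10", "veryslow")
    raise ValueError(f"unknown platform: {platform!r}")


for config in (("workstation", "1080p", "main", "ultrafast"),
               ("server", "4K", "main10", "veryslow"),
               ("server", "1080p", "main", "veryslow")):
    print(config, "->", recommend_avx512(*config))
```

The key takeaways also mention the slower preset; it is left out of this sketch because the measurements in this section cover only the ultrafast, veryfast, medium, and veryslow presets.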

Conclusions and Future Work

In this paper, we presented our experience with using the Intel AVX-512 instructions available in the newly introduced Intel Xeon Scalable processors to accelerate the open-source HEVC encoder x265. The specific challenges that we had to overcome included selecting the right kernels to accelerate with Intel AVX-512 such that the reduction in CPU frequency was offset by the benefits in cycle count, and choosing the right encoder configuration that provided enough compute per pixel to achieve positive gains in encoder performance.

Recommendations

Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:

When choosing specific kernels to accelerate with Intel AVX-512, consider each kernel's compute-to-memory ratio. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64 bytes in order to avoid loads that cross cache-line boundaries.
For desktop and workstation SKUs (like the Intel Core i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations because the reduction in CPU clock frequency is rather low.
For server SKUs (like the Intel Xeon Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.

While the results and recommendations presented in this paper are subject to the limitations of our evaluations and experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.

Future work

The task of accelerating x265 with Intel AVX-512 has opened several avenues for future work. The accelerated kernels are available through the public mailing list. Future extensions of this work to enable further acceleration from Intel AVX-512 include (1) performing a thorough analysis of the use of Intel AVX-512 for videos at other resolutions and presets available in x265, (2) enabling schemes to dynamically enable and disable Intel AVX-512 kernels by monitoring the CPU frequency, and (3) a fundamental re-architecting of the encoder to segregate the worker threads into different types of threads, only some of which may run Intel AVX-512, limiting the number of cores where the CPU frequency drop is observed. We will continue to develop and contribute these solutions to open source, and encourage the reader to also contribute to the project at http://x265.org.

Acknowledgements

This work was funded in part by a non-recurring engineering grant from Intel to MulticoreWare. We would like to thank the various developers and engineers at MulticoreWare for their extensive support throughout this work. In particular, we would like to thank Thomas A. Vaughan for his guidance and Min Chen for his expert comments on the assembly patches.

Appendix A

A1 – Main profile instructions per cycle (IPC) gains

Primitive	IPC Gain	Primitive	IPC Gain	Primitive	IPC Gain	Primitive	IPC Gain
sad	0.16%	i422 chroma_vss	32.70%	i420 chroma_vpp	23.19%	luma_vss	43.18%
pixelavg_pp	0.87%	luma_vss	32.89%	addAvg	23.37%	luma_vss	43.35%
i444 chroma_vps	1.14%	sad_x3	33.01%	addAvg	23.38%	i444 chroma_hpp	43.43%
i444 chroma_vps	1.18%	luma_vps	33.05%	i444 chroma_hps	23.53%	ssd_s	43.57%
pixelavg_pp	1.41%	i420 chroma_hpp	33.08%	i420 chroma_hps	23.77%	luma_hps	43.68%
convert_p2s	1.95%	i444 chroma_hpp	33.14%	var	23.95%	luma_vss	43.75%
i420 chroma_vps	2.45%	sad_x4	33.14%	i420 chroma_hpp	24.03%	luma_hps	43.84%
i420 chroma_vps	2.72%	i444 chroma_vss	33.16%	i422 chroma_vpp	24.11%	luma_hps	43.94%
i422 chroma_hps	2.83%	i420 chroma_vss	33.16%	i444 chroma_vss	24.15%	luma_vsp	44.06%
i420 p2s	3.21%	copy_ps	33.33%	i422 chroma_vss	24.15%	luma_vsp	44.11%
i444 p2s	3.21%	i420 copy_ps	33.33%	i420 chroma_vss	24.15%	sub_ps	44.11%
sad_x3	3.29%	i444 chroma_vss	33.34%	i420 chroma_vps	24.20%	i444 chroma_hpp	44.15%
i420 chroma_vps	3.62%	i422 chroma_vss	33.34%	i444 chroma_vpp	24.20%	convert_p2s	44.33%
sad_x4	4.50%	i420 chroma_vss	33.34%	i420 chroma_vpp	24.20%	i444 chroma_hpp	44.35%
sad	4.62%	i422 copy_ps	33.43%	sad	24.21%	luma_vss	44.42%
i420 chroma_hps	4.90%	i444 chroma_vss	33.43%	i444 chroma_vps	24.22%	luma_hps	44.43%
i420 chroma_hps	5.19%	i422 chroma_vss	33.43%	i420 chroma_vps	24.22%	luma_hpp	44.48%
pixel_satd	5.42%	i420 chroma_hpp	33.55%	i444 chroma_hps	24.25%	luma_vpp	44.54%
i444 chroma_vps	5.43%	i422 chroma_hpp	33.57%	i420 chroma_hpp	24.42%	luma_vss	44.61%
i422 chroma_hps	5.82%	dequant_normal	33.60%	sad_x4	24.53%	cpy1Dto2D_shl	44.61%
i444 chroma_vps	6.78%	sad_x4	33.62%	i444 chroma_hps	24.57%	luma_vsp	44.62%
dct	7.06%	i444 chroma_vss	33.89%	i422 chroma_hps	24.65%	luma_vsp	44.66%
i444 chroma_hps	7.08%	i420 chroma_vss	33.89%	psyCost_pp	24.89%	luma_vss	44.70%
i444 chroma_hps	7.26%	sad_x3	33.92%	i422 chroma_vps	25.00%	luma_vpp	44.74%
i422 chroma_vss	8.85%	i420 pixel_satd	34.01%	i444 chroma_vss	25.17%	luma_vsp	44.85%
luma_vss	9.76%	i444 chroma_hps	34.02%	i422 chroma_vss	25.17%	i422 copy_sp	45.20%
i422 chroma_hps	10.27%	luma_vps	34.04%	i420 chroma_vss	25.17%	getResidual32	45.24%
i444 chroma_hps	11.00%	i444 chroma_hpp	34.20%	i422 chroma_vps	25.66%	luma_vpp	45.30%
i444 chroma_hps	11.14%	i420 pixel_satd	34.20%	luma_vps	25.82%	luma_hps	45.35%
sad	11.26%	i420 chroma_hpp	34.23%	i444 chroma_vps	25.89%	i444 chroma_hpp	45.41%
i420 chroma_hps	11.38%	i444 chroma_vss	34.43%	i444 chroma_vps	25.92%	luma_hpp	45.49%
pixel_sa8d	11.55%	i422 chroma_vss	34.43%	i420 chroma_hps	25.95%	convert_p2s	45.52%
i444 chroma_hps	11.91%	i420 chroma_vss	34.43%	i420 chroma_vps	26.07%	luma_hps	45.58%
luma_vpp	11.96%	i422 chroma_vsp	34.59%	convert_p2s	26.25%	luma_vpp	45.62%
i422 chroma_hps	12.10%	i444 chroma_vss	34.71%	i422 chroma_vps	26.42%	convert_p2s	45.62%
copy _pp	12.54%	i444 chroma_vss	34.76%	i444 chroma_vps	26.56%	luma_vpp	45.69%
ssd_s	12.58%	addAvg	34.88%	i444 chroma_vss	26.71%	cpy2Dto1D_shl	45.75%
i420 chroma_vps	12.58%	addAvg	35.14%	i422 chroma_vss	26.71%	i422 addAvg	45.76%
i444 chroma_hps	12.79%	sad	35.43%	i420 chroma_vss	26.71%	convert_p2s	46.00%
idct	13.32%	ssd_ss	35.45%	sad_x4	26.80%	i420 add_ps	46.09%
luma_vps	13.78%	i444 chroma_vss	35.51%	i422 chroma_hpp	27.06%	add_ps	46.10%
i444 chroma_hps	13.87%	i420 pixel_satd	35.55%	i422 chroma_hps	27.13%	luma_vsp	46.14%
sad	13.88%	pixelavg _pp	35.56%	luma_hpp	27.15%	luma_hps	46.29%
copy _cnt	14.25%	luma_vpp	35.62%	i420 pixel_satd	27.23%	luma_vss	46.31%
luma_vpp	14.28%	luma_vpp	36.21%	i444 chroma_vss	27.24%	i444 chroma_vsp	46.52%
pixel_satd	14.45%	i420 chroma_hpp	36.45%	i422 chroma_vss	27.24%	i422 chroma_vsp	46.52%
idct	14.49%	i422 chroma_hpp	36.65%	luma_hpp	27.29%	i420 chroma_vsp	46.52%
pixel_satd	14.92%	i422 chroma_hpp	36.76%	luma_vps	27.45%	luma_hps	46.65%
pixel_satd	14.99%	sad	36.76%	psyCost_pp	27.62%	pixelavg _pp	46.67%
sad	15.21%	i422 chroma_hpp	36.81%	luma_vsp	27.72%	luma_vss	46.88%
idct	15.23%	copy _pp	36.82%	i422 chroma_hps	28.00%	i422 addAvg	46.88%
sad_x3	15.32%	pixelavg _pp	36.84%	pixel_satd	28.50%	luma_hps	46.90%
i444 chroma_vpp	15.47%	convert_p2s	36.87%	cpy2Dto1D_shl	28.69%	luma_vsp	46.97%
i422 chroma_vpp	15.47%	i420 p2s	36.87%	luma_vps	28.71%	i422 p2s	47.10%
i420 chroma_vpp	15.47%	i444 p2s	36.87%	i444 chroma_hpp	28.78%	copy _pp	47.11%
pixel_satd	15.52%	i444 chroma_hpp	37.07%	i420 pixel_satd	28.80%	luma_vss	47.64%
pixel_satd	15.62%	luma_vpp	37.11%	i422 pixel_satd	28.81%	i444 chroma_hpp	47.83%
pixel_satd	15.66%	luma_vss	37.49%	i422 pixel_satd	28.95%	i422 addAvg	47.85%
sad_x3	15.70%	addAvg	37.76%	luma_vss	29.26%	luma_hps	48.46%
pixel_satd	15.75%	i444 chroma_vps	37.90%	i444 chroma_vss	29.29%	copy _ps	48.57%
i420 chroma_hps	15.83%	i444 chroma_vss	38.04%	i420 chroma_hps	29.42%	sub_ps	48.83%
copy _pp	15.93%	i444 chroma_vps	38.05%	luma_vpp	29.43%	luma_hpp	48.97%
luma_vpp	16.10%	i444 chroma_vps	38.23%	scale1D_128to64	29.50%	i422 add_ps	49.02%
nquant	16.33%	sad	38.42%	luma_vss	29.59%	i444 chroma_vsp	49.43%
sad	16.35%	i444 chroma_hpp	38.45%	i444 chroma_vpp	29.69%	i420 sub_ps	49.46%
i444 chroma_vpp	16.39%	Weight_sp	38.48%	i422 chroma_vpp	29.69%	add_ps	49.50%
i420 chroma_hps	16.60%	i444 chroma_hpp	38.55%	i420 chroma_vpp	29.69%	i422 sub_ps	49.52%
i444 chroma_vpp	17.02%	sad	38.56%	i422 chroma_hps	29.71%	i420 addAvg	49.74%
i422 chroma_vpp	17.02%	luma_hpp	38.79%	i422 pixel_satd	29.75%	convert_p2s	49.75%
i420 chroma_vpp	17.02%	pixel_satd	39.15%	i444 chroma_vpp	29.82%	i422 p2s	49.75%
pixel_satd	17.08%	luma_hpp	39.21%	i422 chroma_vpp	29.82%	i444 p2s	49.75%
luma_vps	17.10%	i444 chroma_hpp	39.30%	luma_vss	29.91%	luma_vss	49.84%
luma_vps	17.36%	i444 chroma_vps	39.39%	i444 chroma_vss	29.92%	luma_hpp	50.00%
i444 chroma_vss	17.55%	addAvg	39.51%	i422 chroma_vss	29.92%	copy _sp	50.11%
i420 chroma_vss	17.55%	i420 chroma_hpp	39.55%	i420 chroma_vss	29.92%	luma_vss	50.22%
pixel_satd	17.59%	i422 pixel_satd	39.57%	luma_vps	30.19%	luma_hpp	50.61%
pixel_satd	17.66%	i422 chroma_hpp	39.61%	sad_x4	30.24%	luma_hpp	51.19%
i444 chroma_vss	18.42%	convert_p2s	39.78%	sad	30.30%	i444 chroma_vsp	51.23%
i422 chroma_vss	18.42%	i420 p2s	39.78%	luma_vps	30.37%	luma_hpp	51.70%
i420 chroma_vss	18.42%	i422 p2s	39.78%	luma_vps	30.39%	nonPsyRdoQuant	51.74%
i444 chroma_vpp	18.49%	i444 p2s	39.78%	i444 chroma_vpp	30.39%	i444 chroma_vsp	52.08%
i420 chroma_vpp	18.49%	copy _sp	39.93%	i422 chroma_vpp	30.39%	copy _pp	52.17%
luma_vps	18.50%	i420 addAvg	40.02%	i420 chroma_vpp	30.39%	i444 chroma_vsp	52.22%
luma_vpp	18.51%	luma_hps	40.04%	ssd_ss	30.44%	i444 chroma_vsp	52.28%
sad_x3	18.99%	i444 chroma_hpp	40.07%	i422 chroma_hpp	30.45%	nonPsyRdoQuant	52.32%
copy _pp	19.76%	addAvg	40.64%	i420 pixel_satd	30.53%	i422 copy _ss	52.45%
luma_vss	19.80%	luma_vsp	40.87%	i422 chroma_vpp	30.54%	nonPsyRdoQuant	52.56%
pixel_satd	19.89%	i444 chroma_vsp	40.96%	i444 chroma_hpp	30.54%	i444 chroma_vsp	52.77%
sad	20.09%	i420 chroma_vsp	40.96%	i422 chroma_hpp	30.56%	i422 chroma_vsp	52.77%
sad_x3	20.26%	luma_vss	41.01%	i444 chroma_hpp	30.63%	blockfill_s	52.93%
i444 chroma_hps	20.52%	i420 copy _sp	41.12%	i420 chroma_hpp	30.85%	i444 chroma_vsp	53.30%
i420 chroma_hps	20.80%	copy _cnt	41.14%	luma_vsp	30.95%	i422 chroma_vsp	53.30%
psyCost_pp	21.15%	luma_vsp	41.16%	sad_x4	30.95%	i420 chroma_vsp	53.30%
i444 chroma_hps	21.17%	Weight_pp	41.23%	i422 chroma_vss	30.99%	i422 chroma_vsp	53.36%
pixel_satd	21.19%	luma_hps	41.42%	i444 chroma_hps	31.12%	i444 chroma_vsp	54.34%
pixel_satd	21.21%	addAvg	41.84%	i444 chroma_vpp	31.17%	i422 chroma_vsp	54.34%
quant	21.23%	i420 addAvg	41.87%	i444 chroma_vpp	31.20%	i420 chroma_vsp	54.34%
sad_x3	21.29%	luma_vsp	41.99%	sad	31.29%	psyRdoQuant	54.44%
i444 chroma_vpp	21.42%	luma_hps	42.05%	luma_vsp	31.33%	luma_hpp	54.62%
i422 chroma_vpp	21.42%	convert_p2s	42.13%	sad_x3	31.34%	i444 chroma_vsp	54.64%
i420 chroma_vpp	21.42%	i420 p2s	42.13%	i422 pixel_satd	31.46%	i420 chroma_vsp	54.64%
i420 chroma_vps	21.60%	i422 p2s	42.13%	luma_hps	31.52%	luma_hpp	54.78%
pixel_satd	21.61%	i444 p2s	42.13%	i444 chroma_vpp	31.57%	luma_hpp	55.06%
i444 chroma_vps	21.69%	i444 chroma_vsp	42.31%	pixelavg _pp	31.62%	luma_hpp	55.40%
i422 chroma_hps	21.99%	i422 chroma_vsp	42.31%	luma_vps	31.76%	copy _pp	55.41%
i420 addAvg	22.01%	i420 chroma_vsp	42.31%	i444 chroma_hps	31.78%	psyRdoQuant	55.70%
luma_vsp	22.09%	luma_vsp	42.35%	sad_x3	31.95%	psyRdoQuant	55.72%
i444 chroma_vps	22.27%	i420 chroma_hpp	42.43%	i444 chroma_vss	31.96%	var	55.75%
i422 chroma_vps	22.41%	nonPsyRdoQuant	42.51%	i420 chroma_vss	31.96%	copy _ss	56.00%
sad_x4	22.44%	luma_hps	42.54%	i422 chroma_vss	32.01%	i444 chroma_vsp	56.36%
var	22.51%	addAvg	42.56%	i444 chroma_hpp	32.12%	i422 chroma_vsp	56.36%
i444 chroma_vpp	22.64%	luma_hps	42.58%	var	32.17%	i420 chroma_vsp	56.36%
i420 chroma_vpp	22.64%	luma_vss	42.82%	i420 chroma_hpp	32.32%	i420 copy _ss	56.63%
sad_x4	22.84%	i422 addAvg	42.93%	i444 chroma_hps	32.44%	i444 chroma_vsp	57.60%
i444 chroma_vpp	22.87%	luma_vpp	42.97%	luma_vsp	32.61%	i420 chroma_vsp	57.60%
i422 chroma_vpp	22.87%	dequant_scaling	42.98%	i444 chroma_vss	32.67%	copy _pp	58.33%
i422 chroma_hpp	22.92%	luma_hpp	42.99%	i420 chroma_vss	32.67%	copy _ss	60.09%
sad_x4	23.09%	i444 chroma_vsp	43.05%	i444 chroma_vss	32.69%	psyRdoQuant	62.80%
i444 chroma_vpp	23.19%	i422 chroma_vsp	43.05%	i422 chroma_vss	32.69%	i444 chroma_vsp	62.98%
				i420 chroma_vss	32.69%	i420 chroma_vsp	62.98%

A2 – Main10 profile IPC gains

Primitive	IPC Gain	Primitive	IPC Gain	Primitive	IPC Gain	Primitive	IPC Gain
convert_p2s	1.26%	i422 chroma_hps	39.92%	i422 chroma_vpp	29.64%	i444 chroma_hpp	49.20%
i420 p2s	1.26%	i422 p2s	40.30%	i420 chroma_vpp	29.64%	i444 chroma_hps	49.45%
i444 p2s	1.26%	luma_hpp	40.35%	i444 chroma_vsp	29.82%	cpy2Dto1D_shl	49.70%
addAvg	1.86%	i422 chroma_hpp	40.52%	i422 chroma_vsp	29.82%	luma_hvpp	49.80%
addAvg	6.88%	copy _cnt	40.55%	i420 chroma_vsp	29.82%	luma_vss	49.84%
dct	7.06%	luma_vpp	40.58%	luma_vss	29.91%	i420 chroma_hps	49.85%
sad_x3	7.65%	luma_vsp	40.59%	i444 chroma_vss	29.92%	convert_p2s	49.87%
sad	7.74%	i444 chroma_vps	40.60%	i422 chroma_vss	29.92%	i420 p2s	49.87%
sad	8.29%	i422 chroma_vps	40.60%	i420 chroma_vss	29.92%	i422 p2s	49.87%
i420 addAvg	8.36%	i420 chroma_vps	40.60%	i444 chroma_vps	29.93%	i422 p2s	49.87%
sad_x3	8.77%	sad_x3	40.64%	i422 chroma_vps	29.93%	i444 p2s	49.87%
luma_vss	9.76%	nonPsyRdoQuant	40.70%	i420 chroma_vps	29.93%	luma_hps	49.94%
intra_pred_ang27	9.79%	add_ps	40.71%	luma_vsp	30.06%	i422 chroma_hps	50.07%
cpy2Dto1D_shl	10.13%	sad_x4	40.73%	i444 chroma_vsp	30.11%	i444 chroma_hpp	50.13%
sad_x3	10.81%	luma_vpp	40.73%	i422 chroma_vsp	30.11%	luma_vss	50.22%
sad_x4	10.96%	copy _pp	40.81%	i420 chroma_vsp	30.11%	luma_hpp	50.25%
i420 addAvg	11.05%	i422 chroma_vps	40.88%	pixel_satd	30.30%	i420 chroma_vpp	50.28%
pixel_satd	11.05%	luma_vss	41.01%	i422 pixel_satd	30.30%	luma_hps	50.67%
i420 pixel_satd	11.05%	i444 chroma_vsp	41.02%	i422 pixel_satd	30.35%	addAvg	50.67%
i422 pixel_satd	11.05%	i420 chroma_vsp	41.02%	add_ps	30.69%	i422 addAvg	50.67%
luma_vsp	12.64%	i444 chroma_vsp	41.05%	sad	30.94%	luma_hpp	50.75%
copy _cnt	13.29%	i420 chroma_vsp	41.05%	dequant_normal	31.10%	i420 chroma_hpp	50.82%
idct	13.32%	sad	41.06%	sad	31.37%	copy _pp	50.95%
i444 chroma_vps	14.44%	intra_pred_ang34	41.06%	pixel_satd	31.43%	i422 addAvg	50.99%
i422 chroma_vps	14.44%	convert_p2s	41.09%	i420 pixel_satd	31.43%	luma_hps	51.17%
i420 chroma_vps	14.44%	i444 p2s	41.09%	i422 pixel_satd	31.43%	i422 chroma_hpp	51.22%
idct	14.49%	nonPsyRdoQuant	41.21%	i444 chroma_vpp	31.60%	i444 chroma_hpp	51.37%
i444 chroma_vpp	14.84%	sad_x4	41.22%	i422 chroma_vss	31.76%	luma_hpp	51.48%
idct	15.23%	i422 chroma_vpp	41.25%	i444 chroma_vss	31.96%	luma_hps	51.57%
luma_vsp	15.24%	i420 chroma_vpp	41.25%	i420 chroma_vss	31.96%	copy _ss	51.58%
sad_x3	15.53%	i420 chroma_vpp	41.36%	sad	31.99%	luma_hpp	51.63%
addAvg	15.60%	i444 chroma_vsp	41.40%	psyCost_pp	32.12%	luma_hps	51.64%
i422 chroma_vpp	15.71%	luma_vpp	41.43%	i420 chroma_hps	32.32%	luma_hps	51.65%
i420 chroma_vpp	15.71%	luma_hvpp	41.46%	i422 addAvg	32.46%	luma_hps	51.70%
addAvg	15.90%	luma_vpp	41.48%	i422 chroma_vss	32.62%	luma_hps	51.81%
i422 chroma_vpp	16.07%	i444 chroma_vsp	41.51%	i444 chroma_vss	32.67%	i422 chroma_hpp	51.86%
intra_pred_ang25	16.22%	luma_hvpp	41.54%	i420 chroma_vss	32.67%	luma_hps	51.89%
nquant	16.33%	intra_pred_ang11	41.55%	i444 chroma_vss	32.69%	addAvg	51.89%
sad_x4	16.42%	convert_p2s	41.58%	i422 chroma_vss	32.69%	i420 addAvg	51.89%
luma_vsp	16.55%	sad_x4	41.71%	i420 chroma_vss	32.69%	i422 addAvg	51.89%
i420 addAvg	17.12%	sad_x4	41.71%	luma_vss	32.89%	luma_hps	51.93%
sad_x4	17.33%	luma_vsp	41.78%	i444 chroma_vsp	33.14%	luma_hps	51.99%
i444 chroma_vss	17.55%	sad_x4	41.83%	i422 chroma_vsp	33.14%	i444 chroma_hpp	52.16%
i420 chroma_vss	17.55%	i444 chroma_vsp	42.01%	i444 chroma_vss	33.16%	i422 copy _sp	52.45%
i444 chroma_vps	17.88%	i444 chroma_vsp	42.08%	i420 chroma_vss	33.16%	i422 copy _ps	52.45%
i422 chroma_vps	17.88%	i422 chroma_vsp	42.08%	convert_p2s	33.27%	i422 copy _ss	52.45%
i420 chroma_vps	17.88%	nonPsyRdoQuant	42.13%	i444 chroma_vss	33.34%	i444 chroma_hps	52.94%
pixel_satd	18.02%	pixelavg _pp	42.17%	i422 chroma_vss	33.34%	copy _ss	53.20%
i422 addAvg	18.13%	i422 chroma_vpp	42.20%	i420 chroma_vss	33.34%	i420 chroma_hps	53.22%
i444 chroma_vss	18.42%	i420 chroma_vpp	42.20%	i444 chroma_vss	33.43%	i422 chroma_hps	53.27%
i422 chroma_vss	18.42%	luma_vps	42.30%	i422 chroma_vss	33.43%	i420 chroma_hpp	53.48%
i420 chroma_vss	18.42%	sub_ps	42.52%	pixelavg _pp	33.45%	copy _pp	53.53%
addAvg	19.50%	luma_vsp	42.55%	pixel_satd	33.45%	i422 chroma_hpp	53.81%
i444 chroma_vps	19.54%	luma_hvpp	42.65%	i420 pixel_satd	33.45%	i422 chroma_hpp	53.89%
i422 chroma_vps	19.54%	pixelavg _pp	42.65%	addAvg	33.46%	i444 chroma_hpp	54.31%
i420 chroma_vps	19.54%	luma_vps	42.72%	luma_vsp	33.47%	ssd_ss	54.69%
sad_x3	19.75%	convert_p2s	42.77%	sad_x4	33.51%	i422 chroma_hpp	54.77%
luma_vss	19.80%	luma_vss	42.82%	i444 chroma_vsp	33.79%	i420 chroma_hpp	55.18%
i422 pixel_satd	19.95%	luma_vsp	43.05%	i422 chroma_vsp	33.79%	luma_hpp	55.53%
pixel_satd	20.02%	convert_p2s	43.11%	i420 chroma_vsp	33.79%	i444 chroma_hpp	55.56%
i420 pixel_satd	20.02%	i444 chroma_hpp	43.15%	i444 chroma_vss	33.89%	i444 chroma_hpp	55.78%
i422 pixel_satd	20.02%	luma_vsp	43.17%	i420 chroma_vss	33.89%	i444 chroma_hpp	55.94%
i444 chroma_vps	20.09%	luma_vss	43.18%	luma_vsp	34.08%	luma_hpp	55.96%
i420 chroma_vps	20.09%	luma_vsp	43.22%	sub_ps	34.13%	copy _sp	56.00%
i422 chroma_vss	20.53%	luma_hvpp	43.24%	i444 chroma_vsp	34.18%	copy _ps	56.00%
sad_x4	20.69%	luma_vss	43.35%	i420 chroma_vsp	34.18%	i444 chroma_hpp	56.07%
i444 chroma_vps	20.86%	luma_vsp	43.36%	i444 chroma_vsp	34.22%	luma_hpp	56.16%
i422 chroma_vps	20.86%	i420 chroma_hpp	43.38%	i422 chroma_vsp	34.22%	i420 copy _sp	56.63%
i444 chroma_vpp	20.98%	cpy1Dto2D_shl	43.50%	i420 chroma_vsp	34.22%	i420 copy _ps	56.63%
quant	21.23%	luma_vsp	43.50%	i444 chroma_vss	34.43%	i420 copy _ss	56.63%
i422 chroma_vpp	21.45%	luma_vpp	43.51%	i422 chroma_vss	34.43%	i422 chroma_hpp	57.32%
sad	21.61%	copy _pp	43.54%	i420 chroma_vss	34.43%	i444 chroma_hps	57.33%
i444 chroma_vpp	21.78%	luma_hvpp	43.57%	pixel_satd	34.59%	luma_hpp	57.40%
i444 chroma_vps	22.06%	luma_vpp	43.58%	i444 chroma_vss	34.71%	i420 chroma_hps	57.97%
i420 chroma_vps	22.06%	luma_hvpp	43.60%	i444 chroma_vss	34.76%	luma_hpp	58.55%
i444 chroma_vsp	22.12%	luma_vss	43.75%	intra_pred_ang10	34.76%	i444 chroma_hps	59.21%
i422 chroma_vsp	22.12%	luma_vps	43.77%	i444 chroma_vps	34.80%	i420 chroma_hps	59.46%
i420 chroma_vsp	22.12%	i444 chroma_vsp	43.80%	i444 chroma_vps	34.98%	blockfill_s	59.53%
i444 chroma_vsp	22.14%	i420 chroma_vsp	43.80%	luma_vps	35.07%	luma_hpp	59.56%
i422 chroma_vsp	22.14%	pixelavg _pp	43.94%	i444 chroma_vps	35.34%	i422 chroma_hps	59.75%
i420 chroma_vsp	22.14%	psyRdoQuant	44.02%	Weight_pp	35.37%	copy _sp	60.09%
i422 chroma_vpp	22.28%	sad_x3	44.17%	i444 chroma_vss	35.51%	copy _ps	60.09%
i420 chroma_vpp	22.28%	pixelavg _pp	44.23%	luma_vps	35.63%	luma_hps	60.23%
i444 chroma_vpp	22.28%	luma_hvpp	44.24%	i422 chroma_hps	35.68%	psyRdoQuant	60.25%
i422 chroma_vpp	22.35%	luma_hvpp	44.28%	i444 chroma_vps	36.38%	luma_hpp	60.26%
ssd_ss	22.60%	luma_vsp	44.31%	i422 chroma_vss	36.56%	i444 chroma_hps	60.28%
i444 chroma_vpp	23.06%	dequant_scaling	44.37%	sad	36.66%	i420 chroma_hps	60.48%
sad_x4	23.09%	convert_p2s	44.40%	luma_vpp	36.68%	luma_hps	60.76%
luma_vpp	23.67%	luma_vpp	44.41%	i444 chroma_vpp	36.70%	copy _pp	60.87%
luma_vpp	23.82%	luma_vss	44.42%	luma_vsp	36.71%	i444 chroma_hps	60.92%
i444 chroma_vpp	23.84%	sad_x4	44.42%	sad_x3	36.75%	i422 chroma_hps	61.09%
i444 chroma_vss	24.15%	luma_vpp	44.60%	sad_x4	36.78%	luma_hpp	61.28%
i422 chroma_vss	24.15%	luma_vss	44.61%	pixel_satd	36.88%	i444 chroma_hpp	61.38%
i420 chroma_vss	24.15%	luma_hvpp	44.61%	i422 chroma_vpp	36.91%	luma_hpp	61.43%
intra_pred_ang9	24.37%	getResidual32	44.64%	copy _pp	36.96%	luma_hpp	61.44%
i444 chroma_vpp	24.41%	luma_hpp	44.68%	addAvg	37.08%	i422 chroma_hps	61.55%
luma_vpp	24.48%	luma_vss	44.70%	sad_x4	37.09%	luma_hpp	61.58%
i422 addAvg	24.62%	luma_hvpp	44.73%	i420 chroma_vpp	37.29%	luma_hpp	62.26%
psyCost_pp	24.88%	i444 chroma_vsp	44.76%	i422 chroma_vpp	37.36%	i422 chroma_hps	62.31%
i420 chroma_vpp	24.90%	i422 chroma_vsp	44.76%	i420 chroma_vpp	37.36%	luma_hpp	62.35%
i422 chroma_vpp	25.11%	i420 chroma_vsp	44.76%	luma_vss	37.49%	i420 chroma_hpp	62.39%
i420 chroma_vpp	25.11%	sad_x4	44.85%	luma_vpp	37.53%	i420 chroma_hps	62.39%
i444 chroma_vps	25.17%	luma_hvpp	45.15%	i444 chroma_vps	37.54%	i444 chroma_hpp	62.46%
i422 chroma_vps	25.17%	luma_vps	45.19%	i422 chroma_vps	37.54%	luma_hpp	62.63%
i420 chroma_vps	25.17%	i422 chroma_hpp	45.23%	i420 chroma_vps	37.54%	i444 chroma_hps	62.88%
i444 chroma_vss	25.17%	intra_pred_dc	45.26%	i444 chroma_vpp	37.59%	i420 chroma_hps	62.95%
i422 chroma_vss	25.17%	sad	45.31%	i420 chroma_vpp	37.59%	luma_hpp	63.07%
i420 chroma_vss	25.17%	luma_vps	45.36%	i444 chroma_vps	37.59%	i444 chroma_hps	63.15%
i422 chroma_vps	25.28%	psyRdoQuant	45.40%	i422 chroma_vps	37.59%	luma_hps	63.16%
i444 chroma_vps	25.97%	i420 add_ps	45.40%	pixel_satd	37.60%	i420 chroma_hpp	63.34%
i422 chroma_vps	25.97%	pixelavg _pp	45.52%	i444 chroma_vps	37.60%	luma_hpp	63.61%
i420 chroma_vps	25.97%	addAvg	45.54%	i420 chroma_vps	37.60%	i420 chroma_hps	63.85%
luma_vpp	26.22%	i420 addAvg	45.54%	i444 chroma_vsp	37.66%	luma_hpp	63.91%
sad	26.25%	i422 addAvg	45.54%	i422 chroma_vps	37.68%	i420 chroma_hpp	64.12%
psyCost_pp	26.30%	i444 chroma_vsp	45.57%	i444 chroma_vpp	37.69%	i444 chroma_hps	64.15%
i444 chroma_vsp	26.38%	i422 chroma_vsp	45.57%	i444 chroma_vps	37.71%	i444 chroma_hpp	64.23%
i420 chroma_vsp	26.38%	i420 chroma_vsp	45.57%	i420 chroma_vps	37.71%	i422 chroma_hpp	64.39%
i420 addAvg	26.39%	luma_vps	45.58%	convert_p2s	37.73%	i422 chroma_hpp	64.56%
i422 addAvg	26.39%	pixelavg _pp	45.61%	i420 p2s	37.73%	i444 chroma_hps	64.84%
pixel_satd	26.62%	luma_vps	45.62%	i422 p2s	37.73%	i422 chroma_hps	64.87%
i444 chroma_vss	26.71%	luma_vps	45.64%	i444 p2s	37.73%	i444 chroma_hpp	64.92%
i422 chroma_vss	26.71%	sad_x3	45.65%	i444 chroma_vpp	37.74%	i420 chroma_hps	64.93%
i420 chroma_vss	26.71%	i422 add_ps	45.68%	i444 chroma_vpp	37.76%	i422 chroma_hpp	65.05%
luma_vsp	26.77%	addAvg	45.72%	addAvg	37.80%	i444 chroma_hps	65.06%
luma_vps	27.04%	i420 addAvg	45.72%	i422 chroma_vpp	37.99%	i420 chroma_hpp	65.14%
luma_vpp	27.10%	pixelavg _pp	45.80%	i444 chroma_vss	38.04%	i422 chroma_hps	65.35%
i444 chroma_vss	27.24%	i444 chroma_hpp	45.95%	i420 chroma_hpp	38.04%	i422 chroma_hps	65.63%
i422 chroma_vss	27.24%	psyRdoQuant	45.96%	luma_vps	38.08%	i444 chroma_hps	65.72%
i422 chroma_vps	27.26%	luma_vsp	45.97%	i444 chroma_vpp	38.09%	i422 chroma_hpp	65.80%
i420 addAvg	27.28%	sad	46.04%	i444 chroma_vpp	38.27%	i444 chroma_hpp	65.88%
i422 addAvg	27.28%	luma_hvpp	46.17%	i422 chroma_vpp	38.27%	i420 chroma_hpp	65.92%
addAvg	27.55%	luma_vss	46.31%	i444 chroma_hps	38.30%	i420 chroma_hpp	65.94%
i422 chroma_vpp	27.71%	sad_x3	46.36%	intra_pred_ang2	38.34%	i444 chroma_hps	66.03%
i420 chroma_vpp	27.71%	sad_x3	46.42%	i444 chroma_hps	38.37%	i422 chroma_hps	66.03%
pixel_satd	27.93%	luma_vps	46.44%	i444 chroma_vpp	38.48%	i420 chroma_hps	66.15%
ssd_s	28.04%	luma_hpp	46.46%	copy _pp	38.51%	i422 chroma_hpp	66.20%
pixel_satd	28.10%	i444 chroma_vsp	46.66%	addAvg	38.54%	i422 chroma_hps	66.20%
pixelavg _pp	28.47%	sad_x3	46.71%	nonPsyRdoQuant	38.57%	i420 chroma_hps	66.29%
i420 pixel_satd	28.54%	luma_hpp	46.82%	sad_x3	38.74%	i422 chroma_hpp	66.32%
i422 pixel_satd	28.54%	luma_vss	46.88%	sad_x3	38.80%	i444 chroma_hpp	66.38%
pixel_satd	28.56%	i422 chroma_hps	46.99%	sad	38.84%	i444 chroma_vpp	66.41%
i420 pixel_satd	28.56%	intra_pred_ang26	47.26%	Weight_sp	38.86%	i444 chroma_hps	66.50%
i422 pixel_satd	28.56%	luma_vps	47.31%	pixel_satd	38.88%	i444 chroma_vpp	66.61%
i444 chroma_vps	28.75%	luma_hvpp	47.44%	i420 pixel_satd	38.88%	i444 chroma_vpp	66.63%
luma_vps	28.78%	pixelavg _pp	47.50%	copy _pp	38.96%	i444 chroma_hps	66.64%
luma_vps	28.82%	luma_vss	47.64%	i422 sub_ps	39.19%	i444 chroma_hpp	66.64%
i422 chroma_hps	28.86%	luma_vps	47.69%	i420 sub_ps	39.34%	i420 chroma_hpp	66.64%
i420 chroma_hps	29.02%	i420 chroma_hpp	47.78%	i420 chroma_hps	39.47%	i420 chroma_hpp	66.65%
sad_x3	29.04%	i422 chroma_hps	47.82%	luma_vpp	39.54%	i444 chroma_hps	66.71%
i444 chroma_hps	29.11%	luma_vsp	47.93%	luma_hvpp	39.63%	i422 chroma_hpp	66.71%
luma_vsp	29.13%	luma_hvpp	48.30%	i444 chroma_vps	39.68%	i444 chroma_hps	66.75%
luma_vss	29.26%	addAvg	48.40%	i420 chroma_vps	39.68%	i444 chroma_hps	66.91%
i444 chroma_vss	29.29%	i420 addAvg	48.40%	luma_hpp	39.72%	i422 chroma_hpp	66.92%
luma_vpp	29.39%	luma_hps	48.96%	addAvg	39.77%	i444 chroma_hpp	67.59%
luma_vss	29.59%	luma_hps	49.05%	convert_p2s	39.79%	i444 chroma_hpp	67.78%
				i420 p2s	39.79%	i420 chroma_hpp	69.14%
				i444 p2s	39.79%	i444 chroma_hpp	69.23%
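
The tables in this appendix place four (Primitive, IPC Gain) pairs on each tab-separated row, and the final rows leave the first columns empty. As a minimal sketch for readers who want to summarize the data, the snippet below flattens such rows into a single list. The function name `parse_ipc_rows` is hypothetical and not part of x265; the two sample rows are the last full row and the first short row of table A2.

```python
from statistics import median

def parse_ipc_rows(text):
    """Flatten tab-separated rows of (primitive, gain%) pairs into one list."""
    pairs = []
    for line in text.splitlines():
        cells = line.split("\t")
        # Cells alternate name, gain; empty padding cells and the header row are skipped.
        for name, gain in zip(cells[0::2], cells[1::2]):
            name, gain = name.strip(), gain.strip()
            if name and gain.endswith("%"):
                pairs.append((name, float(gain.rstrip("%"))))
    return pairs

sample = (
    "luma_vss\t29.59%\tluma_hps\t49.05%\tconvert_p2s\t39.79%\ti444 chroma_hpp\t67.78%\n"
    "\t\t\t\ti420 p2s\t39.79%\ti420 chroma_hpp\t69.14%\n"
)
pairs = parse_ipc_rows(sample)
gains = [gain for _, gain in pairs]
print(len(pairs), max(gains), median(gains))
```

The same primitive name appears on many rows, so the result is kept as a list of pairs rather than a dictionary, which would silently drop the repeated entries.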

Appendix B

1080p Test Clips and Bitrates Used

The following 1080p clips were used for generating test results.

passerby in a verdant sunny park
park_joy_1080p.y4m

large crowd of joggers in a park
crowd_run_1080p50.y4m

ducks lollygagging in a blue pond
ducks_take_off_1080p50.y4m

Urban landscape of old European city
old_town_cross_1080p50.y4m

4k Test Clips and Bitrates Used

The following 4k clips were used for generating test results.

vacation panorama
Netflix_Boat_4096x2160_60fps_10bit_420.y4m

Tango aficionados
Netflix_Tango_4096x2160_60fps_10bit_420.y4m

a rural open market
Netflix_FoodMarket_4096x2160_60fps_10bit_420.y4m

Appendix C

Configurations for Testing on Intel® Core™ i7-4500U Processor
System Attribute	Value
OS Name	Windows 10 Professional
Version	10.0.16299 Build 16299
System Model	MS-7A93
System Type	x64-based PC
Processor	Intel® Core™ i7-4500U CPU @ 3.30GHz, 3312 MHz, 2 Core(s), 4 Logical Processor(s)
Core(s) per socket:	2
Thread(s) per core:	2
Socket(s):	1
NUMA node(s):	1

BIOS
BIOS Version/Date	American Megatrends Inc. 1.00, 6/2/2017
SMBIOS Version	3
BIOS Mode	UEFI

Graphic Interface:
Version	PCI-Express
Link Width	x16
Max. Supported	x16

Memory:
Type	DDR3
Channel	1
Size	8 GB
DRAM Frequency	800 MHz
Command Rate (CR)	2T

Configurations for Testing on Intel® Core™ i9-7900X Processor
System Attribute	Value
OS Name	Microsoft Windows 10 Enterprise
Version	10.0.16299 Build 16299
System Model	MS-7A93
System Type	x64-based PC
Processor	Intel® Core™ i9-7900X CPU at 3.30 GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s)
Core(s) per socket:	10
Thread(s) per core:	2
Socket(s):	1
NUMA node(s):	1

BIOS
BIOS Version/Date	American Megatrends Inc. 1.00, 6/2/2017
SMBIOS Version	3
BIOS Mode	UEFI

Graphic Interface:
Version	PCI-Express
Link Width	x16
Max. Supported	x16

Memory:
Type	DDR4
Channel	2
Size	32 GB
DRAM Frequency	1066.8 MHz
Command Rate (CR)	2T

Configurations for Testing on Intel® Xeon® Platinum 8180 Processor
System Attribute	Value
OS Name	CentOS
Version	7.2
System Model	Intel S4PR1SY2B
System Type	x86_64
Processor	Intel® Xeon® Platinum 8180 CPU at 2.50 GHz
Core(s) per socket:	28
Thread(s) per core:	2
Socket(s):	2
NUMA node(s):	2

BIOS
BIOS Version/Date	SE5C620.86B.0X.01.0007.062120172125 / 06/21/2017
SMBIOS Version	2.8
BIOS Mode	UEFI

Graphic Interface:
Version	PCI-Express
Link Width	x16
Max. Supported	x16

Memory:
Type	DDR4
Channel	2
Size	192 GB
DRAM Frequency	1333 MHz
Command Rate (CR)	2T
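
The configuration tables give sockets, cores per socket, and threads per core, and the logical processor count follows as their product. A small sketch of that arithmetic for the two larger test systems follows. The helper name `logical_processors` is illustrative only, and the inputs are taken from the tables above.

```python
def logical_processors(sockets, cores_per_socket, threads_per_core):
    """Logical processors = sockets x cores per socket x threads per core."""
    return sockets * cores_per_socket * threads_per_core

# (sockets, cores per socket, threads per core), as listed in the tables above.
systems = {
    "Intel Core i9-7900X": (1, 10, 2),
    "Intel Xeon Platinum 8180": (2, 28, 2),
}

for name, topology in systems.items():
    print(name, logical_processors(*topology))
# Intel Core i9-7900X 20
# Intel Xeon Platinum 8180 112
```

The dual-socket Intel Xeon Platinum 8180 system therefore exposes 112 logical processors.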

References

David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd Edition, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1998, p. 751.
VideoLAN Organization, x264, The best H.264/AVC encoder. https://www.videolan.org/developers/x264.html
MulticoreWare Inc., x265 HEVC Encoder/H.265 Video Codec. http://x265.org/
G. J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
Intel Corporation, Intel Advanced Vector Extensions 512. https://www.intel.in/content/www/in/en/architecture-and-technology/avx-512-overview.html
Intel Corporation, "Intel® Xeon® Processor Scalable Family Specification Update", February, 2018. https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
x265.org
HandBrake, An Open Source Video Transcoder. https://handbrake.fr/
FFmpeg, A complete, cross-platform solution to record, convert and stream audio and video.
MulticoreWare Inc., "x265 Receives Significant Boost from Intel Xeon Scalable Processor Family." http://x265.org/x265-receives-significant-boost-intel-xeon-scalable-processor-family/


