Accelerating x265 with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

ID 672286
Updated 5/11/2018
Version Latest
Public

Introduction

Motivation

Vector units in CPUs have become the de facto standard for accelerating media and other kernels that exhibit parallelism according to the single instruction, multiple data (SIMD) paradigm.1 These units enable a single register file to be treated as a combination of multiple registers, whose cumulative width equals that of the vector register file. A single instruction can therefore operate in parallel on all data in this vector register, resulting in significant speedups for applications whose data access patterns fit this model. Starting from a 64-bit vector register file that may be treated as eight 8-bit registers in the architecture extended with MMX™ technology, SIMD on Intel® architecture processors has evolved to enable 256-bit register files that allow for 32 parallel 8-bit operations in the Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) generations.

Kernels in media workloads fit this pattern of execution naturally, because the same operation (filtering for example) is uniformly applied across several pixels of a frame. Consequently, several popular open source projects leverage SIMD instructions for code acceleration. The x264 project for Advanced Video Coding (AVC) encoding2 and the x265 project for High Efficiency Video Coding (HEVC) encoding3 are the two widely used media libraries that extensively use multiple generations of SIMD instructions on Intel architecture processors, from MMX technology all the way up to Intel AVX2. As shown in Figure 1, x264 and x265 achieve two times and five times speedup respectively over their corresponding baselines that do not use any SIMD code. The x265 encoder gains more performance from Intel AVX2 when compared to x264, because the quantum of work done per frame is substantially larger for HEVC than for AVC.4

Figure 1. Performance benefit for x264 and x265 from Intel® Advanced Vector Extensions 2 for 1080p encoding with main profile using an Intel® Core™ i7-4500U Processor.

Focus of this whitepaper

The recently released Intel® Xeon® Scalable processors, part of the platform formerly code-named Purley, have introduced the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set.5 Intel AVX-512 instructions are capable of performing two times the number of operations in the same number of cycles as the previous generation Intel AVX2 instruction set. To accommodate this increased throughput, a larger fraction of the die is utilized, resulting in increased power being consumed, when compared to the previous-generation SIMD units. Therefore, certain Intel AVX-512 instructions are expected to cause a higher degradation to CPU clock frequency than others.6 While this reduction in frequency is offset by the increased throughput for the Intel AVX-512 instructions, media kernels continue to rely significantly on SIMD instructions in older generations (because not all kernels benefit from the increased width) and on straight-line C code that is not amenable to SIMD conversion, which may see reduced performance.

This whitepaper presents a case study based on our experience using the Intel AVX-512 SIMD instructions to accelerate the compute intensive kernels of x265. We describe how we offset the reduction in CPU frequency to ensure that the overall encoder achieves positive performance benefits. Through this process, we present recommendations of when we think Intel AVX-512 should be enabled with x265 for HEVC encoding. We also share our experience on when to choose Intel AVX-512 as a vehicle for accelerating media kernels.

Key takeaways

Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:

  • When choosing specific kernels to accelerate with Intel AVX-512, consider each kernel's compute-to-memory ratio. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64 bytes in order to avoid loads that cross cache-line boundaries.
  • For desktop and workstation SKUs (like the Intel® Core™ i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations, because the reduction in CPU clock frequency is rather low.
  • For server SKUs (like the Intel® Xeon® Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.

Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.

While the results and recommendations presented in this paper are subject to the limitations of our evaluations and experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.

The rest of the paper is organized as follows: The "Background" section presents the background relevant to the technical material presented in the paper. "Acceleration of x265 Kernels with Intel Advanced Vector Extensions 512" discusses the choices we made to accelerate specific kernels of x265 and discusses results for the main and main10 profiles. "Accelerating x265 Encoding with Intel Advanced Vector Extensions 512" presents the results for the overall encoder for the main and main10 profiles. Finally, Section 5 provides detailed recommendations for when Intel AVX-512 should be enabled when using x265 and generic recommendations for when Intel AVX-512 should be chosen when accelerating specific kernels. This section also describes future work.

Background

This section presents the relevant background of the concepts presented in this paper. Specifically, section "HEVC Video Encoding" provides the background on HEVC. "x265, an Open Source HEVC Encoder" discusses x265 with specific focus on the existing methods of performance optimizations that it employs. Section "Introduction to the Intel® Xeon® Scalable Processor Platform" presents the relevant background on Intel Xeon Scalable processors, and Section "SIMD Vectorization Using Intel Advanced Vector Extensions 512" discusses in more detail the Intel AVX-512 architecture.

HEVC video encoding

HEVC was ratified as an encoding standard by the JCT-VC (Joint Collaborative Team on Video Coding) in 2013 as a successor to the vastly popular AVC standard.4 The video encoding and decoding processes in HEVC revolve around identifying three units: a coding unit (CU) that represents each block in the picture, a prediction unit (PU) that represents the mode decision, including motion compensated prediction of the CU, and a transform unit (TU) that represents the way in which the generated residual error between the predicted and the actual block is coded.

Initially, a frame is divided into a sequence of its largest non-overlapping coding units, each called a coding tree unit (CTU). A CTU can then be split into multiple CUs with variable sizes of 64x64, 32x32, 16x16, and 8x8 to form a quad-tree. Each CU is then predicted from a set of candidate blocks, which may be in either the same frame or different frames. If the block used for the prediction is in the same frame, the block is said to be intra-predicted, while if it is in a different frame, it is said to be inter-predicted.
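The partitioning above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a 64x64 CTU and the CU sizes listed in the text; the function names are hypothetical and are not part of x265.

```python
CTU_SIZE = 64     # largest CU size named in the text
MIN_CU_SIZE = 8   # smallest CU size named in the text


def ctu_grid(width, height, ctu=CTU_SIZE):
    """Return (columns, rows) of CTUs needed to cover a frame.

    Partial CTUs at the right and bottom edges still count, hence the
    ceiling division.
    """
    cols = -(-width // ctu)
    rows = -(-height // ctu)
    return cols, rows


def cu_sizes(ctu=CTU_SIZE, min_cu=MIN_CU_SIZE):
    """CU edge lengths reachable by repeatedly splitting a CTU into four."""
    sizes = []
    size = ctu
    while size >= min_cu:
        sizes.append(size)
        size //= 2
    return sizes


def max_cus_per_ctu(ctu=CTU_SIZE, min_cu=MIN_CU_SIZE):
    """Number of CUs in a CTU when every quad-tree branch is split fully."""
    return (ctu // min_cu) ** 2
```

Under these assumptions a 1920x1080 frame needs a 30 x 17 grid of CTUs, a 3840x2160 frame needs 60 x 34, and a fully split CTU contains 64 CUs of size 8x8.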

Intra-predicted blocks are represented by a combination of the prediction block and a mode that denotes the angle of the prediction. The allowed modes for intra-prediction are labeled DC, planar, and angular modes representing various angles from the predicted block. Inter-predicted blocks are represented by a combination of the block used for prediction (the reference block) and the motion vector (MV) that represents the delta between the current and the reference block. Blocks that have zero MV are said to use the merge mode, while others use the AMP (Advanced Motion Prediction) mode. The skip mode is a special case of the merge mode in which the predicted block is identical to the source, that is, there is no residual. The AMP modes may use PUs that are the same size as the CU (denoted as 2Nx2N PUs) or may further partition them (denoted as rectangular and asymmetric PUs) to compute the MVs. The residual generated as the difference between the original and the predicted picture is then quantized and coded using TUs that may vary from 32x32 down to 4x4 blocks, depending on the prediction mode.

The entire process of inter, intra, CU, PU, and TU selection is called Rate-Distortion Optimization (RDO). The goal of RDO is to ensure that distortion is minimized at the target bitrate or the bitrate is minimized at the target quality level as represented by distortion. Throughout the process of RDO, various combinations of CUs, PUs, and TUs are attempted by an encoder, for which it employs several kernels. In this paper, we chose to vectorize these specific kernels by converting them to use Intel AVX-512 instructions.
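The paper does not spell out the cost function that RDO minimizes. A common textbook formulation combines distortion D and rate R into a single Lagrangian cost J = D + λR and picks the candidate with the lowest J. The sketch below illustrates that idea only; the class, function names, and candidate numbers are hypothetical and do not describe x265's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str          # label for a CU/PU/TU choice (illustrative only)
    distortion: float  # e.g. sum of squared errors against the source
    bits: float        # estimated bits needed to code this choice


def rd_cost(candidate, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R (assumed form)."""
    return candidate.distortion + lam * candidate.bits


def best_candidate(candidates, lam):
    """Return the candidate with the lowest rate-distortion cost."""
    return min(candidates, key=lambda c: rd_cost(c, lam))


# Hypothetical numbers: splitting lowers distortion but costs more bits.
unsplit = Candidate("2Nx2N", distortion=1000.0, bits=40.0)
split = Candidate("split", distortion=700.0, bits=120.0)
```

With a small λ (bits are cheap) the split candidate wins; with a large λ (bits are expensive) the unsplit one wins. Every candidate evaluation calls kernels such as SAD, SATD, and DCT, which is why those kernels dominate the encoder's compute.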

HEVC encoding also supports multiple profiles for encoding a video, with each profile representing a different number of bits used to represent each pixel. The main and main10 profiles are popular profiles of HEVC (their AVC counterparts are called main and high profiles respectively). Each component of a pixel is represented with a minimum of 8 bits in the main profile, resulting in values ranging from 0–255. The main10 profile uses 10 bits per component, allowing for a higher range of 0–1023 and enabling the representation of more detail in the encoded video.

x265, an open source HEVC encoder

The x265 encoder is an open-source HEVC encoder that compresses video in compliance with the HEVC standard.7 This encoder has been integrated into several open-source frameworks including VLC*, HandBrake*,8 and FFmpeg9 and is the de facto open-source video encoder for HEVC. The x265 encoder has assembly optimizations for several platforms, including Intel architecture, ARM*, and PowerPC*.

The x265 encoder employs techniques for inter-frame and intra-frame parallelism to deal with the increased complexity of HEVC encoding.10 For inter-frame parallelism, x265 encodes multiple frames in parallel by using system-level software threads. For intra-frame parallelism, x265 relies on the Wavefront Parallel Processing (WPP) tool exposed by the HEVC standard. This feature enables encoding rows of CTUs of a given frame in parallel, while ensuring that the blocks required for intra-prediction from the previous row are completed before the given block starts to encode; as per the standard, this translates to ensuring that the next CTU on the previous row completes before starting the encode of a CTU on the current row. The combination of these features gives a tremendous boost in speed with no loss in efficiency compared to the publicly available reference encoder, HM.
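The WPP row dependency described above can be sketched as a small scheduling model. This is a simplified illustration, not x265's scheduler: it assumes every CTU takes exactly one time unit and that a free thread is always available. The function names are hypothetical.

```python
def wpp_start_times(rows, cols):
    """Earliest start time of each CTU under the wavefront dependency.

    A CTU at (r, c) waits for its left neighbour (r, c - 1) and for the
    CTU one column to the right on the previous row, (r - 1, c + 1).
    For the last column, the previous row's last CTU is used instead.
    """
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ready = 0
            if c > 0:
                ready = max(ready, start[r][c - 1] + 1)
            if r > 0:
                dep = min(c + 1, cols - 1)
                ready = max(ready, start[r - 1][dep] + 1)
            start[r][c] = ready
    return start


def wpp_makespan(rows, cols):
    """Time units until the last CTU of the frame finishes."""
    return wpp_start_times(rows, cols)[-1][-1] + 1
```

In this model each row starts two time units after the row above it. For a 30 x 17 CTU grid (1080p with 64x64 CTUs) the frame finishes in 62 time units, compared with 510 when the CTUs are processed serially.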

Introduction to the Intel® Xeon® processor Scalable family platform

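The Intel® Xeon® processor Scalable family, part of the Intel® platform formerly code-named Purley, is designed to deliver new levels of consistent and breakthrough performance. The platform is based on cutting-edge technology and provides compelling benefits across a broad variety of usage models including big data, artificial intelligence, high-performance computing, enterprise-class IT, cloud, storage, communication, and Internet of Things. Top enhancements include performance for a wide range of workloads with one and a half times the memory bandwidth, integrated network/fabric, and optional integrated accelerators. Our results in x265 indicate a significant gen-over-gen speedup of 50–67 percent for offline encodes when compared to the previous-generation Intel® Xeon® processor E5-2600. This boost comes primarily from the improved microarchitecture features available on Intel Xeon Scalable processors.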

SIMD vectorization using Intel® AVX-512

The Intel AVX-512 vector blocks present a 512-bit register file, allowing 2X parallel data operations per cycle compared to that of Intel AVX2. Though the benefits of vectorizing kernels to use the Intel AVX-512 architecture seem obvious, several key questions must be answered specifically for media workloads before embarking on this task. First, is there sufficient parallelism inherently present in media kernels for them to leverage this increased width? Second, is the fraction of the execution that exploits this parallelism sufficiently large that we can expect meaningful average speedups as per Amdahl's law? Third, does enabling such vectorization affect the execution of the serial and non-vector code?
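The second and third questions can be explored with a back-of-the-envelope model. The sketch below combines Amdahl's law with a single clock-frequency ratio applied to all code. That is a simplification (real frequency behavior depends on the instruction mix and on the number of active cores), and the function name, parameter names, and example values are assumptions rather than measurements from this paper.

```python
def encoder_speedup(vector_fraction, kernel_speedup, freq_ratio=1.0):
    """Estimated whole-encoder speedup from accelerating some kernels.

    vector_fraction: share of baseline run time spent in accelerated kernels
    kernel_speedup:  cycle-count speedup of those kernels (2.0 = half the cycles)
    freq_ratio:      clock frequency with the new kernels / baseline frequency

    With freq_ratio == 1.0 this reduces to plain Amdahl's law.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if kernel_speedup <= 0 or freq_ratio <= 0:
        raise ValueError("kernel_speedup and freq_ratio must be positive")
    relative_cycles = (1.0 - vector_fraction) + vector_fraction / kernel_speedup
    return freq_ratio / relative_cycles
```

For example, if half the run time is in kernels that become twice as fast, the model gives about 1.33x at equal frequency and 1.2x with a 10 percent frequency drop. If only 20 percent of the run time benefits, and by 1.5x, the same 10 percent drop produces a net slowdown (about 0.96x), which is the risk raised by the third question.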

Acceleration of x265 Kernels with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

As a first step in acceleration, we selected the kernels from x265 to be accelerated and rewrote them using handwritten Intel AVX-512 assembly. While automated tools that generate vectorized SIMD code are available, we found that handwritten assembly outperforms auto-vectorizing tools, which convinced us to use this technique. This section details how this was done and the gains in cycle count we observed from these kernels for sample runs in the main and main10 profiles.

Selecting the kernels to accelerate

We selected over 1,000 kernels from the core compute of x265 to optimize with Intel AVX-512 instructions for the main and main10 profiles. These kernels were chosen based on their resource requirements. Some kernels require frequent memory access, like the various block-copy and block-fill kernels, while others involve intense computation, like the DCT, iDCT, and quantization kernels. There is also a third class of kernels that involve a combination of both in varying proportions. We found that ensuring that the buffers accessed by the assembly routines were 64-byte aligned reduces cache misses and in general helps Intel AVX-512 kernels. A complete list of the kernels optimized with Intel AVX-512 instructions for the main and main10 profiles is given in Appendix A1 and A2 respectively.
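The alignment point can be illustrated with address arithmetic. The sketch below assumes 64-byte cache lines and 64-byte (512-bit) loads; the helper names are hypothetical, and the code only models which addresses split across cache lines rather than measuring cache behavior.

```python
CACHE_LINE = 64  # bytes per cache line (assumed)
ZMM_BYTES = 64   # bytes in one 512-bit register


def crosses_cache_line(address, access_bytes=ZMM_BYTES, line=CACHE_LINE):
    """True if an access starting at `address` spans two cache lines."""
    offset = address % line
    return offset + access_bytes > line


def lines_touched(address, access_bytes=ZMM_BYTES, line=CACHE_LINE):
    """Number of cache lines an access reads or writes."""
    first = address // line
    last = (address + access_bytes - 1) // line
    return last - first + 1


def align_up(address, alignment=CACHE_LINE):
    """Round an address up to the next multiple of `alignment`."""
    return (address + alignment - 1) // alignment * alignment
```

A 512-bit load is exactly one cache line wide, so any start address that is not a multiple of 64 touches two lines. A 256-bit load only splits when it starts in the upper half of a line, which is one reason alignment matters more for Intel AVX-512 kernels than for Intel AVX2 kernels.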

Framework to evaluate cycle-count improvements

The x265 encoder implements a sample test bench as a correctness and performance measurement tool for assembly kernels. It accepts valid arguments for a given kernel, invokes the C primitive and the corresponding assembly kernel, and compares both output buffers. It verifies all possible corner cases for the given input type by using a randomly distributed set of values. Each assembly kernel is called 100 times and checked against its C primitive output to ensure correctness. To measure performance improvement, the test bench measures the difference in clock ticks (as reported by the rdtsc instruction) between the assembly kernel and the C kernel over 1,000 runs and reports the average.
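The check-then-measure flow described above can be sketched in a few lines. This is an analogy only: the real test bench is part of x265 and reads rdtsc, whereas this sketch uses a wall-clock timer, a toy SAD kernel, and hypothetical function names.

```python
import random
import time


def sad_reference(a, b):
    """Reference primitive: sum of absolute differences, sample by sample."""
    total = 0
    for i in range(len(a)):
        total += abs(a[i] - b[i])
    return total


def sad_optimized(a, b):
    """Stand-in for the tuned kernel under test."""
    return sum(abs(x - y) for x, y in zip(a, b))


def random_block(rng, samples=256, max_value=255):
    """A block of random 8-bit samples (a 16x16 block by default)."""
    return [rng.randint(0, max_value) for _ in range(samples)]


def check_kernel(reference, optimized, trials=100, seed=0):
    """Compare both kernels on random inputs; True if every output matches."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = random_block(rng), random_block(rng)
        if reference(a, b) != optimized(a, b):
            return False
    return True


def average_time_ns(kernel, a, b, runs=1000):
    """Average wall-clock time per call, in nanoseconds, over `runs` calls."""
    begin = time.perf_counter_ns()
    for _ in range(runs):
        kernel(a, b)
    return (time.perf_counter_ns() - begin) / runs
```

Correctness is checked first, because a timing result for a kernel that produces different output is meaningless. The reported gain would then be the ratio of the two averages returned by `average_time_ns` for the same inputs.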

Cycle-Count improvement for kernels in the main and main10 profiles

Figure 2 shows the cycle-count improvements for each of the 500 kernels in the main profile and the 600+ kernels in the main10 profile that were accelerated with Intel AVX-512. In each curve, the kernels are sorted in increasing order of their cycle-count gains over the corresponding Intel AVX2 implementation. Appendix A details the per-kernel gains over Intel AVX2 in cycle counts.

On average, we saw a 33 percent and 40 percent gain in the cycle count over the Intel AVX2 kernels for kernels in the main and main10 profiles respectively. The reason for the higher gains in main10 is as follows. In the main10 profile, x265 uses 16 bits to represent each pixel, as opposed to the main profile, which uses 8 bits; although main10 technically only needs 10 bits, using 16 bits simplifies all data structures in the software. Therefore, the amount of work that has to be done for the same number of pixels is doubled. Due to the higher quantum of compute, kernels in the main10 profile gain more from Intel AVX-512 over Intel AVX2 than the kernels in the main profile do. These cycle-count results indicate that at the kernel level, there is much benefit in using Intel AVX-512 to accelerate x265. However, this does not account for the reduction in clock frequency incurred when using Intel AVX-512 instructions compared to using Intel AVX2 instructions. In the next section, we look at the effect on overall encoding time, which also accounts for this effect.
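The doubling of work in main10 follows from how many samples fit in one vector register. The sketch below is a minimal illustration with hypothetical function names; the 8-bit and 16-bit storage sizes are the ones stated in the text.

```python
def storage_bits(bit_depth):
    """Bits used to store one sample: 8 for 8-bit video, 16 for deeper video."""
    return 8 if bit_depth <= 8 else 16


def lanes(register_bits, bit_depth):
    """Samples that fit in one vector register of the given width."""
    return register_bits // storage_bits(bit_depth)


def register_loads(block_width, block_height, register_bits, bit_depth):
    """Full-width loads needed to read one block once (ceiling division)."""
    samples = block_width * block_height
    per_register = lanes(register_bits, bit_depth)
    return -(-samples // per_register)
```

A 512-bit register holds 64 main-profile samples but only 32 main10 samples, so reading a 64x64 block takes 64 loads in main and 128 in main10. With 256-bit Intel AVX2 registers the counts are 128 and 256.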

Accelerating x265 Encoding with Intel Advanced Vector Extensions 512

In this section, we look at the impact of using Intel AVX-512 kernels for real encoding use cases with x265. Section "Test Setup" describes our test setup including the videos chosen, the x265 presets used, and the system configurations of the test machines. Section "Encoding on Intel® Core™ Processors" presents results on a workstation machine with an Intel Core i9-7900X processor, while section "Encoding on Intel Xeon Scalable Processors" presents results on a typical high-end server that has two Intel Xeon Platinum 8180 processors.

Test setup

Our tests mainly focused on encoding 1080p videos with the main profile and 4K videos with the main10 profile. We used four typical 1080p clips (crowdrun, ducks_take_off, park_joy, and old_town_cross) and three 4K clips (Netflix_Boat, Netflix_FoodMarket, and Netflix_Tango) for our tests.10 Appendix B gives a little more detail, along with screenshots of the videos used. We encode the 1080p clips in the main profile at the following bitrates (in Kbps): 1000, 3000, 5000, and 7000. For the 4K clips, the main10 encodes target the following bitrates (in Kbps): 8000, 10000, 12000, and 14000.

We encode these videos with a version of x265 that has all the kernels described in Section 3; these kernels were contributed as part of the default branch of x265. The kernels are disabled by default and may be enabled with the --asm avx512 option in the x265 command-line interface.

Figure 2. Cycle-count gains of the main and main10 profile Intel® Advanced Vector Extensions 512 kernels over the corresponding Intel® Advanced Vector Extensions 2 kernels.

We focused our experiments on four presets of x265 to represent the wide set of use cases that x265 presents: ultrafast, veryfast, medium, and veryslow. These presets represent a wide variety of trade-offs between encode efficiency and frames per second (FPS). The veryslow preset generates the most efficient encode but is the slowest; this preset is also the preferred choice for any offline encoding use cases such as OTT. The ultrafast preset is the quickest setting of x265 but generates the encode with the lowest efficiency. The veryfast and medium presets represent intermediate points in the trade-off between performance and encoder efficiency. Typically, the more efficient presets employ more tools of HEVC, resulting in more compute-per-pixel than the less efficient presets. This is important to call out as Intel AVX-512 kernels tend to give better speedup when the compute-per-pixel is higher, as shown from the results in the previous section.
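The clips, bitrates, and presets listed in this section define the set of encodes per instruction set. The sketch below enumerates them, assuming that every clip was encoded at every listed bitrate with every listed preset (the text does not state this explicitly). The function name and dictionary keys are hypothetical.

```python
from itertools import product

CLIPS_1080P = ["crowdrun", "ducks_take_off", "park_joy", "old_town_cross"]
CLIPS_4K = ["Netflix_Boat", "Netflix_FoodMarket", "Netflix_Tango"]
BITRATES_1080P = [1000, 3000, 5000, 7000]     # Kbps, main profile
BITRATES_4K = [8000, 10000, 12000, 14000]     # Kbps, main10 profile
PRESETS = ["ultrafast", "veryfast", "medium", "veryslow"]


def encode_matrix():
    """List every (clip, profile, bitrate, preset) combination once."""
    runs = []
    for clip, kbps, preset in product(CLIPS_1080P, BITRATES_1080P, PRESETS):
        runs.append({"clip": clip, "profile": "main",
                     "bitrate": kbps, "preset": preset})
    for clip, kbps, preset in product(CLIPS_4K, BITRATES_4K, PRESETS):
        runs.append({"clip": clip, "profile": "main10",
                     "bitrate": kbps, "preset": preset})
    return runs
```

Under that assumption there are 64 encodes at 1080p and 48 at 4K, each of which is run once with Intel AVX2 kernels and once with Intel AVX-512 kernels to obtain a relative performance figure.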

Encoding on Intel® Core™ Processors

Figure 3 shows the performance of encoding 1080p and 4K video in main and main10 profile with Intel AVX-512 kernels relative to using Intel AVX2 kernels on a workstation-like configuration with an Intel Core i9-7900X processor using a single instance of x265. The full details of the system configuration are described in Appendix C. The single instance results in high utilization of the CPU across all configurations, representing a typical use case for this system when performing HEVC encoding.

Intel® Core™ i9-7900X Processor
Figure 3. Encoder performance from using Intel® Advanced Vector Extensions 512 kernels on a single instance of x265, as measured on a workstation-like system with an Intel® Core™ i9-7900X processor.

From the results, we see that for all profiles and presets, enabling Intel AVX-512 kernels results in positive performance gains. On the Intel Core i9-7900X processor system, our measurements did not indicate any significant reduction in clock frequency. The cycle-count improvements from the kernels therefore translate directly into increased encoder performance. When we examined the relative encoder performance per encode, we found no command lines that demonstrated lower performance with Intel AVX-512 than with Intel AVX2.

We therefore recommend that for the Intel Core i9-7900X processor, and similar systems where the frequency reduction is minimal, Intel AVX-512 kernels be enabled for all encoding profiles and resolutions when using x265.

Encoding on Intel Xeon Scalable Processors

In this section, we present results from using x265 accelerated by Intel AVX-512 on a high-end server configuration with two Intel Xeon Platinum 8180 processors arranged in a dual-socket configuration with 28 hyperthreaded cores per CPU. For full details of the system configuration, refer to Appendix C.

x265 single instance performance using 8 threads and 16 threads

Figure 4 shows the performance of a single instance of x265 with kernels that use Intel AVX-512 for encoding 1080p videos in the main profile and 4K videos in the main10 profile relative to using kernels that only use Intel AVX2 instructions. Two configurations, one with 8 threads per instance and another with 16 threads per instance, are shown in the graph to understand the impact of increasing the number of active cores on the CPU; limiting the number of threads for each instance is done using the --pools option of the x265 library.
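A command line for one such run can be assembled as follows. This sketch only builds the argument list and does not launch the encoder. The option names follow the x265 command-line interface as we understand it (confirm against `x265 --help` for your build); the file names and the helper function are hypothetical, and a main10 encode additionally assumes a 10-bit capable build.

```python
def x265_command(input_path, output_path, preset, profile, bitrate_kbps,
                 use_avx512=False, pool_threads=None):
    """Build an x265 argument list for a single encode.

    use_avx512:   add `--asm avx512`; when omitted the Intel AVX-512
                  kernels stay disabled, which is the default.
    pool_threads: cap the thread pool size with `--pools`.
    """
    cmd = ["x265", "--input", input_path,
           "--preset", preset,
           "--profile", profile,
           "--bitrate", str(bitrate_kbps)]
    if use_avx512:
        cmd += ["--asm", "avx512"]
    if pool_threads is not None:
        cmd += ["--pools", str(pool_threads)]
    cmd += ["--output", output_path]
    return cmd
```

Calling the helper twice with identical arguments except `use_avx512` yields the pair of runs whose FPS ratio is plotted, and `pool_threads=8` or `pool_threads=16` selects the two thread configurations compared in the figure.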

The figure shows that for a given thread configuration, the gains when encoding 4K content in the main10 profile are higher than for the 1080p content in the main profile. Also, for a given resolution and profile, the gains that we see from the presets that have more work-per-pixel (the more efficient presets, like the veryslow preset) are higher than those from the faster presets; in fact, for 1080p content in the main profile, we see an average performance loss. These gains are consistent with previously observed results that demonstrate that the more work per pixel a specific configuration has, the better it is to use Intel AVX-512. Additionally, when we investigated the S-curves of these profiles (not shown here for brevity), we saw that several encoder command lines outside the 4K main10 veryslow setting lost performance relative to Intel AVX2.

We therefore recommend using Intel AVX-512-enabled kernels only when doing 4K encodes in the main10 profile with the veryslow preset. For other presets and encoder settings, the amount of work per pixel is insufficient for the cycle-count gains to offset the reduction in clock frequency.

One additional observation we can make from Figure 4 is that the performance gains are in general higher across the board when using 8 threads for the single instance of x265, compared to 16 threads. Upon further analysis, we observe that when more cores are activated with Intel AVX-512 instructions in the Intel Xeon Platinum 8180 processor, the frequency reduces further, resulting in lower gains from using Intel AVX-512 instructions. In a typical server, however, encoder vendors attempt to use all available CPU cores to get the maximum throughput out of the given server. This use case is explored in Section 4.3.2, where we attempt to saturate the server with 4K main10 encodes to see whether the lower frequency when more cores are activated mutes the gains.

Intel® Xeon® Platinum 8180 Processor
Figure 4. Relative performance of a single instance of x265 when using Intel® Advanced Vector Extensions 512 kernels with 8 or 16 threads over Intel® Advanced Vector Extensions 2 kernels on a server configuration with two Intel® Xeon® Platinum 8180 processors.

Saturating Intel® Xeon® Platinum 8180 processors using multiple instances of x265

To study whether activating more cores results in performance loss for 4K encodes in the main10 profile, we saturated one and both CPUs of a dual-socket Intel Xeon Platinum 8180 processor-based server with four and eight instances of x265, respectively, with each instance using 16 threads. We measured the total FPS achieved by all x265 instances to encode the same clip at different bitrates when using kernels that use Intel AVX-512 and reported the number relative to when the Intel AVX2-enabled kernels were used. Figure 5 shows these results.
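The relative number reported for the saturation runs can be computed as sketched below. The function names and any FPS values passed in are hypothetical. The instance count is one plausible reading of the setup: 28 hyperthreaded cores per CPU give 56 logical CPUs, which four 16-thread instances cover.

```python
def instances_per_socket(physical_cores, threads_per_core, threads_per_instance):
    """Instances needed so their threads cover every logical CPU of a socket."""
    logical = physical_cores * threads_per_core
    return -(-logical // threads_per_instance)  # ceiling division


def relative_throughput(fps_avx512, fps_avx2):
    """Total FPS of all AVX-512 instances divided by total FPS of all AVX2 ones.

    Each argument is a list with one FPS value per concurrently running
    x265 instance, measured on the same clip and bitrate.
    """
    total_avx2 = sum(fps_avx2)
    if total_avx2 <= 0:
        raise ValueError("baseline throughput must be positive")
    return sum(fps_avx512) / total_avx2
```

Summing FPS across instances before taking the ratio measures the throughput of the whole server, so a slowdown in one instance is not hidden by averaging per-instance ratios. A result above 1.0 means the Intel AVX-512 kernels improved total throughput.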

Intel® Xeon® Platinum 8180 processor - Single and Dual Socket Saturation
Figure 5. Single-socket and dual-socket saturation of the Intel® Xeon® Platinum 8180 processor with x265 instances.

Figure 5 shows that even when saturating one or both CPUs, encoding 4K videos with main10 shows positive performance gains over using the Intel AVX2 counterparts. However, the gains are lower than the corresponding gains achieved with a single instance of x265 that uses fewer cores. Additionally, we observe that for lower efficiency presets such as veryfast and medium, the gains are muted due to the higher frequency drop with more active cores.

These results reiterate our recommendation that Intel AVX-512 kernels should only be enabled when encoding 4K content for the main10 profile for the veryslow preset. For other presets that have lower compute per pixel, enabling Intel AVX-512 kernels may result in a performance loss over using Intel AVX2 kernels.

Conclusions and Future Work

In this paper, we presented our experience with using the Intel AVX-512 instructions available in the newly introduced Intel Xeon Scalable processors to accelerate the open-source HEVC encoder x265. The specific challenges that we had to overcome included selecting the right kernels to accelerate with Intel AVX-512 such that the reduction in CPU frequency was offset by the benefits in cycle count, and choosing the right encoder configuration that provided the right balance of compute per pixel to achieve positive gains in encoder performance.

Recommendations

Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:

  • When choosing specific kernels to accelerate with Intel AVX-512, consider each kernel's compute-to-memory ratio. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64 bytes in order to avoid loads that cross cache-line boundaries.
  • For desktop and workstation SKUs (like the Intel Core i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations because the reduction in CPU clock frequency is rather low.
  • For server SKUs (like the Intel Xeon Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.

Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.
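These recommendations can be condensed into a small decision helper. The sketch below encodes only the rules stated above for the two systems we tested. The function name and the string labels for SKU class and resolution are hypothetical, and other hardware may behave differently.

```python
SERVER_PRESETS = {"slower", "veryslow"}  # presets named in the recommendation


def should_enable_avx512(sku_class, resolution, profile, preset):
    """Return True if the recommendations above favour Intel AVX-512 kernels.

    sku_class:  "workstation" (e.g. Intel Core i9-7900X) or
                "server" (e.g. Intel Xeon Platinum 8180)
    resolution: "1080p" or "4K"
    profile:    "main" or "main10"
    preset:     an x265 preset name
    """
    if sku_class == "workstation":
        # The frequency reduction was low, so every configuration gained.
        return True
    if sku_class == "server":
        # Only high compute-per-pixel settings offset the frequency drop.
        return (resolution == "4K"
                and profile == "main10"
                and preset in SERVER_PRESETS)
    raise ValueError("unknown sku_class: " + repr(sku_class))
```

A caller could use the result to decide whether to pass `--asm avx512` to x265 for a given job.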

While the results and recommendations presented in this paper are not without the limitations of the evaluations and our experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.

Future work

The task of accelerating x265 with Intel AVX-512 has opened several avenues for future work. The accelerated kernels are available through the public mailing list. Future extensions of this work to enable further acceleration from Intel AVX-512 include (1) performing a thorough analysis of the use of Intel AVX-512 for videos at other resolutions and presets available in x265, (2) enabling schemes to dynamically enable and disable Intel AVX-512 kernels by monitoring the CPU frequency, and (3) a fundamental re-architecting of the encoder to segregate the worker threads into different types of threads, only some of which may run Intel AVX-512, limiting the number of cores where the CPU frequency drop is observed. We will continue to develop and contribute these solutions to open source, and encourage the reader to also contribute to the project at http://x265.org.

Acknowledgements

This work was funded in part by a non-recurring engineering grant from Intel to MulticoreWare. We would like to thank the various developers and engineers at MulticoreWare for their extensive support throughout this work. In particular, we would like to thank Thomas A. Vaughan for his guidance and Min Chen for his expert comments on the assembly patches.

Appendix A

A1 – Main profile instructions per cycle (IPC) gains

Primitive IPC Gain Primitive IPC Gain Primitive IPC Gain Primitive IPC Gain
sad 0.16% i422 chroma_vss 32.70% i420 chroma_vpp 23.19% luma_vss 43.18%
pixelavg _pp 0.87% luma_vss 32.89% addAvg 23.37% luma_vss 43.35%
i444 chroma_vps 1.14% sad_x3 33.01% addAvg 23.38% i444 chroma_hpp 43.43%
i444 chroma_vps 1.18% luma_vps 33.05% i444 chroma_hps 23.53% ssd_s 43.57%
pixelavg _pp 1.41% i420 chroma_hpp 33.08% i420 chroma_hps 23.77% luma_hps 43.68%
convert_p2s 1.95% i444 chroma_hpp 33.14% var 23.95% luma_vss 43.75%
i420 chroma_vps 2.45% sad_x4 33.14% i420 chroma_hpp 24.03% luma_hps 43.84%
i420 chroma_vps 2.72% i444 chroma_vss 33.16% i422 chroma_vpp 24.11% luma_hps 43.94%
i422 chroma_hps 2.83% i420 chroma_vss 33.16% i444 chroma_vss 24.15% luma_vsp 44.06%
i420 p2s 3.21% copy _ps 33.33% i422 chroma_vss 24.15% luma_vsp 44.11%
i444 p2s 3.21% i420 copy _ps 33.33% i420 chroma_vss 24.15% sub_ps 44.11%
sad_x3 3.29% i444 chroma_vss 33.34% i420 chroma_vps 24.20% i444 chroma_hpp 44.15%
i420 chroma_vps 3.62% i422 chroma_vss 33.34% i444 chroma_vpp 24.20% convert_p2s 44.33%
sad_x4 4.50% i420 chroma_vss 33.34% i420 chroma_vpp 24.20% i444 chroma_hpp 44.35%
sad 4.62% i422 copy _ps 33.43% sad 24.21% luma_vss 44.42%
i420 chroma_hps 4.90% i444 chroma_vss 33.43% i444 chroma_vps 24.22% luma_hps 44.43%
i420 chroma_hps 5.19% i422 chroma_vss 33.43% i420 chroma_vps 24.22% luma_hpp 44.48%
pixel_satd 5.42% i420 chroma_hpp 33.55% i444 chroma_hps 24.25% luma_vpp 44.54%
i444 chroma_vps 5.43% i422 chroma_hpp 33.57% i420 chroma_hpp 24.42% luma_vss 44.61%
i422 chroma_hps 5.82% dequant_normal 33.60% sad_x4 24.53% cpy1Dto2D_shl 44.61%
i444 chroma_vps 6.78% sad_x4 33.62% i444 chroma_hps 24.57% luma_vsp 44.62%
dct 7.06% i444 chroma_vss 33.89% i422 chroma_hps 24.65% luma_vsp 44.66%
i444 chroma_hps 7.08% i420 chroma_vss 33.89% psyCost_pp 24.89% luma_vss 44.70%
i444 chroma_hps 7.26% sad_x3 33.92% i422 chroma_vps 25.00% luma_vpp 44.74%
i422 chroma_vss 8.85% i420 pixel_satd 34.01% i444 chroma_vss 25.17% luma_vsp 44.85%
luma_vss 9.76% i444 chroma_hps 34.02% i422 chroma_vss 25.17% i422 copy_sp 45.20%
i422 chroma_hps 10.27% luma_vps 34.04% i420 chroma_vss 25.17% getResidual32 45.24%
i444 chroma_hps 11.00% i444 chroma_hpp 34.20% i422 chroma_vps 25.66% luma_vpp 45.30%
i444 chroma_hps 11.14% i420 pixel_satd 34.20% luma_vps 25.82% luma_hps 45.35%
sad 11.26% i420 chroma_hpp 34.23% i444 chroma_vps 25.89% i444 chroma_hpp 45.41%
i420 chroma_hps 11.38% i444 chroma_vss 34.43% i444 chroma_vps 25.92% luma_hpp 45.49%
pixel_sa8d 11.55% i422 chroma_vss 34.43% i420 chroma_hps 25.95% convert_p2s 45.52%
i444 chroma_hps 11.91% i420 chroma_vss 34.43% i420 chroma_vps 26.07% luma_hps 45.58%
luma_vpp 11.96% i422 chroma_vsp 34.59% convert_p2s 26.25% luma_vpp 45.62%
i422 chroma_hps 12.10% i444 chroma_vss 34.71% i422 chroma_vps 26.42% convert_p2s 45.62%
copy_pp 12.54% i444 chroma_vss 34.76% i444 chroma_vps 26.56% luma_vpp 45.69%
ssd_s 12.58% addAvg 34.88% i444 chroma_vss 26.71% cpy2Dto1D_shl 45.75%
i420 chroma_vps 12.58% addAvg 35.14% i422 chroma_vss 26.71% i422 addAvg 45.76%
i444 chroma_hps 12.79% sad 35.43% i420 chroma_vss 26.71% convert_p2s 46.00%
idct 13.32% ssd_ss 35.45% sad_x4 26.80% i420 add_ps 46.09%
luma_vps 13.78% i444 chroma_vss 35.51% i422 chroma_hpp 27.06% add_ps 46.10%
i444 chroma_hps 13.87% i420 pixel_satd 35.55% i422 chroma_hps 27.13% luma_vsp 46.14%
sad 13.88% pixelavg_pp 35.56% luma_hpp 27.15% luma_hps 46.29%
copy_cnt 14.25% luma_vpp 35.62% i420 pixel_satd 27.23% luma_vss 46.31%
luma_vpp 14.28% luma_vpp 36.21% i444 chroma_vss 27.24% i444 chroma_vsp 46.52%
pixel_satd 14.45% i420 chroma_hpp 36.45% i422 chroma_vss 27.24% i422 chroma_vsp 46.52%
idct 14.49% i422 chroma_hpp 36.65% luma_hpp 27.29% i420 chroma_vsp 46.52%
pixel_satd 14.92% i422 chroma_hpp 36.76% luma_vps 27.45% luma_hps 46.65%
pixel_satd 14.99% sad 36.76% psyCost_pp 27.62% pixelavg_pp 46.67%
sad 15.21% i422 chroma_hpp 36.81% luma_vsp 27.72% luma_vss 46.88%
idct 15.23% copy_pp 36.82% i422 chroma_hps 28.00% i422 addAvg 46.88%
sad_x3 15.32% pixelavg_pp 36.84% pixel_satd 28.50% luma_hps 46.90%
i444 chroma_vpp 15.47% convert_p2s 36.87% cpy2Dto1D_shl 28.69% luma_vsp 46.97%
i422 chroma_vpp 15.47% i420 p2s 36.87% luma_vps 28.71% i422 p2s 47.10%
i420 chroma_vpp 15.47% i444 p2s 36.87% i444 chroma_hpp 28.78% copy_pp 47.11%
pixel_satd 15.52% i444 chroma_hpp 37.07% i420 pixel_satd 28.80% luma_vss 47.64%
pixel_satd 15.62% luma_vpp 37.11% i422 pixel_satd 28.81% i444 chroma_hpp 47.83%
pixel_satd 15.66% luma_vss 37.49% i422 pixel_satd 28.95% i422 addAvg 47.85%
sad_x3 15.70% addAvg 37.76% luma_vss 29.26% luma_hps 48.46%
pixel_satd 15.75% i444 chroma_vps 37.90% i444 chroma_vss 29.29% copy_ps 48.57%
i420 chroma_hps 15.83% i444 chroma_vss 38.04% i420 chroma_hps 29.42% sub_ps 48.83%
copy_pp 15.93% i444 chroma_vps 38.05% luma_vpp 29.43% luma_hpp 48.97%
luma_vpp 16.10% i444 chroma_vps 38.23% scale1D_128to64 29.50% i422 add_ps 49.02%
nquant 16.33% sad 38.42% luma_vss 29.59% i444 chroma_vsp 49.43%
sad 16.35% i444 chroma_hpp 38.45% i444 chroma_vpp 29.69% i420 sub_ps 49.46%
i444 chroma_vpp 16.39% Weight_sp 38.48% i422 chroma_vpp 29.69% add_ps 49.50%
i420 chroma_hps 16.60% i444 chroma_hpp 38.55% i420 chroma_vpp 29.69% i422 sub_ps 49.52%
i444 chroma_vpp 17.02% sad 38.56% i422 chroma_hps 29.71% i420 addAvg 49.74%
i422 chroma_vpp 17.02% luma_hpp 38.79% i422 pixel_satd 29.75% convert_p2s 49.75%
i420 chroma_vpp 17.02% pixel_satd 39.15% i444 chroma_vpp 29.82% i422 p2s 49.75%
pixel_satd 17.08% luma_hpp 39.21% i422 chroma_vpp 29.82% i444 p2s 49.75%
luma_vps 17.10% i444 chroma_hpp 39.30% luma_vss 29.91% luma_vss 49.84%
luma_vps 17.36% i444 chroma_vps 39.39% i444 chroma_vss 29.92% luma_hpp 50.00%
i444 chroma_vss 17.55% addAvg 39.51% i422 chroma_vss 29.92% copy_sp 50.11%
i420 chroma_vss 17.55% i420 chroma_hpp 39.55% i420 chroma_vss 29.92% luma_vss 50.22%
pixel_satd 17.59% i422 pixel_satd 39.57% luma_vps 30.19% luma_hpp 50.61%
pixel_satd 17.66% i422 chroma_hpp 39.61% sad_x4 30.24% luma_hpp 51.19%
i444 chroma_vss 18.42% convert_p2s 39.78% sad 30.30% i444 chroma_vsp 51.23%
i422 chroma_vss 18.42% i420 p2s 39.78% luma_vps 30.37% luma_hpp 51.70%
i420 chroma_vss 18.42% i422 p2s 39.78% luma_vps 30.39% nonPsyRdoQuant 51.74%
i444 chroma_vpp 18.49% i444 p2s 39.78% i444 chroma_vpp 30.39% i444 chroma_vsp 52.08%
i420 chroma_vpp 18.49% copy_sp 39.93% i422 chroma_vpp 30.39% copy_pp 52.17%
luma_vps 18.50% i420 addAvg 40.02% i420 chroma_vpp 30.39% i444 chroma_vsp 52.22%
luma_vpp 18.51% luma_hps 40.04% ssd_ss 30.44% i444 chroma_vsp 52.28%
sad_x3 18.99% i444 chroma_hpp 40.07% i422 chroma_hpp 30.45% nonPsyRdoQuant 52.32%
copy_pp 19.76% addAvg 40.64% i420 pixel_satd 30.53% i422 copy_ss 52.45%
luma_vss 19.80% luma_vsp 40.87% i422 chroma_vpp 30.54% nonPsyRdoQuant 52.56%
pixel_satd 19.89% i444 chroma_vsp 40.96% i444 chroma_hpp 30.54% i444 chroma_vsp 52.77%
sad 20.09% i420 chroma_vsp 40.96% i422 chroma_hpp 30.56% i422 chroma_vsp 52.77%
sad_x3 20.26% luma_vss 41.01% i444 chroma_hpp 30.63% blockfill_s 52.93%
i444 chroma_hps 20.52% i420 copy_sp 41.12% i420 chroma_hpp 30.85% i444 chroma_vsp 53.30%
i420 chroma_hps 20.80% copy_cnt 41.14% luma_vsp 30.95% i422 chroma_vsp 53.30%
psyCost_pp 21.15% luma_vsp 41.16% sad_x4 30.95% i420 chroma_vsp 53.30%
i444 chroma_hps 21.17% Weight_pp 41.23% i422 chroma_vss 30.99% i422 chroma_vsp 53.36%
pixel_satd 21.19% luma_hps 41.42% i444 chroma_hps 31.12% i444 chroma_vsp 54.34%
pixel_satd 21.21% addAvg 41.84% i444 chroma_vpp 31.17% i422 chroma_vsp 54.34%
quant 21.23% i420 addAvg 41.87% i444 chroma_vpp 31.20% i420 chroma_vsp 54.34%
sad_x3 21.29% luma_vsp 41.99% sad 31.29% psyRdoQuant 54.44%
i444 chroma_vpp 21.42% luma_hps 42.05% luma_vsp 31.33% luma_hpp 54.62%
i422 chroma_vpp 21.42% convert_p2s 42.13% sad_x3 31.34% i444 chroma_vsp 54.64%
i420 chroma_vpp 21.42% i420 p2s 42.13% i422 pixel_satd 31.46% i420 chroma_vsp 54.64%
i420 chroma_vps 21.60% i422 p2s 42.13% luma_hps 31.52% luma_hpp 54.78%
pixel_satd 21.61% i444 p2s 42.13% i444 chroma_vpp 31.57% luma_hpp 55.06%
i444 chroma_vps 21.69% i444 chroma_vsp 42.31% pixelavg_pp 31.62% luma_hpp 55.40%
i422 chroma_hps 21.99% i422 chroma_vsp 42.31% luma_vps 31.76% copy_pp 55.41%
i420 addAvg 22.01% i420 chroma_vsp 42.31% i444 chroma_hps 31.78% psyRdoQuant 55.70%
luma_vsp 22.09% luma_vsp 42.35% sad_x3 31.95% psyRdoQuant 55.72%
i444 chroma_vps 22.27% i420 chroma_hpp 42.43% i444 chroma_vss 31.96% var 55.75%
i422 chroma_vps 22.41% nonPsyRdoQuant 42.51% i420 chroma_vss 31.96% copy_ss 56.00%
sad_x4 22.44% luma_hps 42.54% i422 chroma_vss 32.01% i444 chroma_vsp 56.36%
var 22.51% addAvg 42.56% i444 chroma_hpp 32.12% i422 chroma_vsp 56.36%
i444 chroma_vpp 22.64% luma_hps 42.58% var 32.17% i420 chroma_vsp 56.36%
i420 chroma_vpp 22.64% luma_vss 42.82% i420 chroma_hpp 32.32% i420 copy_ss 56.63%
sad_x4 22.84% i422 addAvg 42.93% i444 chroma_hps 32.44% i444 chroma_vsp 57.60%
i444 chroma_vpp 22.87% luma_vpp 42.97% luma_vsp 32.61% i420 chroma_vsp 57.60%
i422 chroma_vpp 22.87% dequant_scaling 42.98% i444 chroma_vss 32.67% copy_pp 58.33%
i422 chroma_hpp 22.92% luma_hpp 42.99% i420 chroma_vss 32.67% copy_ss 60.09%
sad_x4 23.09% i444 chroma_vsp 43.05% i444 chroma_vss 32.69% psyRdoQuant 62.80%
i444 chroma_vpp 23.19% i422 chroma_vsp 43.05% i422 chroma_vss 32.69% i444 chroma_vsp 62.98%
        i420 chroma_vss 32.69% i420 chroma_vsp 62.98%

A2 – Main10 profile IPC gains

Primitive IPC Gain Primitive IPC Gain Primitive IPC Gain Primitive IPC Gain
convert_p2s 1.26% i422 chroma_hps 39.92% i422 chroma_vpp 29.64% i444 chroma_hpp 49.20%
i420 p2s 1.26% i422 p2s 40.30% i420 chroma_vpp 29.64% i444 chroma_hps 49.45%
i444 p2s 1.26% luma_hpp 40.35% i444 chroma_vsp 29.82% cpy2Dto1D_shl 49.70%
addAvg 1.86% i422 chroma_hpp 40.52% i422 chroma_vsp 29.82% luma_hvpp 49.80%
addAvg 6.88% copy_cnt 40.55% i420 chroma_vsp 29.82% luma_vss 49.84%
dct 7.06% luma_vpp 40.58% luma_vss 29.91% i420 chroma_hps 49.85%
sad_x3 7.65% luma_vsp 40.59% i444 chroma_vss 29.92% convert_p2s 49.87%
sad 7.74% i444 chroma_vps 40.60% i422 chroma_vss 29.92% i420 p2s 49.87%
sad 8.29% i422 chroma_vps 40.60% i420 chroma_vss 29.92% i422 p2s 49.87%
i420 addAvg 8.36% i420 chroma_vps 40.60% i444 chroma_vps 29.93% i422 p2s 49.87%
sad_x3 8.77% sad_x3 40.64% i422 chroma_vps 29.93% i444 p2s 49.87%
luma_vss 9.76% nonPsyRdoQuant 40.70% i420 chroma_vps 29.93% luma_hps 49.94%
intra_pred_ang27 9.79% add_ps 40.71% luma_vsp 30.06% i422 chroma_hps 50.07%
cpy2Dto1D_shl 10.13% sad_x4 40.73% i444 chroma_vsp 30.11% i444 chroma_hpp 50.13%
sad_x3 10.81% luma_vpp 40.73% i422 chroma_vsp 30.11% luma_vss 50.22%
sad_x4 10.96% copy_pp 40.81% i420 chroma_vsp 30.11% luma_hpp 50.25%
i420 addAvg 11.05% i422 chroma_vps 40.88% pixel_satd 30.30% i420 chroma_vpp 50.28%
pixel_satd 11.05% luma_vss 41.01% i422 pixel_satd 30.30% luma_hps 50.67%
i420 pixel_satd 11.05% i444 chroma_vsp 41.02% i422 pixel_satd 30.35% addAvg 50.67%
i422 pixel_satd 11.05% i420 chroma_vsp 41.02% add_ps 30.69% i422 addAvg 50.67%
luma_vsp 12.64% i444 chroma_vsp 41.05% sad 30.94% luma_hpp 50.75%
copy_cnt 13.29% i420 chroma_vsp 41.05% dequant_normal 31.10% i420 chroma_hpp 50.82%
idct 13.32% sad 41.06% sad 31.37% copy_pp 50.95%
i444 chroma_vps 14.44% intra_pred_ang34 41.06% pixel_satd 31.43% i422 addAvg 50.99%
i422 chroma_vps 14.44% convert_p2s 41.09% i420 pixel_satd 31.43% luma_hps 51.17%
i420 chroma_vps 14.44% i444 p2s 41.09% i422 pixel_satd 31.43% i422 chroma_hpp 51.22%
idct 14.49% nonPsyRdoQuant 41.21% i444 chroma_vpp 31.60% i444 chroma_hpp 51.37%
i444 chroma_vpp 14.84% sad_x4 41.22% i422 chroma_vss 31.76% luma_hpp 51.48%
idct 15.23% i422 chroma_vpp 41.25% i444 chroma_vss 31.96% luma_hps 51.57%
luma_vsp 15.24% i420 chroma_vpp 41.25% i420 chroma_vss 31.96% copy_ss 51.58%
sad_x3 15.53% i420 chroma_vpp 41.36% sad 31.99% luma_hpp 51.63%
addAvg 15.60% i444 chroma_vsp 41.40% psyCost_pp 32.12% luma_hps 51.64%
i422 chroma_vpp 15.71% luma_vpp 41.43% i420 chroma_hps 32.32% luma_hps 51.65%
i420 chroma_vpp 15.71% luma_hvpp 41.46% i422 addAvg 32.46% luma_hps 51.70%
addAvg 15.90% luma_vpp 41.48% i422 chroma_vss 32.62% luma_hps 51.81%
i422 chroma_vpp 16.07% i444 chroma_vsp 41.51% i444 chroma_vss 32.67% i422 chroma_hpp 51.86%
intra_pred_ang25 16.22% luma_hvpp 41.54% i420 chroma_vss 32.67% luma_hps 51.89%
nquant 16.33% intra_pred_ang11 41.55% i444 chroma_vss 32.69% addAvg 51.89%
sad_x4 16.42% convert_p2s 41.58% i422 chroma_vss 32.69% i420 addAvg 51.89%
luma_vsp 16.55% sad_x4 41.71% i420 chroma_vss 32.69% i422 addAvg 51.89%
i420 addAvg 17.12% sad_x4 41.71% luma_vss 32.89% luma_hps 51.93%
sad_x4 17.33% luma_vsp 41.78% i444 chroma_vsp 33.14% luma_hps 51.99%
i444 chroma_vss 17.55% sad_x4 41.83% i422 chroma_vsp 33.14% i444 chroma_hpp 52.16%
i420 chroma_vss 17.55% i444 chroma_vsp 42.01% i444 chroma_vss 33.16% i422 copy_sp 52.45%
i444 chroma_vps 17.88% i444 chroma_vsp 42.08% i420 chroma_vss 33.16% i422 copy_ps 52.45%
i422 chroma_vps 17.88% i422 chroma_vsp 42.08% convert_p2s 33.27% i422 copy_ss 52.45%
i420 chroma_vps 17.88% nonPsyRdoQuant 42.13% i444 chroma_vss 33.34% i444 chroma_hps 52.94%
pixel_satd 18.02% pixelavg_pp 42.17% i422 chroma_vss 33.34% copy_ss 53.20%
i422 addAvg 18.13% i422 chroma_vpp 42.20% i420 chroma_vss 33.34% i420 chroma_hps 53.22%
i444 chroma_vss 18.42% i420 chroma_vpp 42.20% i444 chroma_vss 33.43% i422 chroma_hps 53.27%
i422 chroma_vss 18.42% luma_vps 42.30% i422 chroma_vss 33.43% i420 chroma_hpp 53.48%
i420 chroma_vss 18.42% sub_ps 42.52% pixelavg_pp 33.45% copy_pp 53.53%
addAvg 19.50% luma_vsp 42.55% pixel_satd 33.45% i422 chroma_hpp 53.81%
i444 chroma_vps 19.54% luma_hvpp 42.65% i420 pixel_satd 33.45% i422 chroma_hpp 53.89%
i422 chroma_vps 19.54% pixelavg_pp 42.65% addAvg 33.46% i444 chroma_hpp 54.31%
i420 chroma_vps 19.54% luma_vps 42.72% luma_vsp 33.47% ssd_ss 54.69%
sad_x3 19.75% convert_p2s 42.77% sad_x4 33.51% i422 chroma_hpp 54.77%
luma_vss 19.80% luma_vss 42.82% i444 chroma_vsp 33.79% i420 chroma_hpp 55.18%
i422 pixel_satd 19.95% luma_vsp 43.05% i422 chroma_vsp 33.79% luma_hpp 55.53%
pixel_satd 20.02% convert_p2s 43.11% i420 chroma_vsp 33.79% i444 chroma_hpp 55.56%
i420 pixel_satd 20.02% i444 chroma_hpp 43.15% i444 chroma_vss 33.89% i444 chroma_hpp 55.78%
i422 pixel_satd 20.02% luma_vsp 43.17% i420 chroma_vss 33.89% i444 chroma_hpp 55.94%
i444 chroma_vps 20.09% luma_vss 43.18% luma_vsp 34.08% luma_hpp 55.96%
i420 chroma_vps 20.09% luma_vsp 43.22% sub_ps 34.13% copy_sp 56.00%
i422 chroma_vss 20.53% luma_hvpp 43.24% i444 chroma_vsp 34.18% copy_ps 56.00%
sad_x4 20.69% luma_vss 43.35% i420 chroma_vsp 34.18% i444 chroma_hpp 56.07%
i444 chroma_vps 20.86% luma_vsp 43.36% i444 chroma_vsp 34.22% luma_hpp 56.16%
i422 chroma_vps 20.86% i420 chroma_hpp 43.38% i422 chroma_vsp 34.22% i420 copy_sp 56.63%
i444 chroma_vpp 20.98% cpy1Dto2D_shl 43.50% i420 chroma_vsp 34.22% i420 copy_ps 56.63%
quant 21.23% luma_vsp 43.50% i444 chroma_vss 34.43% i420 copy_ss 56.63%
i422 chroma_vpp 21.45% luma_vpp 43.51% i422 chroma_vss 34.43% i422 chroma_hpp 57.32%
sad 21.61% copy_pp 43.54% i420 chroma_vss 34.43% i444 chroma_hps 57.33%
i444 chroma_vpp 21.78% luma_hvpp 43.57% pixel_satd 34.59% luma_hpp 57.40%
i444 chroma_vps 22.06% luma_vpp 43.58% i444 chroma_vss 34.71% i420 chroma_hps 57.97%
i420 chroma_vps 22.06% luma_hvpp 43.60% i444 chroma_vss 34.76% luma_hpp 58.55%
i444 chroma_vsp 22.12% luma_vss 43.75% intra_pred_ang10 34.76% i444 chroma_hps 59.21%
i422 chroma_vsp 22.12% luma_vps 43.77% i444 chroma_vps 34.80% i420 chroma_hps 59.46%
i420 chroma_vsp 22.12% i444 chroma_vsp 43.80% i444 chroma_vps 34.98% blockfill_s 59.53%
i444 chroma_vsp 22.14% i420 chroma_vsp 43.80% luma_vps 35.07% luma_hpp 59.56%
i422 chroma_vsp 22.14% pixelavg_pp 43.94% i444 chroma_vps 35.34% i422 chroma_hps 59.75%
i420 chroma_vsp 22.14% psyRdoQuant 44.02% Weight_pp 35.37% copy_sp 60.09%
i422 chroma_vpp 22.28% sad_x3 44.17% i444 chroma_vss 35.51% copy_ps 60.09%
i420 chroma_vpp 22.28% pixelavg_pp 44.23% luma_vps 35.63% luma_hps 60.23%
i444 chroma_vpp 22.28% luma_hvpp 44.24% i422 chroma_hps 35.68% psyRdoQuant 60.25%
i422 chroma_vpp 22.35% luma_hvpp 44.28% i444 chroma_vps 36.38% luma_hpp 60.26%
ssd_ss 22.60% luma_vsp 44.31% i422 chroma_vss 36.56% i444 chroma_hps 60.28%
i444 chroma_vpp 23.06% dequant_scaling 44.37% sad 36.66% i420 chroma_hps 60.48%
sad_x4 23.09% convert_p2s 44.40% luma_vpp 36.68% luma_hps 60.76%
luma_vpp 23.67% luma_vpp 44.41% i444 chroma_vpp 36.70% copy_pp 60.87%
luma_vpp 23.82% luma_vss 44.42% luma_vsp 36.71% i444 chroma_hps 60.92%
i444 chroma_vpp 23.84% sad_x4 44.42% sad_x3 36.75% i422 chroma_hps 61.09%
i444 chroma_vss 24.15% luma_vpp 44.60% sad_x4 36.78% luma_hpp 61.28%
i422 chroma_vss 24.15% luma_vss 44.61% pixel_satd 36.88% i444 chroma_hpp 61.38%
i420 chroma_vss 24.15% luma_hvpp 44.61% i422 chroma_vpp 36.91% luma_hpp 61.43%
intra_pred_ang9 24.37% getResidual32 44.64% copy_pp 36.96% luma_hpp 61.44%
i444 chroma_vpp 24.41% luma_hpp 44.68% addAvg 37.08% i422 chroma_hps 61.55%
luma_vpp 24.48% luma_vss 44.70% sad_x4 37.09% luma_hpp 61.58%
i422 addAvg 24.62% luma_hvpp 44.73% i420 chroma_vpp 37.29% luma_hpp 62.26%
psyCost_pp 24.88% i444 chroma_vsp 44.76% i422 chroma_vpp 37.36% i422 chroma_hps 62.31%
i420 chroma_vpp 24.90% i422 chroma_vsp 44.76% i420 chroma_vpp 37.36% luma_hpp 62.35%
i422 chroma_vpp 25.11% i420 chroma_vsp 44.76% luma_vss 37.49% i420 chroma_hpp 62.39%
i420 chroma_vpp 25.11% sad_x4 44.85% luma_vpp 37.53% i420 chroma_hps 62.39%
i444 chroma_vps 25.17% luma_hvpp 45.15% i444 chroma_vps 37.54% i444 chroma_hpp 62.46%
i422 chroma_vps 25.17% luma_vps 45.19% i422 chroma_vps 37.54% luma_hpp 62.63%
i420 chroma_vps 25.17% i422 chroma_hpp 45.23% i420 chroma_vps 37.54% i444 chroma_hps 62.88%
i444 chroma_vss 25.17% intra_pred_dc 45.26% i444 chroma_vpp 37.59% i420 chroma_hps 62.95%
i422 chroma_vss 25.17% sad 45.31% i420 chroma_vpp 37.59% luma_hpp 63.07%
i420 chroma_vss 25.17% luma_vps 45.36% i444 chroma_vps 37.59% i444 chroma_hps 63.15%
i422 chroma_vps 25.28% psyRdoQuant 45.40% i422 chroma_vps 37.59% luma_hps 63.16%
i444 chroma_vps 25.97% i420 add_ps 45.40% pixel_satd 37.60% i420 chroma_hpp 63.34%
i422 chroma_vps 25.97% pixelavg _pp 45.52% i444 chroma_vps 37.60% luma_hpp 63.61%
i420 chroma_vps 25.97% addAvg 45.54% i420 chroma_vps 37.60% i420 chroma_hps 63.85%
luma_vpp 26.22% i420 addAvg 45.54% i444 chroma_vsp 37.66% luma_hpp 63.91%
sad 26.25% i422 addAvg 45.54% i422 chroma_vps 37.68% i420 chroma_hpp 64.12%
psyCost_pp 26.30% i444 chroma_vsp 45.57% i444 chroma_vpp 37.69% i444 chroma_hps 64.15%
i444 chroma_vsp 26.38% i422 chroma_vsp 45.57% i444 chroma_vps 37.71% i444 chroma_hpp 64.23%
i420 chroma_vsp 26.38% i420 chroma_vsp 45.57% i420 chroma_vps 37.71% i422 chroma_hpp 64.39%
i420 addAvg 26.39% luma_vps 45.58% convert_p2s 37.73% i422 chroma_hpp 64.56%
i422 addAvg 26.39% pixelavg_pp 45.61% i420 p2s 37.73% i444 chroma_hps 64.84%
pixel_satd 26.62% luma_vps 45.62% i422 p2s 37.73% i422 chroma_hps 64.87%
i444 chroma_vss 26.71% luma_vps 45.64% i444 p2s 37.73% i444 chroma_hpp 64.92%
i422 chroma_vss 26.71% sad_x3 45.65% i444 chroma_vpp 37.74% i420 chroma_hps 64.93%
i420 chroma_vss 26.71% i422 add_ps 45.68% i444 chroma_vpp 37.76% i422 chroma_hpp 65.05%
luma_vsp 26.77% addAvg 45.72% addAvg 37.80% i444 chroma_hps 65.06%
luma_vps 27.04% i420 addAvg 45.72% i422 chroma_vpp 37.99% i420 chroma_hpp 65.14%
luma_vpp 27.10% pixelavg_pp 45.80% i444 chroma_vss 38.04% i422 chroma_hps 65.35%
i444 chroma_vss 27.24% i444 chroma_hpp 45.95% i420 chroma_hpp 38.04% i422 chroma_hps 65.63%
i422 chroma_vss 27.24% psyRdoQuant 45.96% luma_vps 38.08% i444 chroma_hps 65.72%
i422 chroma_vps 27.26% luma_vsp 45.97% i444 chroma_vpp 38.09% i422 chroma_hpp 65.80%
i420 addAvg 27.28% sad 46.04% i444 chroma_vpp 38.27% i444 chroma_hpp 65.88%
i422 addAvg 27.28% luma_hvpp 46.17% i422 chroma_vpp 38.27% i420 chroma_hpp 65.92%
addAvg 27.55% luma_vss 46.31% i444 chroma_hps 38.30% i420 chroma_hpp 65.94%
i422 chroma_vpp 27.71% sad_x3 46.36% intra_pred_ang2 38.34% i444 chroma_hps 66.03%
i420 chroma_vpp 27.71% sad_x3 46.42% i444 chroma_hps 38.37% i422 chroma_hps 66.03%
pixel_satd 27.93% luma_vps 46.44% i444 chroma_vpp 38.48% i420 chroma_hps 66.15%
ssd_s 28.04% luma_hpp 46.46% copy_pp 38.51% i422 chroma_hpp 66.20%
pixel_satd 28.10% i444 chroma_vsp 46.66% addAvg 38.54% i422 chroma_hps 66.20%
pixelavg_pp 28.47% sad_x3 46.71% nonPsyRdoQuant 38.57% i420 chroma_hps 66.29%
i420 pixel_satd 28.54% luma_hpp 46.82% sad_x3 38.74% i422 chroma_hpp 66.32%
i422 pixel_satd 28.54% luma_vss 46.88% sad_x3 38.80% i444 chroma_hpp 66.38%
pixel_satd 28.56% i422 chroma_hps 46.99% sad 38.84% i444 chroma_vpp 66.41%
i420 pixel_satd 28.56% intra_pred_ang26 47.26% Weight_sp 38.86% i444 chroma_hps 66.50%
i422 pixel_satd 28.56% luma_vps 47.31% pixel_satd 38.88% i444 chroma_vpp 66.61%
i444 chroma_vps 28.75% luma_hvpp 47.44% i420 pixel_satd 38.88% i444 chroma_vpp 66.63%
luma_vps 28.78% pixelavg_pp 47.50% copy_pp 38.96% i444 chroma_hps 66.64%
luma_vps 28.82% luma_vss 47.64% i422 sub_ps 39.19% i444 chroma_hpp 66.64%
i422 chroma_hps 28.86% luma_vps 47.69% i420 sub_ps 39.34% i420 chroma_hpp 66.64%
i420 chroma_hps 29.02% i420 chroma_hpp 47.78% i420 chroma_hps 39.47% i420 chroma_hpp 66.65%
sad_x3 29.04% i422 chroma_hps 47.82% luma_vpp 39.54% i444 chroma_hps 66.71%
i444 chroma_hps 29.11% luma_vsp 47.93% luma_hvpp 39.63% i422 chroma_hpp 66.71%
luma_vsp 29.13% luma_hvpp 48.30% i444 chroma_vps 39.68% i444 chroma_hps 66.75%
luma_vss 29.26% addAvg 48.40% i420 chroma_vps 39.68% i444 chroma_hps 66.91%
i444 chroma_vss 29.29% i420 addAvg 48.40% luma_hpp 39.72% i422 chroma_hpp 66.92%
luma_vpp 29.39% luma_hps 48.96% addAvg 39.77% i444 chroma_hpp 67.59%
luma_vss 29.59% luma_hps 49.05% convert_p2s 39.79% i444 chroma_hpp 67.78%
        i420 p2s 39.79% i420 chroma_hpp 69.14%
        i444 p2s 39.79% i444 chroma_hpp 69.23%

Appendix B

1080p Test Clips and Bitrates Used

The following 1080p clips were used for generating test results.

Passerby in a verdant sunny park
park_joy_1080p.y4m

Large crowd of joggers in a park
crowd_run_1080p50.y4m

Ducks lollygagging in a blue pond
ducks_take_off_1080p50.y4m

Urban landscape of old European city
old_town_cross_1080p50.y4m

4k Test Clips and Bitrates Used

The following 4k clips were used for generating test results.

Vacation panorama
Netflix_Boat_4096x2160_60fps_10bit_420.y4m

Tango aficionados
Netflix_Tango_4096x2160_60fps_10bit_420.y4m

A rural open market
Netflix_FoodMarket_4096x2160_60fps_10bit_420.y4m

 

Appendix C

Configurations for Testing on Intel® Core™ i7-4500U Processor
System Attribute Value
OS Name Windows 10 Professional
Version 10.0.16299 Build 16299
System Model MS-7A93
System Type x64-based PC
Processor Intel® Core™ i7-4500U CPU @ 3.30GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s)
Core(s) per socket: 2
Thread(s) per core: 2
Socket(s): 1
NUMA node(s): 1
   
BIOS
BIOS Version/Date American Megatrends Inc. 1.00, 6/2/2017
SMBIOS Version 3
BIOS Mode UEFI
   
Graphic Interface:
Version PCI-Express
Link Width x16
Max. Supported x16
   
Memory:
Type DDR3
Channel 1
Size 8 GB
DRAM Frequency 800 MHz
Command Rate (CR) 2T
Configurations for Testing on Intel® Core™ i9-7900X Processor
System Attribute Value
OS Name Microsoft Windows 10 Enterprise
Version 10.0.16299 Build 16299
System Model MS-7A93
System Type x64-based PC
Processor Intel® Core™ i9-7900X CPU @ 3.30GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s)
Core(s) per socket: 10
Thread(s) per core: 2
Socket(s): 1
NUMA node(s): 1
   
BIOS
BIOS Version/Date American Megatrends Inc. 1.00, 6/2/2017
SMBIOS Version 3
BIOS Mode UEFI
   
Graphic Interface:
Version PCI-Express
Link Width x16
Max. Supported x16
   
Memory:
Type DDR4
Channel 2
Size 32 GB
DRAM Frequency 1066.8 MHz
Command Rate (CR) 2T
Configurations for Testing on Intel® Xeon® Platinum 8180 Processor
System Attribute Value
OS Name CentOS
Version 7.2
System Model Intel S4PR1SY2B
System Type x86_64
Processor Intel® Xeon® Platinum 8180 CPU at 2.50 GHz
Core(s) per socket: 28
Thread(s) per core: 2
Socket(s): 2
NUMA node(s): 2
   
BIOS
BIOS Version/Date SE5C620.86B.0X.01.0007.062120172125 / 06/21/2017
SMBIOS Version 2.8
BIOS Mode UEFI
   
Graphic Interface:
Version PCI-Express
Link Width x16
Max. Supported x16
   
Memory:
Type DDR4
Channel 2
Size 192 GB
DRAM Frequency 1333 MHz
Command Rate (CR) 2T

References

  1. David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd Edition, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1998, p. 751.
  2. VideoLAN Organization, x264, The best H.264/AVC encoder. https://www.videolan.org/developers/x264.html
  3. MulticoreWare Inc., x265 HEVC Encoder/H.265 Video Codec. http://x265.org/
  4. G. J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
  5. Intel Corporation, Intel Advanced Vector Instructions 512. https://www.intel.in/content/www/in/en/architecture-and-technology/avx-512-overview.html
  6. Intel Corporation, "Intel® Xeon® Processor Scalable Family Specification Update", February, 2018. https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
  7. x265.org
  8. HandBrake, An OpenSource Video Transcoder. https://handbrake.fr/
  9. FFmpeg, A complete, cross-platform solution to record, convert and stream audio and video.
  10. MulticoreWare Inc., "x265 Receives Significant Boost from Intel Xeon Scalable Processor Family." http://x265.org/x265-receives-significant-boost-intel-xeon-scalable-processor-family/