Introduction
This guide is for users who are already familiar with media processing. It provides recommendations for configuring hardware and software that will yield reasonable baseline performance for generic media processing use cases on 4th Gen Intel® Xeon® Scalable processors. However, media processing is a complex domain, and optimal performance may require considerations beyond the scope of this tuning guide.
4th Gen Intel Xeon Scalable processors deliver workload-optimized performance with improved architecture and built-in acceleration for AI, encryption, HPC, storage, database systems, and networking. They feature unique security technologies to help protect data on-premises or in the cloud.
Improvements that directly benefit basic media processing include increased core counts, memory performance with DDR5, and larger caches.
It’s not uncommon for applications to integrate basic media processing with artificial intelligence, content distribution, real-time streaming, or a variety of other functions. Noteworthy features include:
- New built-in accelerators for AI, HPC, networking, security, storage, and analytics
- Intel® Ultra Path Interconnect (Intel® UPI)
- Intel® Speed Select Technology
- Hardware-enhanced security
- New flex bus I/O interface (PCIe* 5.0 + CXL)
- New flexible I/O interface with up to 20 HSIO lanes (PCIe* 3.0)
- Increased multisocket bandwidth with UPI 2.0 (up to 16 GT/s)
- Intel® Data Streaming Accelerator
Tuning guidance spans hardware, firmware, and software domains. Some parameters, like memory population, have only one mechanism of adjustment. Others, like scaling governors, can be adjusted through several mechanisms, such as BIOS settings or operating system APIs. It’s not uncommon for a higher level of the solution stack to modify settings made lower in the stack. To ensure the intended tuning settings are active during execution, we encourage instrumentation that reads critical parameters at runtime. If runtime settings do not reflect the specified tuning settings, it’s possible, even likely, that downstream firmware or software is changing them.
Reference Workload
Intel employs an FFmpeg-based media transcode benchmark as a reference for general media processing. Tuning recommendations in this guide seek to provide good performance across the aggregate benchmark results. Intel does not license the benchmark, but replication is possible using the guidance below.
Use Cases
The benchmark implements file-based, single-stream-in, single-stream-out transcodes for 24 use cases spanning four codecs, as shown in the table below.
Codec | Input Resolution | Xcode Type | Preset | ISA | GOP Length (seconds) | Frames to Encode | Output Res (matches input) | Output FPS (matches input) | Output Bitrate (Mb/s) | Max Bitrate (Mb/s) | Buffer Size (Mb) | Profile | Pre-video Switches | Other Switches | Encoder Params |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
svt av1 | FHD | 1:1 | 5 | AVX2 | 2 | all | FHD | 60 | 4 | 12 | 16 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | FHD | 1:1 | 5 | AVX3 | 2 | all | FHD | 60 | 4 | 12 | 16 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | FHD | 1:1 | 8 | AVX2 | 2 | all | UHD4Kc | 60 | 4 | 24 | 48 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | FHD | 1:1 | 8 | AVX3 | 2 | all | UHD4Kc | 60 | 4 | 24 | 20 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | FHD | 1:1 | 12 | AVX2 | 2 | all | UHD4Kc | 60 | 4 | 24 | 20 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | FHD | 1:1 | 12 | AVX3 | 2 | all | UHD4Kc | 60 | 4 | 24 | 20 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | UHD4Kc | 1:1 | 8 | AVX2 | 2 | all | UHD4Kc | 60 | 9 | 18 | 36 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | UHD4Kc | 1:1 | 8 | AVX3 | 2 | all | UHD4Kc | 60 | 9 | 18 | 36 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | UHD4Kc | 1:1 | 12 | AVX2 | 2 | all | FHD | 60 | 9 | 18 | 16 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt av1 | UHD4Kc | 1:1 | 12 | AVX3 | 2 | all | FHD | 60 | 9 | 10 | 16 | n/a | -rc 1 -g 119 -sc_detection 0 | ||
svt hevc | FHD | 1:1 | 1 | AVX2 | 2 | all | UHD4Kc | 60 | 5 | 18 | 20 | Main | -rc 1 -g 119 -sc_detection 0 | ||
svt hevc | FHD | 1:1 | 5 | AVX2 | 2 | all | UHD4Kc | 60 | 5 | 18 | 36 | Main | -rc 1 -g 119 -sc_detection 0 | ||
svt hevc | FHD | 1:1 | 5 | AVX3 | 2 | all | UHD4Kc | 60 | 5 | 18 | 36 | Main | -rc 1 -g 119 -sc_detection 0 | ||
svt hevc | FHD | 1:1 | 9 | AVX2 | 2 | all | UHD4Kc | 60 | 5 | 18 | 36 | Main | -rc 1 -g 119 -sc_detection 0 | ||
svt hevc | UHD4Kc | 1:1 | 1 | AVX3 | 2 | all | FHD | 60 | 12 | 10 | 36 | Main10 | -rc 1 -g 119 -sc_detection 0 | ||
svt hevc | UHD4Kc | 1:1 | 5 | AVX3 | 2 | all | FHD | 60 | 12 | 10 | 36 | Main10 | -rc 1 -g 119 -sc_detection 0 | ||
svt hevc | UHD4Kc | 1:1 | 9 | AVX2 | 2 | all | FHD | 60 | 12 | 24 | 36 | Main10 | -rc 1 -g 119 -sc_detection 0 | ||
svt hevc | UHD4Kc | 1:1 | 9 | AVX3 | 2 | all | FHD | 60 | 12 | 10 | 36 | Main10 | -rc 1 -g 119 -sc_detection 0 | ||
x264 | FHD | 1:1 | fast | AVX2 | 2 | all | FHD | 60 | 6 | 8 | 24 | High | -tune=psnr | keyint=120;min-keyint=120:sliced-threads=0;scene-cut=0;threads=4 | |
x264 | FHD | 1:1 | medium | AVX2 | 2 | all | FHD | 60 | 6 | 8 | 24 | High | -tune=psnr | keyint=120;min-keyint=120:sliced-threads=0;scene-cut=0;threads=4 | |
x264 | FHD | 1:1 | very slow | AVX2 | 2 | all | FHD | 60 | 5 | 8 | 48 | High | -tune=psnr | keyint=240;-min-keyint=240:sliced-threads=0;scene-cut=0;threads=8 | |
x265 | FHD | 1:1 | medium | AVX3 | 2 | all | FHD | 60 | 5 | 8 | 48 | Main | -tune=psnr | keyint=120,min-keyint=120:pools=4 | |
x265 | FHD | 1:1 | medium | AVX2 | 2 | all | FHD | 60 | 5 | 8 | 48 | Main | -tune=psnr | keyint=120,min-keyint=120:pools=4 | |
x265 | UHD4Kc | 1:1 | very slow | AVX2 | 2 | all | FHD | 60 | 12 | 8 | 48 | Main10 | -tune=psnr | keyint=240;min-keyint=240;pools=8 | |
Input File
The input file aggregates 10 scenes from a variety of video content types including animated, live, and gaming. Each scene has a leading key frame and length of 240 frames, or 4 seconds at 60 fps. The entire input video is 2400 frames, or 40 seconds at 60 fps.
The input file is rendered at both FHD and consumer 4K resolution and matched to the output resolution of the use case for performance measurement purposes.
CPU Core Loading Methodology
Loading cores to achieve effective processor utilization (90%+) without thrashing the scheduler is important for accurate results. Intel dispatches FFmpeg instances based on the logical core count. The formula for each use case is captured in the following table.
Codec | Resolution | MSO | Preset | FFmpeg Instances | Threads/Encode |
---|---|---|---|---|---|
x264 | FHD | 1:1 | very slow | Ceiling(Logical Cores/4) | 8† |
x264 | FHD | 1:1 | medium | Ceiling(Logical Cores/2) | 8† |
x264 | FHD | 1:1 | fast | Ceiling(Logical Cores/2) | 8† |
x265 | UHD4Kc | 1:1 | very slow | Ceiling(Logical Cores/16) | 32† |
x265 | FHD | 1:1 | medium | Ceiling(Logical Cores/4) | 16† |
svt hevc | FHD | 1:1 | 1 | Ceiling(Logical Cores/8) | ‡ |
svt hevc | FHD | 1:1 | 5 | Ceiling(Logical Cores/8) | ‡ |
svt hevc | FHD | 1:1 | 9 | Ceiling(Logical Cores/4) | ‡ |
svt hevc | UHD4Kc | 1:1 | 1 | Ceiling(Logical Cores/12) | ‡ |
svt hevc | UHD4Kc | 1:1 | 5 | Ceiling(Logical Cores/12) | ‡ |
svt hevc | UHD4Kc | 1:1 | 9 | Ceiling(Logical Cores/12) | ‡ |
svt av1 | FHD | 1:1 | 12 | Ceiling(Logical Cores/8) | ‡ |
svt av1 | FHD | 1:1 | 8 | Ceiling(Logical Cores/8) | ‡ |
svt av1 | FHD | 1:1 | 5 | Ceiling(Logical Cores/8) | ‡ |
svt av1 | UHD4Kc | 1:1 | 12 | Ceiling(Logical Cores/12) | ‡ |
svt av1 | UHD4Kc | 1:1 | 8 | Ceiling(Logical Cores/12) | ‡ |

† Set via encoder params in the FFmpeg command line.
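The instance formulas in the table above can be sketched in a few lines of Python. This is only an illustration, not Intel's benchmark harness: the dictionary layout, the key spelling, and the function name `ffmpeg_instances` are assumptions made for the example, and only the divisors come from the table.

```python
import math
import os

# Divisors copied from the core-loading table:
#   instances = Ceiling(logical cores / divisor)
# Keys are (codec, resolution, preset). The key spelling is an assumption
# made for this sketch; "UHD4Kc" is used for every consumer-4K row.
INSTANCE_DIVISORS = {
    ("x264", "FHD", "very slow"): 4,
    ("x264", "FHD", "medium"): 2,
    ("x264", "FHD", "fast"): 2,
    ("x265", "UHD4Kc", "very slow"): 16,
    ("x265", "FHD", "medium"): 4,
    ("svt hevc", "FHD", "1"): 8,
    ("svt hevc", "FHD", "5"): 8,
    ("svt hevc", "FHD", "9"): 4,
    ("svt hevc", "UHD4Kc", "1"): 12,
    ("svt hevc", "UHD4Kc", "5"): 12,
    ("svt hevc", "UHD4Kc", "9"): 12,
    ("svt av1", "FHD", "12"): 8,
    ("svt av1", "FHD", "8"): 8,
    ("svt av1", "FHD", "5"): 8,
    ("svt av1", "UHD4Kc", "12"): 12,
    ("svt av1", "UHD4Kc", "8"): 12,
}


def ffmpeg_instances(codec, resolution, preset, logical_cores=None):
    """Return the number of concurrent FFmpeg instances for one use case.

    If logical_cores is not given, the count reported by the operating
    system is used (os.cpu_count() can return None, hence the fallback).
    """
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    divisor = INSTANCE_DIVISORS[(codec, resolution, preset)]
    return math.ceil(logical_cores / divisor)
```

For a server with 192 logical cores, `ffmpeg_instances("x264", "FHD", "very slow", 192)` gives 48 instances and any of the UHD4Kc SVT rows gives 16.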
Tuning
Platform Selection Considerations
Maximum Memory Speed
All processors in the 4th Gen Intel Xeon processor family are enabled for DDR5 memory, but maximum speed is a function of the specific processor SKU. SKUs are available that support maximum speeds of 4000 MT/s, 4400 MT/s, and 4800 MT/s. Because media transcode generally benefits from faster memory speeds, Intel recommends selecting SKUs that support 4800 MT/s for best performance.
System Settings
The sections below describe parameters that can be set via the BIOS and/or operating system. Recommended settings yield good performance for the general media processing use case. The appropriate configuration for your application may vary.
IMPORTANT: Not all machines have the same mechanisms for setting performance. Techniques can vary widely by brand, model, architecture, and BIOS. Users who are not familiar with a system under test (SUT) are strongly encouraged to get guidance from an informed performance engineer.
Safe & Known Default
To help ensure a safe and known starting point, reset default settings in the BIOS and host operating system. BIOS reset is typically available in the BIOS subsystem; consult OEM guidance. Operating system reset is typically documented as part of the operating system distribution; consult the developer.
General Parameters
The six parameters listed here are well established and have been recognized for several years. The recommended setting and a short description of each are provided.
Tuning Parameter | Typical* Location | Recommended Setting | Description |
---|---|---|---|
Power & Policy | BIOS | Performance | Optimizes system for performance. |
CPU Frequency Governors | OS | Performance | CPU set to highest available frequency. |
Turbo Boost | BIOS | Enabled | Allows CPU to sustain max turbo frequency. |
C-States | BIOS | Disabled | Prevents CPU transition to low power states. |
Uncore Frequency | BIOS | Minimum | Ensure power to core is prioritized. |
Hyperthreading | BIOS | Enabled | Enables one physical microprocessor to behave like two logical microprocessors. |

* Not all machines have the same mechanism for setting tuning parameters. Locations are shown as general guidance but are neither prescriptive nor exclusive. Refer to hardware and software reference material for clarification. Instrumenting workloads to capture actual performance parameters at runtime is strongly encouraged.
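As one way to instrument a workload, the sketch below reads a few of these parameters from Linux sysfs at runtime. It assumes a Linux host. The turbo check additionally assumes the `intel_pstate` driver is active, and the SMT file is only present on reasonably recent kernels. The function names and the returned dictionary layout are illustrative, and missing files are reported as `None` rather than guessed.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def read_runtime_settings(sysfs_root="/sys/devices/system/cpu"):
    """Collect governor, turbo, and SMT state as the OS currently reports them.

    sysfs_root is a parameter so the function can be pointed at a copy of
    the tree; the default is the standard Linux location.
    """
    root = Path(sysfs_root)

    # One scaling_governor file exists per logical CPU; collect distinct values.
    governors = set()
    for gov_file in root.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        value = read_text(gov_file)
        if value is not None:
            governors.add(value)

    # intel_pstate exposes no_turbo: "0" means turbo is allowed.
    no_turbo = read_text(root / "intel_pstate" / "no_turbo")
    # smt/active is "1" when hyperthreading is currently in use.
    smt_active = read_text(root / "smt" / "active")

    return {
        "governors": sorted(governors),
        "turbo_enabled": None if no_turbo is None else no_turbo == "0",
        "smt_active": None if smt_active is None else smt_active == "1",
    }
```

Calling `read_runtime_settings()` just before and just after a benchmark run, and comparing the result with the recommended settings, is a simple way to detect that another layer of the stack has changed a parameter.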
Homeless Prefetcher
The homeless prefetcher manages demand misses into the mid-level cache. It should be disabled for 4-tile dies, commonly referred to as extreme core count (XCC). It should be enabled for monolithic medium core count (MCC) dies, which are typically preferred for professional media processing.
Sub-NUMA Cluster (SNC)
SNC replaces Cluster-On-Die (COD), a feature found in previous processor families, and provides similar localization benefits without some of COD’s downsides. SNC breaks up the last level cache (LLC) into disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. This improves average latency to the LLC.
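When SNC is enabled, Linux normally exposes each cluster as its own NUMA node, so the node count is a quick runtime indicator of the active setting. The sketch below assumes a Linux host with the standard sysfs layout. The function names are illustrative, and the socket count must be supplied by the user because it is not derived here.

```python
import re
from pathlib import Path


def numa_node_ids(node_root="/sys/devices/system/node"):
    """Return the sorted NUMA node IDs the operating system exposes."""
    ids = []
    for entry in Path(node_root).glob("node[0-9]*"):
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


def snc_appears_enabled(node_count, sockets):
    """Heuristic: more NUMA nodes than sockets suggests sub-NUMA clustering.

    Other platform features can also add NUMA nodes, so treat the result
    as a hint to check the BIOS setting, not as proof.
    """
    return node_count > sockets
```

On a two-socket server, `snc_appears_enabled(len(numa_node_ids()), 2)` returning `True` would be a prompt to confirm the SNC setting in the BIOS.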
Memory Configuration
Media transcoding workloads are sensitive to memory speed and configuration. Select the fastest memory supported on the architecture. Populate each memory channel to minimize the distance data must travel to and from the CPU cores. The number and capacity of the DIMMs should be sized to accommodate the buffering requirements for the encoder, resolution, and desired quality. Details are discussed in the sections that follow.
Memory Speed
As mentioned previously, for best performance select 4th Gen Intel Xeon processor SKUs that support a maximum speed of 4800 MT/s. Ensure your DDR5 DIMMs are rated for 4800 MT/s or faster. Memory faster than 4800 MT/s will not improve performance because the processor does not support higher speeds. Memory slower than 4800 MT/s will slow the system down to match the speed of the memory.
DIMM Population
The 4th Gen Intel Xeon Scalable processor (formerly code named Sapphire Rapids) has an 8-channel memory architecture with DDR5 support up to 4800 MT/s. The memory controller supports up to 2 slots per channel. As a result, the typical mainboard will have 16 DIMM slots for each CPU (8 channels/CPU x 2 slots/channel = 16 slots/CPU). For transcoding applications like the FFmpeg media benchmark, it is important to populate every channel, not every slot. Therefore, 4th Gen Intel Xeon Scalable processors should be populated with a minimum of 8 DIMMs per CPU, meaning that every other slot can be empty. A single-CPU machine should have at least 8 DDR5 DIMMs. A dual-socket machine should have at least 16 DDR5 DIMMs.
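The population and speed rules from the last two sections reduce to simple arithmetic, sketched below. The constants come from the text. The function names are illustrative, and the speed rule is a simplification that looks only at the processor and DIMM ratings.

```python
CHANNELS_PER_CPU = 8    # memory channels per processor, from the text
SLOTS_PER_CHANNEL = 2   # maximum DIMM slots per channel, from the text


def total_dimm_slots(sockets):
    """DIMM slots on a typical mainboard with the given number of CPUs."""
    return sockets * CHANNELS_PER_CPU * SLOTS_PER_CHANNEL


def minimum_dimms(sockets):
    """Minimum DIMM count that places one DIMM in every memory channel."""
    return sockets * CHANNELS_PER_CPU


def effective_memory_speed(cpu_max_mts, dimm_mts):
    """Memory runs no faster than the slower of the CPU and DIMM ratings (MT/s)."""
    return min(cpu_max_mts, dimm_mts)
```

For a dual-socket server, `minimum_dimms(2)` is 16 of the 32 available slots, and `effective_memory_speed(4800, 4400)` shows a 4400 MT/s DIMM holding a 4800 MT/s SKU back to 4400 MT/s.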
Memory Sizing
Sizing the DIMMs can be more complicated. General guidance is to ensure at least 2 GB of free memory per logical core. With hyperthreading enabled, 4th Gen Intel Xeon Scalable processors yield two logical cores for each physical core. For example, the Intel Xeon 8468 processor is a 48-core CPU intended for two-socket applications. A two-socket server provides 192 logical cores (2 sockets * 48 physical cores/socket * 2 logical cores/physical core). Allocating 2 GB of free memory per logical core will support most use cases up to 4K transcode. The free memory target in this case is 384 GB (2 GB/logical core * 192 logical cores).
A couple of additional notes on memory sizing: Many applications can get by with significantly less memory than 2 GB per logical core. Some applications may require more. Firmware and operating systems carry memory overhead. Memory is a significant cost driver. Profiling specific instances of end-to-end (E2E) platforms is required to minimize cost while ensuring maximum performance.
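The sizing rule of thumb above can be written as a small calculation. This is a sketch of the guidance only. The function names are illustrative, and the result is a free-memory target that does not include firmware or operating system overhead.

```python
def logical_cores(sockets, physical_cores_per_socket, threads_per_core=2):
    """Logical core count; threads_per_core is 2 with hyperthreading enabled."""
    return sockets * physical_cores_per_socket * threads_per_core


def free_memory_target_gb(logical_core_count, gb_per_logical_core=2):
    """Free memory target using the 2 GB per logical core guidance."""
    return logical_core_count * gb_per_logical_core
```

For the two-socket, 48-core example, `logical_cores(2, 48)` is 192 and `free_memory_target_gb(192)` is 384 GB. Lowering `gb_per_logical_core` models applications that profiling shows can run with less.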
Storage, Disk Configuration, and Settings
For file-based transcode, server-class SSDs will deliver adequate I/O performance. Other applications may benefit from the implementation of RAM-disks.
Network Configuration and Setting
Generally, offline media processing applications are not limited by network bandwidth. Live applications and adjacent use cases, like video production, will have varying requirements.
Related Tools and Information
There are a variety of mechanisms for setting or modifying tuning parameters. Sometimes operating systems, tools, or applications change parameters that were carefully set at system start-up. To ensure that settings at workload execution match intentions, it is recommended to query the system configuration using the Intel® System Health Inspector [1], also known as svr-info, or the Intel® Power Thermal Utility [2].
Feedback
We value your feedback. If you have comments (positive or negative) on this guide or are seeking something that is not part of this guide, let us know.
References
Viewable with a signed CNDA.