Game Tuning with Intel®

ID 724199
Updated 4/3/2023

As modern games grow more popular and complex, creating a game that performs well across multiple platforms with different hardware specifications becomes a daunting task. Many variables can cause performance issues that lead to unsatisfactory gameplay. Intel's performance analyzer tools give you a powerful toolset to tackle this complexity and develop games more efficiently.

What You Will Learn

Use this step-by-step guide to:

  • Analyze the performance of your game
  • Determine if your game is GPU- or CPU-bound
  • Resolve bottlenecks with Intel performance analyzer tools

At each step, find relevant resources to dig deeper.

Who This Is For

Game developers who are looking to understand and improve the performance of their code. 

What You Will Need

The Game Tuning Workflow

Step 1: Capture a trace.
Step 2: Analyze the trace to identify if your game is GPU- or CPU-bound.

  • If your game is GPU-bound, go to step 3.
  • If your game is CPU-bound, go to step 5.

Step 3: Capture a stream to understand frame details.
Step 4: Analyze your captured stream or frame to resolve GPU bottlenecks.
Step 5: Analyze your game for CPU bottlenecks.
Step 6: Investigate critical CPU bottlenecks.
Step 7: Get deeper insights.

Game Profiling Workflow

 

Get Familiar with Intel® Graphics Performance Analyzers (Intel® GPA)

Graphics Monitor: The hub tool for the Intel GPA analysis tools. Use Graphics Monitor to configure options for capturing a trace, stream, or frame.

Graphics Trace Analyzer: Use this tool to identify problems with distributing the workload across computing resources:
  • CPU cores
  • CPU threads
  • GPU execution
This tool visualizes several types of activity in your game, from CPU thread activity to GPU execution activity.

Graphics Frame Analyzer: Use this tool to examine a frame and understand the performance impact of specific API calls at various stages of the rendering pipeline. You can dive into draw-call issues and see how they affect frames per second (FPS).

CPU-Bound: When your CPU is continuously busy and your GPU is idle, you are time-bound by the CPU. The GPU cannot do more work because the CPU is too busy to assign it any more work in that frame. Your code is then CPU-bound.

GPU-Bound: If your GPU is continuously busy while your CPU has idle spots, the GPU is limiting game performance and your game is GPU-bound. Optimize the work on the GPU so that it can accept more work from the CPU.

Trace: A record of activity on both the CPU and the GPU during application execution for a specified number of seconds. You capture a trace with Graphics Monitor and analyze it in Graphics Trace Analyzer.

Frame: A collection of data associated with a single computer-generated image.

Stream: A collection of captured frames.
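The CPU-bound and GPU-bound definitions above can be expressed as a simple rule of thumb. The following C++ sketch is hypothetical. The `FrameTiming` struct, the `classify_frame` function, and the 90% busy threshold are illustrative assumptions and are not part of Intel GPA. The sketch compares how much of a frame each processor spent busy. In practice, you read these values off the Graphics Trace Analyzer timeline rather than computing them in code.

```cpp
#include <cassert>

// Hypothetical per-frame timings in milliseconds, e.g. read off a trace timeline.
struct FrameTiming {
    double frame_ms;     // total frame time
    double cpu_busy_ms;  // time the CPU thread(s) feeding the GPU were busy
    double gpu_busy_ms;  // time the GPU spent executing this frame's work
};

enum class Bound { Cpu, Gpu, Neither };

// Rule of thumb (assumption): the processor that is busy for nearly the whole
// frame, while the other one has idle time, is the one limiting the frame rate.
Bound classify_frame(const FrameTiming& t, double busy_threshold = 0.9) {
    assert(t.frame_ms > 0.0);
    const double cpu_util = t.cpu_busy_ms / t.frame_ms;
    const double gpu_util = t.gpu_busy_ms / t.frame_ms;
    if (gpu_util >= busy_threshold && gpu_util > cpu_util) return Bound::Gpu;
    if (cpu_util >= busy_threshold && cpu_util > gpu_util) return Bound::Cpu;
    return Bound::Neither;  // nothing saturated (e.g. frame-rate capped) or no clear winner
}
```

For example, consider a 16.6 ms frame in which the GPU is busy for 16 ms and the CPU for 9 ms. This classifies as GPU-bound, matching the "GPU continuously busy, CPU has idle spots" definition.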
 

Step 1: Capture a Trace

You capture a trace to understand the workings of the GPU and CPU cores during the execution of your game. Use Graphics Monitor to capture a trace and then analyze it using Graphics Trace Analyzer (Step 2). 
In the Graphics Monitor application,

  1. Choose your game executable.
  2. Set any command-line arguments that your game may require.
  3. Select the Trace capture type.
  4. Click Start.
Start a trace capture

 

Tip: Trace durations can get long. Before you capture a trace, set a duration (in seconds) for the trace capture. Use the Trace Duration (sec) option in the Options > Trace menu in Graphics Monitor.

 

Tip: Enable developer mode in Windows* to successfully capture metrics data with Graphics Monitor.  

Clicking Start launches your game executable with a HUD overlay that displays basic metrics and key indicators.

HUD Overlay during trace capture

 

Note: Trace captures can collect several metrics from GPUs made by other vendors, but not every metric is available on every third-party GPU.

 

Learn More:


Step 2: Analyze the Trace 

Next, analyze the captured trace to determine whether your game is GPU- or CPU-bound. After you capture a trace, a thumbnail with a trace icon appears in the upper-right corner of Graphics Monitor. Click the trace thumbnail to open it in Graphics Trace Analyzer.

Open Graphics Trace Analyzer

 

Note: Loading trace data into Graphics Trace Analyzer can take several seconds.


A typical trace capture covers 3 to 5 seconds of gameplay. This is enough for Graphics Trace Analyzer to display several types of information, such as:

  • CPU execution tasks
  • GPU rendering packets
  • Simultaneous visualization of CPU and GPU activity

After the trace loads, zoom in to examine the trace data. Gaps in CPU execution while the GPU is busy indicate that your game is GPU-bound during that time slice. Conversely, gaps in GPU execution while the CPU is continuously busy indicate that your game is CPU-bound.
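Spotting gaps on the timeline is essentially an interval computation. The following C++ sketch merges the busy intervals of one timeline lane within a time window and reports the idle gaps between them. The `Interval` type and the `idle_gaps` and `idle_fraction` functions are hypothetical illustrations and are not a Graphics Trace Analyzer API. Large idle gaps in the CPU lane while the GPU lane has none suggest a GPU-bound slice, and vice versa.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A busy interval on one timeline lane (CPU thread or GPU queue), in any time unit.
struct Interval {
    double begin;
    double end;
};

// Returns the idle gaps inside [window_begin, window_end] not covered by any busy interval.
std::vector<Interval> idle_gaps(std::vector<Interval> busy,
                                double window_begin, double window_end) {
    assert(window_begin <= window_end);
    std::sort(busy.begin(), busy.end(),
              [](const Interval& a, const Interval& b) { return a.begin < b.begin; });
    std::vector<Interval> gaps;
    double cursor = window_begin;  // end of the busy time covered so far
    for (const Interval& iv : busy) {
        const double b = std::max(iv.begin, window_begin);
        const double e = std::min(iv.end, window_end);
        if (e <= b) continue;                       // interval lies outside the window
        if (b > cursor) gaps.push_back({cursor, b});
        cursor = std::max(cursor, e);               // overlapping intervals merge naturally
    }
    if (cursor < window_end) gaps.push_back({cursor, window_end});
    return gaps;
}

// Fraction of the window during which the lane was idle.
double idle_fraction(const std::vector<Interval>& busy,
                     double window_begin, double window_end) {
    assert(window_end > window_begin);
    double idle = 0.0;
    for (const Interval& g : idle_gaps(busy, window_begin, window_end)) {
        idle += g.end - g.begin;
    }
    return idle / (window_end - window_begin);
}
```

Sorting first lets overlapping intervals, such as tasks on different threads of the same lane, merge in a single pass.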

CPU Execution

 

GPU Execution


Learn More:

 

Step 3: Capture a Stream 

For a GPU-bound game, your next step is to capture a stream. A stream captures these details from one or more frames:

  • Textures
  • Buffers
  • Shader calls
  • Hardware counters

Analyze the data from these frames to locate the bottlenecks in your rendering pipeline, so you can optimize your game.

  1. Open Graphics Monitor.
  2. Select the Stream capture type.
  3. In the Options menu, 
    • If you are using Microsoft* DirectX 12 or Vulkan* APIs, select the Defer stream capture option. This enables you to capture multiple streams at any point in your gameplay. 
    • If you are using the Microsoft* DirectX 11 API, the Defer stream capture option is not available. The capture begins at the start of gameplay and ends when you close the capture window.
Select options to enable deferred stream capture

 

Captured stream in multiframe view inside Graphics Frame Analyzer

 

Learn More:


Step 4: Analyze a Stream 

Once you have captured a stream with Graphics Monitor, you can analyze it using Graphics Frame Analyzer. 

  1. In the Graphics Monitor UI, click the thumbnail for the captured stream. This opens the stream in the multiframe view of Graphics Frame Analyzer.


  2. Select a frame and open it. A stream capture records the relevant API calls, and the data for a frame is collected only when you open that frame. This is why opening a frame can take some time.
  3. Profile each draw call in the frame, along with the geometries, textures, buffers, and other resources it uses.
  4. Click the flame icon in the upper left corner to open the Advanced Profiling Mode. In this mode, you can:
    • Observe the top bottlenecks
    • Understand the most relevant metrics and ensure they are satisfactory
    • Analyze the resources (geometries, textures, and buffers added to the frame) to see if any of these are overly complex

These are just some examples of how you can troubleshoot performance issues in your captured stream. 

Tip: When you use Graphics Frame Analyzer to open a captured frame on different GPUs, you can compare performance across those GPUs.

Learn More:


Get Familiar with Intel® VTune™ Profiler

When you write code for games or game engines, use insights from Intel® VTune™ Profiler to tune single-threaded and multithreaded performance. VTune Profiler is a performance analysis tool that identifies the most time-consuming functions in your application and suggests ways to optimize them. It can also help you determine whether your application is CPU- or GPU-bound, resolve CPU bottlenecks, and improve the efficiency of offloading portions of your code onto the GPU.

Simplify game development by using VTune Profiler in these ways:

  • Optimize CPU compute-intensive tasks:
    • Get finer CPU granularity by drilling down to the code level and identifying the slow task, function, line of code, or call stack.
    • Identify reasons for slow CPU performance, such as cache misses and branch mispredictions.
  • Tune CPU threading performance: Use Threading Analysis to examine common parallelism problems, such as thread imbalance and excessive context switching.
  • Tune workload balance and interaction between CPU and GPU: Analyze detailed profile data to identify whether your game or engine is CPU- or GPU-limited. Then find CPU bottlenecks, review a detailed summary, and drill down to the function level.
  • Annotate and sort by frames: Annotate data with frames to see each frame on the timeline. Identify slow and fast frames, filter your data to see only the functions that ran during the slowest frames, or correlate timeline patterns with frame activity.
  • Optimize cache usage: Tune bandwidth-limited software and identify the memory objects that cause bottlenecks.
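VTune Profiler does the frame annotation and filtering for you. To illustrate what "separating slow frames from fast ones" means, the following C++ sketch returns the indices of frames whose duration is at or above a chosen percentile. The `slow_frames` function and the nearest-rank percentile cutoff are assumptions for illustration. VTune Profiler applies its own frame-rate thresholds.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the indices of frames whose duration is at or above the given
// percentile (0 < percentile <= 100) of all frame durations (nearest-rank method).
std::vector<std::size_t> slow_frames(const std::vector<double>& frame_ms,
                                     double percentile) {
    assert(!frame_ms.empty());
    assert(percentile > 0.0 && percentile <= 100.0);
    std::vector<double> sorted = frame_ms;
    std::sort(sorted.begin(), sorted.end());
    // Nearest rank: 1-based position of the percentile value in the sorted list.
    const auto rank = static_cast<std::size_t>(
        std::ceil(percentile * static_cast<double>(sorted.size()) / 100.0));
    const double cutoff = sorted[rank - 1];
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < frame_ms.size(); ++i) {
        if (frame_ms[i] >= cutoff) result.push_back(i);
    }
    return result;
}
```

For example, `slow_frames(frame_times, 95.0)` returns roughly the slowest 5% of frames (more if several frames tie at the cutoff). Those are the frames whose functions are worth inspecting first.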


The following overview describes several features in Intel VTune Profiler that help you profile the performance of your game or game engine.

OS support: Full support on Linux and Windows.

Hotspots/stacks/threads: Two ways to get top functions and call stacks:
  • User-mode sampling via the OS
  • Hardware-based sampling using the VTune driver

Source code view:
  • Source line metrics
  • Assembly view

Hardware utilization analysis:
  • CPU and GPU utilization
  • Memory access
  • I/O access

Instrumentation API: Use the Instrumentation and Tracing Technology (ITT) API in your application to generate and control the collection of trace data while it runs. Unity and Unreal Engine already use the ITT API to support profiling with VTune Profiler.

Graphics API support: Microsoft* DirectX, OpenCL™, SYCL

Engine support: Unity, Unreal Engine

Interface: GUI, CLI

Profiling level: Application, system-wide

Language support: Most languages, including but not limited to:
  • C++
  • Python
  • Java
  • Lua
  • .NET
  • Rust
  • Mixed stacks

XPU support: CPU, hybrid CPU, Intel GPU

 

Learn More:


Step 5: Analyze CPU Bottlenecks 

CPU performance can impact the performance of your game in several ways. To identify a starting point, run a Hotspots analysis in Intel VTune Profiler. Besides providing an overall assessment of thread performance, this analysis provides information about the top functions in your application that consume CPU time.

Summary of hotspots collection with top functions

 

Summary of hotspots collection with CPU Utilization and Frame Rate histograms


VTune Profiler uses the Instrumentation and Tracing Technology (ITT) API to help identify tasks and correlate them with frames and frame rate. Unity and Unreal Engine both use this API to highlight engine-specific tasks. 
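In your own engine code, a common pattern is to wrap ITT task calls in an RAII helper so that every begin has a matching end. The following C++ sketch assumes that you define `USE_ITT` and link against the ITT library that ships with VTune Profiler (`ittnotify.h`). Without that macro, the calls compile to no-ops, so the code also builds on machines without VTune installed. The `ScopedTask` class, its `depth()` counter, and the domain name `MyGame` are hypothetical. `__itt_domain_create`, `__itt_string_handle_create`, `__itt_task_begin`, and `__itt_task_end` are ITT API calls. On Windows, the narrow-string form shown here assumes a build without `UNICODE` defined.

```cpp
#include <cassert>

#ifdef USE_ITT
#include <ittnotify.h>  // ships with VTune Profiler; also link the ittnotify library
#endif

// RAII helper: marks a named task on the VTune timeline for the lifetime of the object.
class ScopedTask {
public:
    explicit ScopedTask(const char* name) {
#ifdef USE_ITT
        // Creating the handle per call keeps the sketch short; caching it avoids repeated lookups.
        __itt_task_begin(domain(), __itt_null, __itt_null, __itt_string_handle_create(name));
#else
        (void)name;  // instrumentation compiled out
#endif
        ++depth_ref();
    }
    ~ScopedTask() {
        assert(depth_ref() > 0);
        --depth_ref();
#ifdef USE_ITT
        __itt_task_end(domain());  // ends the innermost open task on this thread
#endif
    }
    ScopedTask(const ScopedTask&) = delete;
    ScopedTask& operator=(const ScopedTask&) = delete;

    // Number of ScopedTask objects currently open on the calling thread.
    static int depth() { return depth_ref(); }

private:
    static int& depth_ref() {
        thread_local int d = 0;
        return d;
    }
#ifdef USE_ITT
    static __itt_domain* domain() {
        static __itt_domain* d = __itt_domain_create("MyGame");
        return d;
    }
#endif
};
```

For example, placing `ScopedTask task("UpdatePhysics");` at the top of a physics update function marks that function's duration as a task. To mark frame boundaries, ITT also provides `__itt_frame_begin_v3` and `__itt_frame_end_v3`. VTune Profiler uses these to group data by frame.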

Note: Intel VTune Profiler supports a wide variety of applications and workloads. Some of its insights and recommendations, such as aiming for 100% utilization of all available CPUs, may not suit gaming workloads.

 

Learn More:

 

Step 6: Investigate Critical CPU Bottlenecks

The results of a Hotspots analysis include two important sets of information: threading issues and hotspots in functions.

Threading Issues

Here are some common threading issues you can observe from a hotspots analysis:

Poor parallelism

When a task has the resources it needs to execute but must wait for another task to finish first, the cause could be lock contention or a need for additional threads.
For example, in the figure below, threads are running in parallel, but most threads finish and then spin while they wait for the rest of the threads to complete.

Thread execution over time.


Run a Threading analysis in VTune Profiler to visualize locks and waits and identify synchronization problems. 
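One common remedy for the imbalance described above, where threads finish early and then wait, is dynamic scheduling. Instead of giving each thread one fixed slice of the work, each thread repeatedly claims the next small chunk from a shared atomic counter, so threads that finish early keep working. The following C++ sketch uses only the standard library. `parallel_for_dynamic` and its chunk-size parameter are illustrative choices. Production engines usually rely on a job system or a library such as oneTBB instead.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs body(i) for every i in [0, count) on num_threads threads. Work is handed
// out in chunks from a shared counter, so no thread idles while work remains.
void parallel_for_dynamic(std::size_t count, std::size_t num_threads,
                          std::size_t chunk,
                          const std::function<void(std::size_t)>& body) {
    assert(num_threads > 0 && chunk > 0);
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (;;) {
            const std::size_t begin = next.fetch_add(chunk);  // claim the next chunk
            if (begin >= count) break;
            const std::size_t end = std::min(begin + chunk, count);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };
    std::vector<std::thread> threads;
    for (std::size_t t = 1; t < num_threads; ++t) threads.emplace_back(worker);
    worker();  // the calling thread participates too
    for (std::thread& th : threads) th.join();
}
```

Smaller chunks balance uneven work better but touch the shared counter more often. Re-running a Threading analysis after the change shows whether the wait time actually dropped.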

 

Threading Overhead

When a game runs more threads than there are available CPUs, the OS scheduler can spend extra time switching between them. Run a Threading analysis in VTune Profiler to see how context switches and thread transitions affect the performance of your game.

A large number of transitions happening between many threads that spend significant time waiting.
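A simple guard against the oversubscription described above is to size worker pools from the number of hardware threads rather than from the number of tasks. This C++ sketch uses `std::thread::hardware_concurrency()`, which may return 0 when the value is unknown. The `cap_worker_count` and `worker_thread_count` helpers, the fallback value of 2, and the choice to reserve one hardware thread for other busy threads are assumptions, not VTune recommendations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Caps a requested worker count at the hardware threads left after reserving
// some for other busy threads (e.g. the main and render threads).
std::size_t cap_worker_count(std::size_t requested, std::size_t hardware_threads,
                             std::size_t reserved) {
    assert(requested > 0);
    if (hardware_threads == 0) hardware_threads = 2;  // unknown: conservative guess (assumption)
    const std::size_t available =
        hardware_threads > reserved ? hardware_threads - reserved : 1;
    return std::min(requested, available);
}

// Convenience wrapper using the value reported by the standard library.
std::size_t worker_thread_count(std::size_t requested, std::size_t reserved = 1) {
    return cap_worker_count(requested, std::thread::hardware_concurrency(), reserved);
}
```

Keeping the pure helper separate from the wrapper makes the sizing rule easy to test without depending on the machine running the tests.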

 

Hotspots in Functions

Next, start resolving hotspots that you observe in your functions. 
Identify a function to optimize, and then run hardware event-based sampling to get more details about its actual performance.

Unnecessary computes 

Based on the results of your Hotspots analysis, you may observe a significant number of unnecessary compute operations. These can arise in several ways, for example:

  • Repeating the same calculation in a loop
  • Calling a function that does more work than what is needed for a particular task (see figure below)
A string function consumes most of the CPU time, which may include unnecessary instructions.

 

Double-click a function name to open the source code view and see which lines of code execute the most. This can help you identify redundant operations and reduce the number of instructions in the function.
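A classic example of repeating the same calculation in a loop is re-evaluating a string's length in the loop condition. This turns a linear pass into a quadratic one. The following C++ sketch shows the redundant version and a version with the invariant hoisted out of the loop. The function names are hypothetical. Optimizing compilers can sometimes hoist such calls on their own, so the source view in VTune Profiler is the place to confirm whether the redundant work actually shows up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Redundant: std::strlen rescans the whole string on every iteration, O(n^2).
std::size_t count_spaces_slow(const char* s) {
    std::size_t spaces = 0;
    for (std::size_t i = 0; i < std::strlen(s); ++i) {
        if (s[i] == ' ') ++spaces;
    }
    return spaces;
}

// Hoisted: the length is computed once before the loop, so the pass is O(n).
std::size_t count_spaces_fast(const char* s) {
    assert(s != nullptr);
    const std::size_t len = std::strlen(s);
    std::size_t spaces = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (s[i] == ' ') ++spaces;
    }
    return spaces;
}
```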

 

Older Software

You may also see a hotspot in an engine task or third-party library. If you are using an older version of the game engine or any other software to run your game, you may see an improvement when upgrading to a newer version of that game engine or software.


Learn More:


Step 7: Get Deeper Insights

Continue exploring CPU bottlenecks by focusing on these aspects:


Inefficient Memory Access

This occurs when data is stored in one pattern but accessed in another. If the data cannot be read sequentially, the CPU cannot make efficient use of the cache. Run the Memory Access analysis in VTune Profiler to see whether slow memory accesses cause a performance issue.

Memory access analysis showing memory utilization.
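As a minimal illustration of data stored in one pattern but accessed in another, consider a matrix stored row by row in one contiguous buffer. Summing it row by row walks memory sequentially. Summing it column by column jumps `cols` elements between accesses, which on large matrices typically causes far more cache misses. The `Matrix` struct and both summing functions below are hypothetical and return the same result. Only the access pattern differs, which is the kind of difference a Memory Access analysis is designed to expose.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Matrix stored in row-major order: element (r, c) lives at data[r * cols + c].
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
    float at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
};

// Cache-friendly: the inner loop touches consecutive addresses.
double sum_row_order(const Matrix& m) {
    double sum = 0.0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c) sum += m.at(r, c);
    return sum;
}

// Cache-unfriendly: the inner loop strides by `cols` floats between accesses.
double sum_column_order(const Matrix& m) {
    double sum = 0.0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r) sum += m.at(r, c);
    return sum;
}
```

The usual fix is to make the loop order match the storage order, or to change the storage layout to match the dominant access pattern.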

 

Poor Microarchitecture Usage

When a hotspot has a high CPI (cycles per instruction) rate, its instructions take many clock cycles to execute on average. On a core that can retire four instructions per clock cycle, the ideal CPI is 0.25 (¼ of a clock cycle per instruction), but a number of issues can make it much higher. To identify the bottleneck at this level, use the Microarchitecture Exploration analysis.

Microarchitecture analysis on a hybrid CPU platform with metrics for P-Core and E-Core.
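CPI is the ratio of unhalted core clock cycles to retired instructions. The following worked example uses hypothetical counts, not measured data. It shows how far a hotspot sits from the ideal value of 0.25 for a core that retires four instructions per cycle:

```latex
\mathrm{CPI} = \frac{\text{unhalted core cycles}}{\text{instructions retired}}
             = \frac{2.0 \times 10^{9}}{1.6 \times 10^{9}} = 1.25,
\qquad
\mathrm{IPC} = \frac{1}{\mathrm{CPI}} = 0.8,
\qquad
\frac{1.25}{0.25} = 5
```

In this example, each instruction takes five times as many cycles as the ideal, which is a signal to look at the stall categories that Microarchitecture Exploration reports.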


                  
Learn More:

 

Summary

In game performance, while the GPU does most of the heavy lifting, it is the CPU that assigns work to the GPU. Efficient game performance relies on efficient GPU as well as CPU performance. Resolving CPU bottlenecks ensures that you employ the full potential of your CPU, which in turn drives GPU use to its full potential. A combination of performance profiling tools can help you get the most performance from the games you develop.
