Model Performance of a C++ App Ported to a GPU

Intel® Advisor Cookbook

Download PDF

ID 767152

Date 6/24/2024

Version

Public

Model Performance of a C++ App Ported to a GPU

This recipe shows how to check the profitability of offloading your application to a graphics processing unit (GPU) accelerator using the Offload Modeling perspective of the Intel® Advisor from the command line interface (CLI).

For the Offload Modeling perspective, the Intel Advisor runs your application on a baseline CPU or GPU and models application performance and behavior on the specified target GPU. To model application performance on a target GPU device, you can use the of the three workflows:

CPU-to-GPU modeling for native applications (C, C++, Fortran) running on a CPU
CPU-to-GPU modeling for SYCL, OpenMP* target, and OpenCL™ applications offloaded to a CPU
GPU-to-GPU modeling for SYCL, OpenMP* target, and OpenCL™ applications running on a GPU

In this recipe, use the Intel Advisor CLI to analyze performance of C++ and SYCL applications with the Offload Modeling perspective to estimate the profitability of offloading the applications to the Intel® Iris® X^e graphics (gen12_tgl configuration).

Directions:

Prerequisites.
Explore offload opportunities.
View the results.
Examine performance speedup on the target GPU.
Alternative steps.

Scenario

Offload Modeling consists of several steps depending on the workflow:

CPU-to-GPU Modeling	GPU-to-GPU Modeling
Get the baseline performance data with the Survey analysis. Get call count data, the number of floating-point and integer operations, simulate cache and memory traffic on target device with the Characterization analysis. [Optional] Check for loop-carried dependencies with the Dependencies analysis. Model application performance on the specified target GPU with the Performance Modeling analysis.	Measure the hardware metrics of GPU-enabled kernels (for example, memory traffic) with the Survey analysis. Get the number of floating-point and integer operations on target device with the Characterization analysis. Model application performance on the specified target GPU with the Performance Modeling analysis.

CPU-to-GPU Modeling

GPU-to-GPU Modeling

Get the baseline performance data with the Survey analysis.
Get call count data, the number of floating-point and integer operations, simulate cache and memory traffic on target device with the Characterization analysis.
[Optional] Check for loop-carried dependencies with the Dependencies analysis.
Model application performance on the specified target GPU with the Performance Modeling analysis.

Measure the hardware metrics of GPU-enabled kernels (for example, memory traffic) with the Survey analysis.
Get the number of floating-point and integer operations on target device with the Characterization analysis.
Model application performance on the specified target GPU with the Performance Modeling analysis.

Intel Advisor allows you to run all analyses for the Offload Modeling perspective with a single command using special command line presets. You can control the analyses to run by selecting an accuracy level.

Ingredients

This section lists the hardware and software used to produce the specific result shown in this recipe:

Performance analysis tools: Intel Advisor 2021
Available for download as a standalone and as part of the Intel® oneAPI Base Toolkit.
Application: Mandelbrot is an application that generates a fractal image by matrix initialization and performs pixel-independent computations. There are two implementations available for download:
- A native C++ implementation MandelbrotOMP
- A SYCL implementation Mandelbrot
Compiler:
- Intel® C++ Compiler Classic 2021 (icpc)
  Available for download as a standalone and as part of the Intel® oneAPI HPC Toolkit.
- Intel® oneAPI DPC++/C++ Compiler 2021 (dpcpp)
  Available for download as a standalone and as part of the Intel® oneAPI Base Toolkit.
Operating system: Ubuntu* 20.04.2 LTS
CPU: Intel® Core™ i7-8559U to model CPU application performance on a target GPU (CPU-to-GPU modeling flow)
GPU: Intel® Iris® Plus Graphics 655 to model GPU application performance on a different target GPU (GPU-to-GPU modeling flow)

You can download a precollected Offload Modeling report for the Mandelbrot application to follow this recipe and examine the analysis results.

Prerequisites

Set up environment variables for the oneAPI tools. For example:
```
source /setvars.sh
```
Compile the applications.
- Compile the native C++ implementation of the Mandelbrot application:
```
cd MandelbrotOMP/ && make
```
- Compile the SYCL implementation of the Mandelbrot application:
```
cd mandelbot/ && mkdir build && cd build && cmake .. && make -j
```
For the SYCL implementation of the application running on GPU: Configure your system to analyze GPU kernels.

Explore Offload Opportunities

Run the Offload Modeling collection preset with the medium accuracy.

The medium accuracy is default, so you do not need to provide any additional options to the command. To collect performance data and model application performance with medium accuracy, run one of the following commands depending on a workflow:

To run the CPU-to-GPU modeling for the native Mandelbrot application:


advisor --collect=offload --config=gen12_tgl --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

To run the GPU-to-GPU modeling for the SYCL implementation of Mandelbrot application:


advisor --collect=offload --gpu --accuracy=low --project-dir=./gpu2gpu_offload_modeling -- ./src/Mandelbrot

NOTE:

You can change a target GPU for modeling by providing a different value to the --config option. See config for details and a full list of options.

For the low accuracy, Intel Advisor runs the following analyses:

Survey
Characterization with trip count and FLOP collection, cache simulation, and data transfer modeling
Performance Modeling

To see analyses executed for the medium accuracy, you can type the execution command with the --dry-run option:


advisor --collect=offload --config=gen12_tgl --dry-run --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

To generate commands for the GPU-to-GPU modeling, add the --gpu option to the command above.

The commands will be printed to the terminal:

advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1
advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=gen12_tgl --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1
advisor --collect=projection --no-assume-dependencies --config=gen12_tgl --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

View the Results

Intel Advisor stores the results of analyses with analysis configurations in the cpu2gpu_offload_modeling and gpu2gpu_offload_modeling directories specified with --project-dir option. You can view the collected results in several output formats.

View Result Summary in Terminal

After you run the command, the result summary is printed to the terminal. The summary contains the timings on the baseline and target devices, total predicted speedup, and the table with metrics per each of top five offload candidates with the highest speedup.

Result summary for the CPU-to-GPU modeling of the native C++ Mandelbrot application:

Result summary for the GPU-to-GPU modeling of the SYCL implementation of the Mandelbrot application:

View the Results in the Intel Advisor GUI

If you have the Intel Advisor graphical user interface (GUI) installed on your system, you can open the results there. In this case, you open the existing Intel Advisor results without creating any additional files or reports.

To open the CPU-to-GPU modeling result in Intel Advisor GUI, run this command:


advxe-gui ./cpu2gpu_offload_modeling

View an Interactive HTML Report in a Web Browser

After you run the Offload Modeling using Intel Advisor CLI, an interactive HTML report is generated automatically. You can view it at any time in your preferred web browser and you do not need the Intel Advisor GUI installed.

The HTML report is generated in the <project-dir>/e<NNN>/report directory and named as advisor-report.html.

For the Mandelbrot application, the report is located in the ./cpu2gpu_offload_modeling/e000/report/. The report location is also printed in the Offload Modeling CLI output:

…
Info: Results will be stored at '/localdisk/cpu2gpu_offload_modeling/e000/pp000/data.0'. See interactive HTML report in '/localdisk/adv_offload_modeling/e000/report'
…
advisor: The report is saved in '/localdisk/cpu2gpu_offload_modeling/e000/report/advisor-report.html'.

Interactive HTML report structure is similar to the result opened in the Intel Advisor GUI. Offload Modeling report consists of several tabs that are report summary, detailed performance metrics, sources, logs.

Examine Performance Speedup on the Target GPU

By default, the Summary tab opens first. It shows the summary of the modeling results:

Top Metrics and Program Metrics panes show per-program performance estimations and comparison with performance on the baseline device. For the native C++ Mandelbrot application, by offloading one loop, you can achieve 11.1x the estimated speedup. The execution time on the baseline device is 0.22 seconds, and the estimated execution time on the target device is 0.05 seconds.
Offload Bounded By pane shows factors that limit the performance of code regions. The native C++ Mandelbrot application is mostly bounded by compute.
Top Offloaded pane shows top five regions recommended for offloading to the selected target device. For the native C++ Mandelbrot application, one loop in serial_mandelbrot at mandelbrot.cpp:56 is recommended for offloading.
Top Non-Offloaded pane shows top five non-offloaded regions and the reasons why they are not recommended to be run on the target device. For the native C++ Mandelbrot application, one loop in stbi_zilb_compress at stb_image_write.h:885 is not recommended for offloading. Its estimated execution time on the target device is higher than the measured execution time on the baseline device.

NOTE:
Data in the Top Non-Offloaded pane is available only for the CPU-to-GPU modeling workflow.

To get more details, switch to the Accelerated Regions tab.

Examine the Code Regions table that visualizes the modeling results for each code region: it contains performance metrics measured on the baseline CPU or GPU platform and metrics estimated while modeling the application behavior on the target GPU platform. You can expand data columns and scroll the grid to see more metrics.
View the application source code with speedup and execution time for each loop/function. Do one of the following:
- Click a region of interest in the Code Regions table and check the Source tab below the table.
- Right-click the region of interest in the Code Regions table, select View Source to switch to the Source View tab of the report where you can examine the full source code.

Alternative Steps

You can run the Offload Modeling perspective using command line collection presets with one of the accuracy levels: low, medium, or high. The higher accuracy value you choose, the higher runtime overhead is added but more accurate results are produced.

Run Offload Modeling with Low Accuracy

To collect performance data and model application performance with low accuracy, run one of the following commands depending on a workflow:

To run the CPU-to-GPU modeling for the native Mandelbrot application:


advisor --collect=offload --config=gen12_tgl --accuracy=low --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

To run the GPU-to-GPU modeling for the SYCL implementation of Mandelbrot application:


advisor --collect=offload --gpu --config=gen12_tgl --accuracy=low --project-dir=./gpu2gpu_offload_modeling -- ./src/Mandelbrot

NOTE:

You can change a target GPU for modeling by providing a different value to the --config option. See config for details and a full list of options.

For the low accuracy, Intel Advisor runs the following analyses:

Survey
Characterization with trip count and FLOP collection
Performance Modeling

To see analysis executed for the low accuracy, you can type the execution command with the --dry-run option:


advisor --collect=offload --config=gen12_tgl --accuracy=low --dry-run --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

To generate commands for the GPU-to-GPU modeling, add the --gpu option to the command above.

The commands will be printed to the terminal:

advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1
 advisor --collect=tripcounts --flop --stacks --auto-finalize --target-device=gen12_tgl --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1
 advisor --collect=projection --no-assume-dependencies --config=gen12_tgl --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

Run Offload Modeling with High Accuracy

To collect performance data and model application performance with high accuracy, run one of the following commands depending on a workflow:

To run the CPU-to-GPU modeling for the native Mandelbrot application:


advisor --collect=offload --config=gen12_tgl --accuracy=high --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

To run the GPU-to-GPU modeling for the SYCL implementation of Mandelbrot application:


advisor --collect=offload --gpu --config=gen12_tgl --accuracy=high --project-dir=./gpu2gpu_offload_modeling -- ./src/Mandelbrot

NOTE:

You can change a target GPU for modeling by providing a different value to the --config option. See config for details and a full list of options.

For the low accuracy, Intel Advisor runs the following analyses:

Survey
Characterization with trip count and FLOP collection, cache simulation, and data transfer modeling with attributing memory objects and tracking accesses to stack memory
Dependencies
Performance Modeling

Note: The Dependencies analysis is only relevant to the CPU-to-GPU modeling.

To see analyses executed for the high accuracy, you can type the execution command with the --dry-run option:


advisor --collect=offload --config=gen12_tgl --accuracy=high--dry-run --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

To generate commands for the GPU-to-GPU modeling, add the --gpu option to the command above.

The commands will be printed to the terminal:

advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1
advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=medium --target-device=gen12_tgl --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1
advisor --collect=dependencies --filter-reductions --loop-call-count-limit=16 --select=markup=gpu_generic --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1
advisor --collect=projection --config=gen12_tgl --project-dir=./cpu2gpu_offload_modeling -- ./release/Mandelbrot 1

NOTE:

If you want to analyze an MPI application or an application with specific limitations, such as collecting floating-point/integer operations and trip counts data for certain application parts only with collection control APIs you should run the per-analysis commands one-by-one. The command line collection preset does not support such applications. See Run Offload Modeling Perspective from Command Line for details.

Key Take-Aways

You can use different Offload Modeling perspective workflows depending on your application:
- Run the CPU-to-GPU modeling to analyze native C, C++, Fortran applications running on a CPU or SYCL, OpenMP target, OpenCL™ applications offloaded to a CPU.
- Run the GPU-to-GPU modeling to analyze SYCL, OpenMP* target, OpenCL™ applications running on a GPU.
To run the Offload Modeling perspective from CLI, you can use one of the command line collection presets. The presets allow you to run the perspective with a specific accuracy level using a single command.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

Intel® Advisor Cookbook

Model Performance of a C++ App Ported to a GPU

Scenario

Ingredients

Prerequisites

Explore Offload Opportunities

View the Results

Examine Performance Speedup on the Target GPU

Alternative Steps

Key Take-Aways

See Also