Intel® VTune™ Profiler Performance Analysis Cookbook

ID 766316
Date 9/05/2023
Public

Using the Command-Line Interface to Analyze the Performance of a SYCL* Application running on a GPU (NEW)

This recipe shows how to use the command-line interface (CLI) in Intel® VTune™ Profiler to analyze the performance of a SYCL application offloaded to an Intel GPU. The recipe also describes how to customize your reports with the collected data.

Intel® VTune™ Profiler provides a command-line interface (the vtune tool) for remote analysis, scripted commands, and performance regression checks that monitor software performance over time. The CLI offers an extensive set of options that cover almost every task you can perform through the GUI. You can start an analysis from the command line (as a background task or on a remote system) and then view the result or generate a report.

This recipe explores how you can use the CLI efficiently to generate reports on hotspots for these purposes:

  • Explore hotspots on the CPU and GPU sides by running the gpu-offload and gpu-hotspots analyses.

  • View the hottest GPU computing tasks annotated with:

    • Execution time
    • Data transfers
    • Work-group sizes
    • SIMD width
    • Average GPU hardware metrics

  • Generate Source/Assembly code views to analyze instructions that may have contributed to performance issues.

Here are the ingredients and instructions you need to explore efficient CLI use for GPU performance analysis.

Ingredients

Here are the minimum hardware and software requirements for this performance analysis.

Build and Compile the SYCL Application

  1. Go to the sample directory.

    cd <sample_dir>/VtuneProfiler/matrix_multiply_vtune

  2. The multiply.cpp file in the src directory contains several versions of matrix multiplication. Select a version by editing the corresponding #define MULTIPLY line in multiply.hpp.

  3. Compile your sample application:

    cmake . && make

    This command generates a matrix.icpx -fsycl executable.

    To delete the program, type:

    make clean

    This command removes the executable and object files that were created by the make command.

Ensure Prerequisites for GPU Analyses

Complete these steps before you run the GPU Offload Analysis or the GPU Compute/Media Hotspots Analysis.

  1. Prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.

  2. Set up environment variables for Intel software tools:

    source $ONEAPI_ROOT/setvars.sh
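After sourcing setvars.sh, it can help to confirm that the tools are reachable from the current shell before you start a collection. The following is a minimal sketch and not part of the original recipe: check_tools is a hypothetical helper name, and icpx is assumed to be the compiler used to build the sample.

```shell
# Hypothetical helper: report whether each named tool is reachable on PATH.
# Exits with status 1 if at least one tool is missing, 0 otherwise.
check_tools() (
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  exit "$missing"
)

# Usage (after sourcing setvars.sh):
#   check_tools vtune icpx || echo "oneAPI environment is not set up in this shell"
```

The function body runs in a subshell, so its variables do not leak into the calling shell.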

Run GPU Offload Analysis on the SYCL Application

Use the GPU Offload Analysis as a starting point to identify if an application is CPU or GPU bound. Explore GPU offload efficiency through data transfer analysis and find performance-critical kernels for further analysis and optimization.

Run GPU Offload Analysis

In the CLI, type:

vtune -collect gpu-offload -r ./result_gpu-offload -- ./matrix.icpx -fsycl

By default, VTune Profiler generates a summary report after collecting data. This report includes information on the following fields:

  • Elapsed time
  • GPU utilization information
  • Information about the hottest computing tasks
  • Recommendations

To see the summary report, type:

vtune -report summary -r ./result_gpu-offload

If you do not need the summary report immediately after data collection, suppress it with the -no-summary option:

vtune -collect gpu-offload -no-summary -r ./result_gpu-offload -- ./matrix.icpx -fsycl

NOTE:
Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.
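When you repeat a collection, reusing a result directory such as ./result_gpu-offload can be a problem, because VTune Profiler generally declines to collect into a directory that already holds a result. One way around this, sketched here under that assumption, is to pick an unused directory name before each run. fresh_result_dir is a hypothetical helper, and the _1, _2 suffix scheme is only an illustration.

```shell
# Hypothetical helper: print the first path of the form
# <base>, <base>_1, <base>_2, ... that does not exist yet.
fresh_result_dir() (
  base=$1
  candidate=$base
  n=1
  while [ -e "$candidate" ]; do
    candidate="${base}_${n}"
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
)

# Usage (same collection command as above, with a generated directory name):
#   result=$(fresh_result_dir ./result_gpu-offload)
#   vtune -collect gpu-offload -no-summary -r "$result" -- ./matrix.icpx -fsycl
#   vtune -report summary -r "$result"
```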

Generate Additional Reports to View Collected Data

  • CPU Hotspots Report

    This report displays a list of executed functions with CPU Time metrics, module names, source file paths, and other parameters. The report lists the hottest program units first, starting with the most performance-critical unit. Use the -column, -filter, and -limit options to select columns, filter rows, and limit the number of rows in the tabular view:

    vtune -report hotspots -r ./result_gpu-offload

  • CPU Hotspots Report Filtered by Module and Grouped by Function

    Use the -filter option to focus on a specific part of the report, such as a particular module. You can then use the -group-by option to group the results by a specific criterion.

    vtune -report hotspots -r ./result_gpu-offload -group-by=function -filter module=matrix.icpx -fsycl -q

    You can group the generated data in several ways, for example by function name, module, source file path, or computing task.

    To see available groupings for a specific result, type:

    vtune -report hotspots -r ./result_gpu-offload -group-by=?

  • CPU Hotspots Report Sorted by Order

    Use the -sort-desc and -sort-asc options to sort hotspot data in descending or ascending order. You can specify an order for up to three columns.

    vtune -report hotspots -r result_gpu-offload -group-by module -sort-desc="CPU Time:Execution" -q

    Here is another example:

    vtune -report hotspots -r result_gpu-offload -group-by module -sort-asc="CPU Time:Idle" -q

    To see available columns for a specific result, type:

    vtune -report hotspots -r ./result_gpu-offload -column=?

    The report data can contain columns such as CPU Time:Self, Module, and Source File.

  • Report of Top 'n' Time-Intensive Program Modules

    Use the -limit option to see information about the top 'n' hotspots. For example, to see details about the five most time-intensive program modules in your application, type:

    vtune -report hotspots -r result_gpu-offload -group-by module -sort-desc="CPU Time" -limit=5 -q

  • Hotspots Report Grouped by Computing Task (offloaded on GPU) with Transfer Columns

    This command displays hotspots information grouped by GPU computing task and also lists details about transfer sizes and transfer times between CPU and GPU:

    vtune -report hotspots -r ./result_gpu-offload -group-by=computing-task -column=Transfer -q

    The report contains data transfers that are attributed to the respective computing task.

  • Hotspots Report Grouped by GPU Offload Computing Task with Time Columns

    This command displays hotspots information grouped by offload computing task and lists the time-related columns, including transfer times between the CPU and GPU:

    vtune -report hotspots -r ./result_gpu-offload -group-by=computing-task-offload -column='Time' -q
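The report commands above differ mainly in the grouping, so they lend themselves to scripting. The following sketch writes one CSV hotspots report per grouping, using only options that appear in this recipe. export_grouped_reports, the VTUNE override variable, and the hotspots_by_<grouping>.csv file naming are assumptions made for illustration.

```shell
# Hypothetical helper: write one CSV hotspots report per grouping into <out_dir>.
# Arguments: <result_dir> <out_dir> <grouping>...
# VTUNE is an assumed override for the vtune binary; set VTUNE=echo for a dry run.
export_grouped_reports() (
  result=$1
  out_dir=$2
  shift 2
  mkdir -p "$out_dir" || exit 1
  for grouping in "$@"; do
    "${VTUNE:-vtune}" -report hotspots -r "$result" \
      -group-by="$grouping" -format=csv -csv-delimiter=comma \
      -report-output="$out_dir/hotspots_by_$grouping.csv" -q || exit 1
  done
)

# Usage:
#   export_grouped_reports ./result_gpu-offload ./reports function module computing-task
```

Running the helper with VTUNE=echo prints the arguments that would be passed to vtune without generating any report, which is a quick way to check the command lines.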

Run GPU Compute/Media Hotspots Analysis

Our next step is to run the GPU Compute/Media Hotspots analysis. This analysis can help us to further explore performance improvements for the GPU-bound application or its stages.

Run GPU Compute/Media Hotspots Analysis

In the CLI, type this command to run the analysis:

vtune -collect gpu-hotspots -r ./result_gpu-hotspots -- ./matrix.icpx -fsycl

To see the summary report, type:

vtune -report summary -r ./result_gpu-hotspots

Generate Report to View Computing Tasks with L3 Metrics

Use this command to generate a report that lists only L3 metrics for computing tasks:

vtune -report hotspots -r result_gpu-hotspots -group-by=computing-task -column='L3' -q

Run GPU Compute/Media Hotspots Analysis with Dynamic Instruction Count and SIMD Utilization

Run the GPU Compute/Media Hotspots Analysis in the Characterization mode to collect data on dynamic instruction count and SIMD utilization:

vtune -collect gpu-hotspots -knob characterization-mode=instruction-count -r ./result_gpu-hotspots_inst-count -- ./matrix.icpx -fsycl

Generate Reports to View Source and Assembly Metrics

  • Source Code for Specific Computing Tasks

    Use this command to get the source code for a specific computing task:

    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=gpu-source-line -column="Source","GPU Instructions Executed:Int32 & SP Float" -q

  • Assembly Code for Specific Computing Tasks

    Use this command to get the assembly code for a specific computing task:

    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=address -limit=5 -q

  • Save Report as CSV File

    Use the -report-output option to save the generated report as a file. To produce a .csv report, also specify the -format and -csv-delimiter options:

    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=address -limit=5 -report-output=result.csv -format=csv -csv-delimiter=comma -q
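Once a report is saved as a CSV file, standard text tools can post-process it. As a minimal sketch, the helper below prints the values of one named column. It assumes that the first line of the file is the header and that no field contains the delimiter, which may not hold for every report (for example, names that include commas). csv_column is a hypothetical name.

```shell
# Hypothetical helper: print the values of one named column from a
# delimiter-separated report.
# Arguments: <file> <column name> [delimiter, default ","]
# Assumes line 1 is the header and no field contains the delimiter.
csv_column() (
  file=$1
  column=$2
  delim=${3:-,}
  awk -F "$delim" -v want="$column" '
    NR == 1 {
      # Locate the requested column in the header row.
      for (i = 1; i <= NF; i++) if ($i == want) col = i
      if (!col) { print "column not found: " want > "/dev/stderr"; exit 1 }
      next
    }
    { print $col }
  ' "$file"
)

# Usage (take the column name from the header line of your own report):
#   csv_column result.csv "<column name>"
```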

Run Custom Analysis with GPU Programming API Statistics

To get a focused analysis of timing and statistics related to GPU compute kernels, follow the GPU Compute/Media Hotspots analysis with a custom analysis that collects GPU Programming API statistics.

The kernel data available through this collection is similar to the data you get from the CLIntercept tool (with the DevicePerformanceTiming option enabled) or from the nvprof tool in Summary mode.

Collect GPU Programming API Statistics

In the command line, type:

vtune -collect-with runss -knob collect-programming-api=true -no-summary -r ./result_gpu-programming-api -- ./matrix.icpx -fsycl

Generate Report to View Timing and Statistics for GPU Compute Kernels

This command generates a report that lists timings and instance counts for computing tasks. The data is sorted by Total Time in descending order.

vtune -report hotspots -group-by=source-computing-task -column="Total Time,Average Time,Instance Count" -sort-desc="Total Time" -r ./result_gpu-programming-api/ -q
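Timing values such as Total Time can feed the kind of performance regression check mentioned at the start of this recipe. The sketch below only compares two numbers that you supply, for example the Total Time of the same computing task read from a baseline report and from a current report. is_regression is a hypothetical helper, and the 5 percent default tolerance is an arbitrary choice for illustration.

```shell
# Hypothetical helper: succeed (exit status 0) when <current> exceeds
# <baseline> by more than <tolerance_pct> percent (default 5).
is_regression() (
  baseline=$1
  current=$2
  tolerance_pct=${3:-5}
  awk -v b="$baseline" -v c="$current" -v t="$tolerance_pct" \
    'BEGIN { exit !(c > b * (1 + t / 100)) }'
)

# Usage (the two values are read from your own reports):
#   if is_regression "$baseline_total_time" "$current_total_time" 10; then
#     echo "Total Time regressed by more than 10%"
#   fi
```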

Discuss this recipe in the VTune Profiler developer forum.