Intel® Advisor User Guide

ID 766448
Date 10/31/2024

Model MPI Application Performance on GPU

You can model your MPI application performance on a target graphics processing unit (GPU) device to determine whether you can get a performance speedup from offloading the application to the GPU.

The Offload Modeling perspective of the Intel® Advisor includes the following stages:

  1. Collecting the baseline performance data on a host device with the Survey, Characterization (Trip Counts, FLOP), and/or Dependencies analyses. You can collect data for one or more MPI ranks, where each rank corresponds to an MPI process.
  2. Modeling application performance on a target device with the Performance Modeling analysis. You can model performance for only one rank at a time. You can run the Performance Modeling analysis several times for different analyzed ranks to examine the potential performance differences between them, but this topic does not cover that case.
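
For reference, the two stages correspond to the following basic single-process command sequence. This is a minimal sketch using options shown later in this topic, with the same placeholder application path, project directory, and target configuration:

advisor --collect=survey --project-dir=./advi_results -- ./mpi_sample
advisor --collect=tripcounts --flop --project-dir=./advi_results -- ./mpi_sample
advisor --collect=projection --config=xehpg_512xve --project-dir=./advi_results

The rest of this topic shows how to wrap the collection commands in an MPI launcher and select the rank to model.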

Model Performance of MPI Application

Prerequisite: Set up environment variables to enable Intel Advisor CLI.
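
For example, on a Linux* OS you can source the oneAPI environment script. The installation path below is the default and is an assumption; adjust it to your system:

source /opt/intel/oneapi/setvars.sh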

In the commands below:

  • Data is collected remotely to a shared directory.
  • The analyses are performed for an application running in four processes.
  • Path to an application executable is ./mpi_sample.

    Note: In the commands below, make sure to replace the application path and name before executing a command. If your application requires additional command line options, add them after the executable name.

  • Path to an Intel Advisor project directory is ./advi_results.
  • Performance is modeled for the default Intel® Arc™ graphics code-named Alchemist (xehpg_512xve configuration).

This example shows how to run Offload Modeling to model performance for rank 1 of the MPI application. It uses the gtool option of the Intel® MPI Library to collect performance data on a baseline CPU. For other collection options, see Analyze MPI Applications.

  1. Optional, but recommended: Generate preconfigured command lines for your application using the --dry-run option. For example, generate the command lines using Intel Advisor CLI:
    advisor --collect=offload --dry-run --project-dir=./advi_results -- ./mpi_sample

    After you run it, a list of analysis commands for the specified accuracy level is printed to the terminal/command prompt. For the command above, the commands correspond to the default medium accuracy:

    advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results -- ./mpi_sample
    advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=xehpg_512xve --project-dir=./advi_results -- ./mpi_sample
    advisor --collect=projection --no-assume-dependencies --config=xehpg_512xve --project-dir=./advi_results

    You need to modify the printed commands to use an MPI launcher with the MPI-specific syntax. See Analyze MPI Applications for syntax details.

  2. Collect Survey data for rank 1 into the shared ./advi_results project directory.
    mpirun -gtool "advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results:1" -n 4 ./mpi_sample
  3. Collect Trip Counts and FLOP data for rank 1.
    mpirun -gtool "advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=xehpg_512xve --project-dir=./advi_results:1" -n 4 ./mpi_sample
  4. If you did not collect data to a shared location and need to copy the data to the local system to view the results, do it now.
  5. Model performance for rank 1 of the MPI application, which you ran the analyses for.
    advisor --collect=projection --config=xehpg_512xve --mpi-rank=1 --project-dir=./advi_results

    You can model performance for only one rank at a time. The results for the specified rank are generated in the corresponding ./advi_results/rank.1 directory.

  6. If you did not collect data to a shared location and need to copy the modeling results to the local system to view them, do it now.
  7. On a local system, view the results with your preferred method.
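
For example, if the data was collected on a remote system, a minimal sketch of copying the rank results and opening them in the GUI might look as follows; the host name and shared path are hypothetical:

scp -r user@cluster:/shared/advi_results ./advi_results
advisor-gui ./advi_results/rank.1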

Configure Performance Modeling for MPI Application

By default, Offload Modeling is optimized to model performance for a single-rank MPI application. For multi-rank MPI applications, you can apply additional configuration and settings to adjust the performance model for your specific hardware or application. You can adjust the number of MPI ranks to run per GPU tile and/or exclude MPI time from the report.

In the commands below:

  • Data is collected remotely to a shared directory.
  • The analyses are performed for an application running in four processes.
  • Path to an application executable is ./mpi_sample.

    Note: In the commands below, make sure to replace the application path and name before executing a command. If your application requires additional command line options, add them after the executable name.

  • Path to an Intel Advisor project directory is ./advi_results.
  • Performance is modeled for Intel® Arc™ graphics code-named Alchemist (xehpg_512xve configuration).

Change the Number of MPI Processes per GPU Tile

Prerequisite: Set up environment variables to enable Intel Advisor CLI.

NOTE:
Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.

By default, Offload Modeling assumes that one MPI process, or rank, is mapped to one GPU tile. You can configure the performance model to adjust the number of MPI ranks to run per GPU tile to match your target device configuration.

To do this, you need to set the number of tiles per MPI process by scaling the Tiles_per_process target device parameter in a command line or a TOML configuration file. If you want to model performance for the Intel® Arc™ graphics code-named Alchemist, which is XeHPG 256 or XeHPG 512 configuration in Offload Modeling targets, use the Stack_per_process parameter. The parameter sets a fraction of a GPU tile that runs a single MPI process. For example, if you want to offload your MPI application with 8 processes to a target GPU device with 4 tiles, you need to adjust the performance model to run 2 MPI processes per tile, or to use 0.5 tile per process.
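
For example, for the scenario above (8 MPI processes offloaded to a 4-tile device), you would pass the value 0.5 when scaling the parameter. The commands below are a sketch of the flow detailed in the following steps, using the same placeholder paths:

advisor-python $APM/collect.py ./advi_results --set-parameter scale.Tiles_per_process=0.5 --dry-run -- ./mpi_sample
advisor --collect=projection --project-dir=./advi_results --set-parameter scale.Tiles_per_process=0.5 --mpi-rank=1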

The number of tiles per process you set automatically adjusts:

  • the number of execution units (EU)
  • SLM, L1, L3 sizes and bandwidth
  • memory bandwidth
  • PCIe* bandwidth

The parameter accepts values from 0.01 to 12.0. Consider the following value examples:

Tiles_per_process/Stack_per_process Value    Number of MPI Ranks per Tile
1.0 (default)                                1
12.0 (maximum)                               1/12
0.25                                         4
0.125                                        8

To run Offload Modeling with a custom tiles-per-process value, you need to scale the parameter when running the analyses. The change is applied one time, only to the analysis you run it with. The commands below use the Tiles_per_process parameter for scaling. Replace it with Stack_per_process if needed.

  1. Generate preconfigured command lines for your application with the --set-parameter option to change the number of tiles per process. Use the --dry-run option of the collect.py script so that the generated commands adjust the cache configuration to the scaled parameter.

    For example, to generate commands for the ./advi_results project and model performance with 0.25 tiles per process, which corresponds to four MPI ranks per tile:

    advisor-python $APM/collect.py ./advi_results --set-parameter scale.Tiles_per_process=0.25 --dry-run -- ./mpi_sample

    After you run it, a list of analysis commands for the specified accuracy level is printed to the terminal/command prompt, similar to the following:

    advisor --collect=survey --project-dir=./advi_results --static-instruction-mix -- ./mpi_sample
    advisor --collect=tripcounts --project-dir=./advi_results --flop --ignore-checksums --data-transfer=medium --stacks --profile-jit --cache-sources --enable-cache-simulation --cache-config=8:64w:4k/1:192w:768k/1:4w:2m -- ./mpi_sample
    python $APM/collect.py ./advi_results  -m generic
    advisor --collect=dependencies --project-dir=./advi_results --filter-reductions --loop-call-count-limit=16 --ignore-checksums -- ./mpi_sample
  2. Copy the generated commands to your preferred text editor and modify them for the MPI-specific syntax. You need to add the following:
    • MPI launcher name and (optionally) gtool option for Intel® MPI Library
    • Number of MPI processes to launch
    • If you use gtool: MPI ranks to analyze

    See Analyze MPI Applications for syntax details.

    NOTE:
    You can skip the mark-up and Dependencies analysis steps (the last two commands) because they add high overhead. See Check How Assumed Dependencies Affect Modeling for details.
  3. Run the modified commands for the Survey, Trip Counts, and (optionally) Dependencies analyses one by one. For example, to run Survey and Trip Counts for rank 1:
    mpirun -gtool "advisor --collect=survey --static-instruction-mix -- ./mpi_sample --project-dir=./advi_results:1" -n 4 ./mpi_sample
    mpirun  -gtool "advisor --collect=tripcounts --flop --ignore-checksums --data-transfer=medium --stacks --profile-jit --cache-sources --enable-cache-simulation --cache-config=8:64w:4k/1:192w:768k/1:4w:2m --project-dir=./advi_results:1" -n 4 ./mpi_sample
  4. Run the performance modeling with the number of tiles per MPI process specified using the --set-parameter option. For example, to model performance for rank 1:
    advisor --collect=projection --project-dir=./advi_results --set-parameter scale.Tiles_per_process=0.25 --mpi-rank=1
    NOTE:
    Make sure to specify the same value for the --set-parameter scale.Tiles_per_process as for the dry-run step.

    The results for the specified rank are generated in the corresponding ./advi_results/rank.1 directory. You can transfer them to the development system, if needed, and view the results there.

When you open the result in the Intel Advisor GUI or an interactive HTML report, the tiles per process or stack per process parameter appears in the Modeling Parameters pane with the value you set. The parameter is read-only. Notice that the tiles per process or stack per process parameter shows the value per process, while other parameters in the pane show values per device.

Ignore MPI Time

Prerequisite: Set up environment variables to enable Intel Advisor CLI.

For multi-rank MPI workloads, time spent in MPI runtime can differ from rank to rank, which may cause significant performance imbalance. Because of this, the whole application time and Offload Modeling results may be different from rank to rank. If MPI time is large and differs between ranks, and the MPI code does not include many computations, you can exclude time spent in MPI routines from the analysis so that it does not affect modeling results.

  1. Collect Survey, Trip Counts, and (optionally) Dependencies data for your application. See Analyze MPI Applications for details.
  2. Run the performance modeling with time in MPI calls ignored using the --ignore=MPI option.
    advisor --collect=projection --project-dir=./advi_results --ignore=MPI --mpi-rank=1

    The results are generated in a ./advi_results/rank.1 directory. You can transfer them to the development system and view the results.

In the report generated, all per-application performance modeling metrics are calculated based on application self-time, with time spent in MPI calls excluded from the analysis. This should make the modeling results more consistent across ranks.

NOTE:
This option affects only metrics for the whole program in the Summary tab. Metrics for individual regions are not recalculated.

View Results

Intel Advisor saves collection results into subdirectories for each rank analyzed under the project directory specified with --project-dir. The modeling results are available only for the ranks that you ran the Performance Modeling analysis for (for example, as specified with the --mpi-rank option).
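
For example, for the commands in this topic that analyze and model rank 1, the project directory layout might look similar to the following sketch; the exact e<NNN> and p<NNN> subdirectory names vary between runs:

./advi_results/
  rank.1/
    e000/
      report/     interactive HTML report
    p000/
      data.0/     CSV reports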

To view the performance or dependency results collected for a specific rank, you can do one of the following.

View Results in GUI

From the Intel Advisor GUI, open a result project file *.advixeproj that resides in the <project-dir>/rank.<n> directory.

You can also open the GUI from command line:

advisor-gui ./advi_results/rank.1
NOTE:
If you used --no-auto-finalize when collecting data, make sure to set paths to application binaries and sources before viewing the result so that Intel Advisor can finalize it properly.
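
For example, a minimal sketch of passing source and binary search paths when collecting data with --no-auto-finalize, so that the result can be finalized properly later; the directory layout is hypothetical, and the --search-dir syntax should be verified against your Intel Advisor version:

mpirun -gtool "advisor --collect=survey --no-auto-finalize --search-dir src:r=./src --search-dir bin:r=./ --project-dir=./advi_results:1" -n 4 ./mpi_sample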

View Results in Command Line

After you run the Performance Modeling analysis, the summary result of the modeling is printed to a terminal/command prompt. Examine the data to learn the estimated speedup and top five offloaded regions.

View Results in an Interactive HTML Report

Open the interactive advisor-report HTML report generated in the respective rank directory at <project-dir>/rank.<n>/e<NNN>/report, or examine the set of CSV reports at <project-dir>/rank.<n>/p<NNN>/data.0.

See Also