Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference

ID 767253
Date 6/24/2024
Public

Hardware Profile-Guided Optimization

Hardware Profile-Guided Optimization (HWPGO) is an alternative to traditional Instrumented Profile-Guided Optimization (IPGO).

Traditional IPGO requires a first compilation phase to generate a binary with instrumentation to track execution counts based on a training run.

With HWPGO, this instrumentation is not needed. Instead, the optimized binary's execution is sampled on Performance Monitoring Unit (PMU) events using a tool such as Linux perf or SEP, and a profile is generated from the PMU-based data and debug info.

A major benefit of HWPGO over IPGO is that the binary used for training can be highly optimized, and collection can occur in a production environment.

Another benefit is that the PMU can provide new types of hardware introspection not possible with software instrumentation. For example, the 2024.0 compiler has support for unpredictable branch profiles. The compiler can sometimes use such a profile to prefer Conditional Move (CMOV) to conditional branches.

Execution Frequency Feedback

  1. Compile with full optimization plus -fprofile-sample-generate.

    While HWPGO does not require instrumentation, it does require DWARF debug information on Linux and Windows. Special care must be taken on Windows to produce a binary with usable DWARF debug info. In particular, the compilation must include DWARF line number information and the lld-link linker must be used with DWARF-enabling flags to preserve this information. On Linux and Windows, -fprofile-sample-generate also enables additional debug information that may improve profile quality. To simplify this process, the use of -fprofile-sample-generate is recommended.

    In this example, -fprofile-sample-generate is added to the application's existing optimization flags, -xCORE-AVX512 -Ofast:

    icx -xCORE-AVX512 -Ofast -fprofile-sample-generate app.c -o app

    By default, -fprofile-sample-generate does not affect optimizations and should not affect execution speed. Refer to the -fprofile-sample-generate option description for more details.

    By default, debug info is embedded in object/executable files. To split debug info from those files, use -fprofile-sample-generate together with -gsplit-dwarf -fprofile-dwo-dir=<dir> to specify where to store the split .dwo files.

    On Windows, the lld linker must be used. The icx driver will ensure this when -fprofile-sample-generate is specified. Use lld-link /fprofile-sample-generate when invoking the linker directly.

  2. Create a PMU-based profile using SEP or Perf.

    Linux:

    perf record -o app.perf.data -b -c 1000003 -e br_inst_retired.near_taken:uppp -- ./app

    Windows:

    sep -start -out app.tb7 -ec BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=1000003:pdir:lbr:USR=YES -lbr no_filter:usr -perf-script ip,brstack -app .\app.exe

    NOTE:
    The sep tool only includes samples for the executable directly launched by -app in -perf-script output. This means, for example, that invoking app.exe via a wrapper script or batch file will not include app.exe samples. This will be improved in the future.

    The PMU-based profile data is now in app.perf.data or app.tb7, and on Windows, a partial textual representation is available as app.perf.data.script.

    The sampling period shown above (1000003) may need to be tuned depending on the application's characteristics and execution duration. The period chosen for each event type must be specified to llvm-profgen with the --sample-period option.

  3. Use the PMU profile to create an LLVM profile. The profile describes how frequently source-level code locations were observed executing.

    Linux:

    llvm-profgen --perfdata app.perf.data --binary app --output app.freq.prof

    The process is the same on Windows, except you use the textual app.perf.data.script profile:

    llvm-profgen --perfscript app.perf.data.script --binary app.exe --output app.freq.prof

  4. If steps 2-3 were repeated, merge the resulting profiles. This is useful for training against multiple datasets. For example:

    llvm-profdata merge --sample run1.freq.prof run2.freq.prof run3.freq.prof --output app.freq.prof

  5. Recompile specifying the profile information to the compiler:

    icx -xCORE-AVX512 -Ofast app.c -o app -fprofile-sample-use=app.freq.prof

    You may add -fprofile-sample-generate to the above if additional feedback iterations are desirable.

  6. Optionally, repeat by jumping back to step 2.
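The Linux steps above can be strung together in a small shell script. The following is a minimal sketch rather than a supported workflow: the run helper, the hwpgo_freq function, and the APP, SRC, FLAGS, PERIOD, and HWPGO_RUN variables are illustrative names, not part of the compiler. Commands are only printed unless HWPGO_RUN=1 is set, and a single PERIOD variable keeps the perf sampling period and the value given to llvm-profgen --sample-period in sync.

```shell
#!/bin/sh
# Sketch of the Linux execution-frequency feedback loop (steps 1, 2, 3, 5).
# All variable and function names below are illustrative assumptions.

APP="${APP:-app}"                        # output binary name
SRC="${SRC:-app.c}"                      # source file
FLAGS="${FLAGS:--xCORE-AVX512 -Ofast}"   # the application's existing flags
PERIOD="${PERIOD:-1000003}"              # sampling period; tune per application

# Print a command, and execute it only when HWPGO_RUN=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${HWPGO_RUN:-0}" = "1" ]; then
    "$@"
  fi
}

hwpgo_freq() {
  # Step 1: optimized build with the debug info HWPGO needs.
  # $FLAGS is intentionally unquoted so it splits into separate options.
  run icx $FLAGS -fprofile-sample-generate "$SRC" -o "$APP" || return 1

  # Step 2: sample taken branches with branch stacks (-b).
  run perf record -o "$APP.perf.data" -b -c "$PERIOD" \
      -e br_inst_retired.near_taken:uppp -- "./$APP" || return 1

  # Step 3: convert the PMU data into a source-level LLVM profile,
  # passing the same period that was used for collection.
  run llvm-profgen --perfdata "$APP.perf.data" --binary "$APP" \
      --output "$APP.freq.prof" --sample-period "$PERIOD" || return 1

  # Step 5: rebuild using the profile.
  run icx $FLAGS "$SRC" -o "$APP" -fprofile-sample-use="$APP.freq.prof"
}

hwpgo_freq
```

Running the script as-is only lists the four commands, which makes it easy to review them before setting HWPGO_RUN=1. Step 4 (merging) is omitted because this sketch performs a single training run.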

Execution Frequency and Branch Mispredict Feedback

  1. Compile with full optimization plus -fprofile-sample-generate:

    icx -xCORE-AVX512 -Ofast -fprofile-sample-generate app.c -o app

  2. Create a PMU-based profile using SEP or Perf.

    The compiler can take advantage of both instruction execution and branch mispredict profiles. The two profiles can be collected simultaneously.

    Linux:

    perf record -o app.perf.data -b -c 1000003 -e br_inst_retired.near_taken:uppp,br_misp_retired.all_branches:upp -- ./app

    Windows:

    sep -start -out app.tb7 -ec BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=1000003:pdir:lbr:USR=YES,BR_MISP_RETIRED.ALL_BRANCHES:PRECISE=YES:SA=1000003:lbr:USR=YES -lbr no_filter:usr -perf-script event,ip,brstack -app .\app.exe

    The PMU-based profile data is now in app.perf.data or app.tb7, and on Windows, a partial textual representation is available as app.perf.data.script.

    NOTE:
    Note the additional event field requested of sep in the -perf-script option. This event name field is required so that llvm-profgen can differentiate between PMU events.

  3. Use the single PMU profile to create two types of LLVM profiles. One will be the traditional execution frequency profile, and the other will be a profile of mispredicted branches.

    Linux:

    llvm-profgen --perfdata app.perf.data --binary app --output app.freq.prof --sample-period 1000003 --perf-event br_inst_retired.near_taken:uppp
    llvm-profgen --perfdata app.perf.data --binary app --output app.misp.prof --sample-period 1000003 --perf-event br_misp_retired.all_branches:upp --leading-ip-only

    The process is the same on Windows, except you will use SEP event names and the textual app.perf.data.script profile:

    llvm-profgen --perfscript app.perf.data.script --binary app.exe --output app.freq.prof --sample-period 1000003 --perf-event BR_INST_RETIRED.NEAR_TAKEN:pdir
    llvm-profgen --perfscript app.perf.data.script --binary app.exe --output app.misp.prof --sample-period 1000003 --perf-event BR_MISP_RETIRED.ALL_BRANCHES --leading-ip-only

    You should now have two source-level profiles: app.freq.prof and app.misp.prof.

  4. If steps 2-3 were repeated, merge the resulting profiles. This is useful for training against multiple datasets. For example:

    llvm-profdata merge --sample run1.freq.prof run2.freq.prof run3.freq.prof --output app.freq.prof

    NOTE:
    The frequency and mispredict profiles should not be merged.

  5. Recompile specifying the profile information to the compiler.

    icx -xCORE-AVX512 -Ofast app.c -o app -fprofile-sample-use=app.freq.prof -mllvm -unpredictable-hints-file=app.misp.prof

    You may add -fprofile-sample-generate to the above if additional feedback iterations are desirable.

  6. Optionally, repeat by jumping back to step 2.
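The combined frequency and mispredict flow on Linux can be sketched the same way. This is an illustration under stated assumptions, not a supported script: run, hwpgo_misp, and the APP, SRC, FLAGS, PERIOD, FREQ_EVT, MISP_EVT, and HWPGO_RUN names are hypothetical. The event names are stored once so that the strings given to perf record -e and to llvm-profgen --perf-event cannot drift apart. Commands are only printed unless HWPGO_RUN=1 is set.

```shell
#!/bin/sh
# Sketch of the Linux frequency + branch-mispredict feedback loop.
# All variable and function names below are illustrative assumptions.

APP="${APP:-app}"
SRC="${SRC:-app.c}"
FLAGS="${FLAGS:--xCORE-AVX512 -Ofast}"
PERIOD="${PERIOD:-1000003}"

# Event names, reused for both collection and profile generation.
FREQ_EVT="br_inst_retired.near_taken:uppp"
MISP_EVT="br_misp_retired.all_branches:upp"

# Print a command, and execute it only when HWPGO_RUN=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${HWPGO_RUN:-0}" = "1" ]; then
    "$@"
  fi
}

hwpgo_misp() {
  # Step 1: optimized build with the debug info HWPGO needs.
  run icx $FLAGS -fprofile-sample-generate "$SRC" -o "$APP" || return 1

  # Step 2: collect both events in a single run.
  run perf record -o "$APP.perf.data" -b -c "$PERIOD" \
      -e "$FREQ_EVT,$MISP_EVT" -- "./$APP" || return 1

  # Step 3: one perf file, two LLVM profiles (kept separate, never merged
  # with each other).
  run llvm-profgen --perfdata "$APP.perf.data" --binary "$APP" \
      --output "$APP.freq.prof" --sample-period "$PERIOD" \
      --perf-event "$FREQ_EVT" || return 1
  run llvm-profgen --perfdata "$APP.perf.data" --binary "$APP" \
      --output "$APP.misp.prof" --sample-period "$PERIOD" \
      --perf-event "$MISP_EVT" --leading-ip-only || return 1

  # Step 5: rebuild with both profiles.
  run icx $FLAGS "$SRC" -o "$APP" -fprofile-sample-use="$APP.freq.prof" \
      -mllvm -unpredictable-hints-file="$APP.misp.prof"
}

hwpgo_misp
```

Only the mispredict invocation of llvm-profgen carries --leading-ip-only, matching the commands in step 3.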

Notes on Windows Support

The Intel® oneAPI DPC++/C++ Compiler provides an llvm-profgen tool that understands Common Object File Format (COFF) binaries with associated Debugging with Attributed Record Formats (DWARF) debug information. The -fprofile-sample-generate option ensures that this debug information is generated.

The Linux perf tool is unavailable on Windows, but Intel® VTune™ includes a sep tool that can perform the relevant Last Branch Records (LBR) sampling on hardware events on both Windows and Linux.

Notes on the llvm-profgen and llvm-profdata tools

To ensure that you use the versions of these tools corresponding to the product compiler, you may use the following to locate them:

icx --print-prog-name=llvm-profgen

On Windows, icx is an MSVC-compatible (cl-style) driver, so the option must be passed through /clang: as follows:

icx /nologo /clang:--print-prog-name=llvm-profgen

Alternatively, the --include-intel-llvm option to setvars scripts will place these tools in PATH.
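On Linux, a build script can wrap this lookup so that the tools matching the compiler are preferred. The following is a minimal sketch for a POSIX shell; find_llvm_tool, PROFGEN, and PROFDATA are hypothetical names. It falls back to a PATH search when icx is unavailable or does not print the path of an executable file.

```shell
#!/bin/sh
# Locate an LLVM tool, preferring the copy that ships with icx.
# find_llvm_tool is an illustrative helper, not part of the compiler.
find_llvm_tool() {
  tool="$1"
  if command -v icx >/dev/null 2>&1; then
    candidate="$(icx --print-prog-name="$tool" 2>/dev/null)"
    # Accept the answer only if it names an executable regular file.
    if [ -n "$candidate" ] && [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  fi
  # Fallback: whatever is on PATH (for example after setvars
  # --include-intel-llvm). Fails if the tool is not found.
  command -v "$tool"
}

PROFGEN="$(find_llvm_tool llvm-profgen)" || echo "llvm-profgen not found" >&2
PROFDATA="$(find_llvm_tool llvm-profdata)" || echo "llvm-profdata not found" >&2
```

The resulting PROFGEN and PROFDATA values can then be used in place of the bare tool names in the commands shown earlier.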