Hardware Profile-Guided Optimization
Hardware Profile-Guided Optimization (HWPGO) is an alternative to traditional Instrumented Profile-Guided Optimization (IPGO).
Traditional IPGO requires a first compilation phase to generate a binary with instrumentation to track execution counts based on a training run.
With HWPGO, this instrumentation is not needed. Instead, the optimized binary's execution is sampled on Performance Monitoring Unit (PMU) events using a tool such as Linux perf or SEP, and a profile is generated from the PMU-based data and debug info.
A major benefit of HWPGO over IPGO is that the binary used for training can be highly optimized, and collection can occur in a production environment.
Another benefit is that the PMU can provide new types of hardware introspection not possible with software instrumentation. For example, the 2024.0 compiler has support for unpredictable branch profiles. The compiler can sometimes use such a profile to prefer Conditional Move (CMOV) to conditional branches.
Execution Frequency Feedback
Compile with full optimization plus -fprofile-sample-generate.
While HWPGO does not require instrumentation, -fprofile-sample-generate is recommended to ensure that useful debug information is generated.
In this example, -fprofile-sample-generate is added to the application's existing optimization flags, -xCORE-AVX512 -Ofast:
icx -xCORE-AVX512 -Ofast -fprofile-sample-generate app.c -o app
By default -fprofile-sample-generate does not affect optimizations and should not affect execution speed. Refer to -fprofile-sample-generate for more details.
By default, debug info is embedded in object and executable files. To split debug info out of those files, combine -fprofile-sample-generate with -gsplit-dwarf -fprofile-dwo-dir=<dir> to specify where the split .dwo files are stored.
On Windows, the lld linker must be used. The icx driver will ensure this when -fprofile-sample-generate is specified. Use lld-link /fprofile-sample-generate when invoking the linker directly.
Create a PMU-based profile using SEP or Perf.
Linux:
perf record -o app.perf.data -b -c 1000003 -e br_inst_retired.near_taken:uppp -- ./app
Windows:
sep -start -out app.tb7 -ec BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=1000003:pdir:lbr:USR=YES -lbr no_filter:usr -perf-script ip,brstack -app .\app.exe
NOTE: The sep tool only includes samples for the executable directly launched by -app in -perf-script output. This means, for example, that invoking app.exe via a wrapper script or batch file will not include app.exe samples. This will be improved in the future.
The PMU-based profile data is now in app.perf.data or app.tb7, and on Windows, a partial textual representation is available as app.perf.data.script.
The sampling period shown above (1000003) may need to be tuned depending on the application's characteristics and execution duration. The period chosen for each event type must be specified to llvm-profgen with the --sample-period option.
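As a rough aid to tuning, the number of samples collected is approximately the number of event occurrences divided by the sampling period. The sketch below does that arithmetic in the shell. The helper names estimate_samples and period_for_target are hypothetical and not part of perf, sep, or llvm-profgen, and the event count is a value you would have to measure separately, so treat the results as estimates only.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of perf, sep, or llvm-profgen.

# estimate_samples EVENT_COUNT PERIOD
# One sample is taken roughly every PERIOD events, so the expected
# sample count is the integer quotient of the two.
estimate_samples() {
    echo $(( $1 / $2 ))
}

# period_for_target EVENT_COUNT TARGET_SAMPLES
# Period that would yield roughly TARGET_SAMPLES samples (never below 1).
period_for_target() {
    period=$(( $1 / $2 ))
    if [ "$period" -lt 1 ]; then
        period=1
    fi
    echo "$period"
}

# Example: an assumed 5 billion taken branches with the period used above.
estimate_samples 5000000000 1000003   # prints 4999
```

If the estimate is very small, a shorter period or a longer training run may be needed. Whatever period is used for collection is the value to pass to llvm-profgen.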
Use the PMU profile to create an LLVM profile. The profile describes how frequently source-level code locations were observed executing.
Linux:
llvm-profgen --perfdata app.perf.data --binary app --output app.freq.prof
The process is the same on Windows, except you use the textual app.perf.data.script profile:
llvm-profgen --perfscript app.perf.data.script --binary app.exe --output app.freq.prof
- If steps 2-3 occurred multiple times, merge profiles with something like llvm-profdata merge --sample run1.freq.prof run2.freq.prof run3.freq.prof --output app.freq.prof. This is useful for training against multiple datasets.
Recompile specifying the profile information to the compiler:
icx -xCORE-AVX512 -Ofast app.c -o app -fprofile-sample-use=app.freq.prof
You may add -fprofile-sample-generate to the above if additional feedback iterations are desirable.
- Optionally, repeat by jumping back to step 2.
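The Linux steps above can be strung together for several training inputs, as in the following sketch. It uses only the commands and options shown in this section. The names DRY_RUN, run, and train_and_merge are hypothetical, and the sketch assumes that ./app takes a training input file as its only argument, which may not match your application. By default the commands are only printed, so the script can be reviewed before anything is executed.

```shell
#!/bin/sh
# Sketch of the execution frequency workflow for several training inputs.
# DRY_RUN, run, and train_and_merge are hypothetical names for illustration.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# train_and_merge INPUT...
# Assumption: ./app accepts one training input file as its argument.
train_and_merge() {
    # Step 1: optimized build with debug info suitable for sampling.
    run icx -xCORE-AVX512 -Ofast -fprofile-sample-generate app.c -o app || return 1
    profiles=""
    i=0
    for input in "$@"; do
        i=$(( i + 1 ))
        # Step 2: sample taken branches for this input.
        run perf record -o "run$i.perf.data" -b -c 1000003 \
            -e br_inst_retired.near_taken:uppp -- ./app "$input" || return 1
        # Step 3: convert the PMU profile to an LLVM sample profile.
        run llvm-profgen --perfdata "run$i.perf.data" --binary app \
            --output "run$i.freq.prof" --sample-period 1000003 || return 1
        profiles="$profiles run$i.freq.prof"
    done
    # Merge the per-run profiles. $profiles is deliberately unquoted so
    # that it splits into one argument per profile.
    run llvm-profdata merge --sample $profiles --output app.freq.prof || return 1
    # Step 4: recompile using the merged profile.
    run icx -xCORE-AVX512 -Ofast app.c -o app -fprofile-sample-use=app.freq.prof
}
```

For example, train_and_merge small.dat large.dat would print one perf record and one llvm-profgen command per input, followed by a single merge and the final compile. Set DRY_RUN=0 to execute the commands instead of printing them.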
Execution Frequency and Branch Mispredict Feedback
Compile with full optimization plus -fprofile-sample-generate:
icx -xCORE-AVX512 -Ofast -fprofile-sample-generate app.c -o app
Create a PMU-based profile using SEP or Perf.
The compiler can take advantage of both instruction execution and branch mispredict profiles. The two profiles can be collected simultaneously.
Linux:
perf record -o app.perf.data -b -c 1000003 -e br_inst_retired.near_taken:uppp,br_misp_retired.all_branches:upp -- ./app
Windows:
sep -start -out app.tb7 -ec BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=1000003:pdir:lbr:USR=YES,BR_MISP_RETIRED.ALL_BRANCHES:PRECISE=YES:SA=1000003:lbr:USR=YES -lbr no_filter:usr -perf-script event,ip,brstack -app .\app.exe
The PMU-based profile data is now in app.perf.data or app.tb7, and on Windows, a partial textual representation is available as app.perf.data.script.
NOTE: The sep command above requests an additional event field. This event name field is required so that llvm-profgen can differentiate between PMU events.
Use the single PMU profile to create two types of LLVM profiles. One will be the traditional execution frequency profile, and the other will be a profile of mispredicted branches.
Linux:
llvm-profgen --perfdata app.perf.data --binary app --output app.freq.prof --sample-period 1000003 --perf-event br_inst_retired.near_taken:uppp
llvm-profgen --perfdata app.perf.data --binary app --output app.misp.prof --sample-period 1000003 --perf-event br_misp_retired.all_branches:upp --leading-ip-only
The process is the same on Windows, except you will use SEP event names and the textual app.perf.data.script profile:
llvm-profgen --perfscript app.perf.data.script --binary app.exe --output app.freq.prof --sample-period 1000003 --perf-event BR_INST_RETIRED.NEAR_TAKEN:pdir
llvm-profgen --perfscript app.perf.data.script --binary app.exe --output app.misp.prof --sample-period 1000003 --perf-event BR_MISP_RETIRED.ALL_BRANCHES --leading-ip-only
You should now have two source-level profiles: app.freq.prof and app.misp.prof.
If steps 2-3 occurred multiple times, merge profiles with something like llvm-profdata merge --sample run1.freq.prof run2.freq.prof run3.freq.prof --output app.freq.prof. This is useful for training against multiple datasets.
NOTE: The frequency and mispredict profiles should not be merged.
Recompile specifying the profile information to the compiler.
icx -xCORE-AVX512 -Ofast app.c -o app -fprofile-sample-use=app.freq.prof -mllvm -unpredictable-hints-file=app.misp.prof
You may add -fprofile-sample-generate to the above if additional feedback iterations are desirable.
- Optionally, repeat by jumping back to step 2.
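Because the sampling period and event names given to perf record must match those given to llvm-profgen, it can help to define them once and reuse them. The following Linux sketch does this for the steps above. The variable and function names (PERIOD, FREQ_EVENT, MISP_EVENT, run, collect_and_convert) are hypothetical, and commands are only printed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: keep the sampling period and event names in one place so that
# perf record and llvm-profgen always agree. All variable and function
# names are hypothetical. Commands are printed unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
PERIOD=1000003
FREQ_EVENT=br_inst_retired.near_taken:uppp
MISP_EVENT=br_misp_retired.all_branches:upp

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

collect_and_convert() {
    # Step 2: collect both events in a single training run.
    run perf record -o app.perf.data -b -c "$PERIOD" \
        -e "$FREQ_EVENT,$MISP_EVENT" -- ./app || return 1
    # Step 3: one llvm-profgen invocation per event type.
    run llvm-profgen --perfdata app.perf.data --binary app \
        --output app.freq.prof --sample-period "$PERIOD" \
        --perf-event "$FREQ_EVENT" || return 1
    run llvm-profgen --perfdata app.perf.data --binary app \
        --output app.misp.prof --sample-period "$PERIOD" \
        --perf-event "$MISP_EVENT" --leading-ip-only || return 1
    # Step 4: recompile with both profiles, which are passed separately
    # and never merged with each other.
    run icx -xCORE-AVX512 -Ofast app.c -o app \
        -fprofile-sample-use=app.freq.prof \
        -mllvm -unpredictable-hints-file=app.misp.prof
}
```

Calling collect_and_convert with the default settings prints the four commands in order, which makes it easy to confirm that the period and event names are consistent before running them for real.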
Notes on Windows Support
The Intel® oneAPI DPC++/C++ Compiler provides a version of the llvm-profgen tool that understands Common Object File Format (COFF) binaries with associated Debugging with Attributed Record Formats (DWARF) debug information. The -fprofile-sample-generate option ensures that this debug information is generated.
The Linux perf tool is unavailable on Windows, but Intel® VTune™ includes a sep tool that can perform the relevant Last Branch Records (LBR) sampling on hardware events on both Windows and Linux.
Notes on the llvm-profgen and llvm-profdata tools
To ensure that you use the versions of these tools corresponding to the product compiler, you may use the following to locate them:
icx --print-prog-name=llvm-profgen
On Windows, icx is a driver that accepts MSVC-style (cl) command-line options, so the option must be passed through /clang:
icx /nologo /clang:--print-prog-name=llvm-profgen
Alternatively, the --include-intel-llvm option to setvars scripts will place these tools in PATH.
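In a Linux script, the lookup above could be wrapped in a small helper so that the compiler's own copy of a tool is preferred when icx is available. This is a minimal sketch: find_llvm_tool is a hypothetical name, and falling back to whatever is on PATH is an assumption about what you want when icx is not found.

```shell
#!/bin/sh
# find_llvm_tool NAME   (hypothetical helper, Linux-style icx options)
# Prefer the tool that ships with icx; otherwise fall back to PATH.
find_llvm_tool() {
    if command -v icx >/dev/null 2>&1; then
        icx --print-prog-name="$1"
    else
        command -v "$1"
    fi
}
```

A script could then use, for example, PROFGEN=$(find_llvm_tool llvm-profgen) and invoke "$PROFGEN" in place of a bare llvm-profgen.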