Tools to Analyze Performance of OpenMP Applications
Several tools and mechanisms can help you analyze the performance of OpenMP programs and identify bottlenecks.
Intel® VTune™ Profiler. Intel® VTune™ Profiler can be used to analyze the performance of an application. It helps identify the most time-consuming (hot) functions in the application, whether the application is CPU- or GPU-bound, how effectively it offloads code to the GPU, and the best sections of code to optimize for sequential and threaded performance, among other things. For more information about VTune Profiler, refer to the Intel® VTune™ Profiler User Guide.
Level Zero Tracer. The Level Zero Tracer (ze_tracer) is a host and device tracing tool for the Level Zero backend, with support for SYCL and OpenMP GPU offload. For information about this tool, see the Level Zero Tracer section of this document.
When using ze_tracer with the -h and -d options, look at host- and device-side summaries at the end of the trace, under the headings “API Timing Results” and “Device Timing Results”, respectively.
Note that only explicit data transfers appear in the trace. Transfers of data allocated in Unified Shared Memory (USM) may not appear in the trace.
Note:
ze_tracer is useful for confirming that offloading of oneMKL kernels has occurred. The OMP_TARGET_OFFLOAD=MANDATORY environment variable does not affect oneMKL, and therefore cannot be used to guarantee that oneMKL kernels have been offloaded. One way to check that oneMKL kernels (and other kernels) have been offloaded is to see which kernels are listed under “Device Timing Results” in the trace generated by ze_tracer.
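As a rough illustration of the check described in the note, the following shell sketch prints the part of a saved trace that starts at the “Device Timing Results” heading. The helper name device_timing_section and the file names my_app and trace.txt are hypothetical, and the sketch assumes only that the heading text appears verbatim in the saved trace.

```shell
# Print the part of a saved ze_tracer trace that starts at the
# "Device Timing Results" heading and runs to the end of the file.
# device_timing_section is a hypothetical helper, not part of ze_tracer.
device_timing_section() {
    sed -n '/Device Timing Results/,$p' "$1"
}

# Possible usage (my_app and trace.txt are placeholder names; stdout and
# stderr are both captured because the trace may go to either stream):
#   ze_tracer -h -d ./my_app > trace.txt 2>&1
#   device_timing_section trace.txt
```

Any kernel name expected to run on the device, including oneMKL kernels, can then be searched for in the printed section.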
SYCL_PI_TRACE=2 environment variable. The DPC++ Runtime Plugin Interface (PI) is an interface layer between the device-agnostic part of the SYCL runtime and the device-specific runtime layers that control execution on devices. Setting SYCL_PI_TRACE=2 provides a trace of all PI calls made, with their arguments and returned values. For more information, see the DPC++ Runtime Plugin Interface documentation.
LIBOMPTARGET_DEBUG=1 environment variable. LIBOMPTARGET_DEBUG controls whether or not debugging information from libomptarget.so will be displayed.
The debugging output provides useful information about things like ND-range partitioning of loop iterations, data transfers between host and device, and memory usage, as shown in the Using More GPU Resources and Minimizing Data Transfers and Memory Allocations sections of this document.
For more information about LIBOMPTARGET_DEBUG, see LLVM/OpenMP Runtimes.
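A minimal sketch of one way to collect this debugging output is shown below. The helper name run_with_omptarget_debug and the names my_app and debug.log are hypothetical; the helper sets LIBOMPTARGET_DEBUG=1 for a single command and saves both stdout and stderr, since the sketch does not assume which stream the runtime writes to.

```shell
# Run a command with LIBOMPTARGET_DEBUG=1 set for that command only, and
# save everything it prints (stdout and stderr) to a log file.
# run_with_omptarget_debug is a hypothetical helper name.
run_with_omptarget_debug() {
    log_file=$1
    shift
    LIBOMPTARGET_DEBUG=1 "$@" > "$log_file" 2>&1
}

# Possible usage (my_app and debug.log are placeholder names):
#   run_with_omptarget_debug debug.log ./my_app
#   less debug.log
```

Saving the output to a file keeps the debugging information separate from a normal run and makes it easier to search afterwards.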
LIBOMPTARGET_PLUGIN_PROFILE environment variable. LIBOMPTARGET_PLUGIN_PROFILE allows libomptarget.so to generate time profile output. For more information, see LLVM/OpenMP Runtimes.
Dump of compiler-generated assembly for the device. You can dump the compiler-generated assembly by setting the following two environment variables before doing Just-In-Time (JIT) compilation (or before running the program in the case of Ahead-Of-Time (AOT) compilation).
export IGC_ShaderDumpEnable=1
export IGC_DumpToCustomDir=my_dump_dir
LLVM IR, assembly, and GenISA files will be dumped in the sub-directory named my_dump_dir (or any other name you choose). In this sub-directory, you will find a *.asm file for each kernel. The filename indicates the source line number on which the kernel occurs. The header of the file provides the SIMD width, the compiler options, and other information. Note that the generated assembly is specific to the GPU in use: on Arctic Sound, Arctic Sound assembly is generated, while on Ponte Vecchio, Ponte Vecchio assembly is generated.
Also, in my_dump_dir, you will find a file named HardwareCaps.txt that provides information about the GPU, such as EU count, thread count, and slice count.
For more information about the Intel® Graphics Compiler and a listing of available flags (environment variables) to control the compilation, see Intel® Graphics Compiler for OpenCL™ Configuration Flags for Linux Release.
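Once a dump exists, a short shell sketch such as the following can help locate the files described above. The helper name list_igc_dump is hypothetical; it relies only on the *.asm extension and the HardwareCaps.txt filename mentioned in this section.

```shell
# List the per-kernel assembly files in an IGC dump directory and report
# whether HardwareCaps.txt is present.
# list_igc_dump is a hypothetical helper; the directory is whatever value
# was given to IGC_DumpToCustomDir.
list_igc_dump() {
    dump_dir=$1
    find "$dump_dir" -type f -name '*.asm' | sort
    if [ -f "$dump_dir/HardwareCaps.txt" ]; then
        echo "found: $dump_dir/HardwareCaps.txt"
    else
        echo "missing: $dump_dir/HardwareCaps.txt"
    fi
}

# Possible usage after a run with the two IGC variables exported:
#   list_igc_dump my_dump_dir
```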
For additional information about debugging and profiling, refer to the Debugging and Profiling section of this document.