Trace the Offload Process
When a program that offloads computation to a GPU starts, there are a lot of moving parts involved in program execution: machine-independent code needs to be compiled to machine-dependent code, data and binaries need to be copied to the device, results need to be returned, and so on. This section discusses how to trace all of this activity using the tools described in the oneAPI Debug Tools section.
Kernel Setup Time
Before offload code can run on the device, the machine-independent version of the kernel needs to be compiled for the target device, and the resulting code needs to be copied to the device. If this kernel setup time is not accounted for, it can complicate or skew benchmark results. Just-in-time compilation can also introduce a noticeable delay when debugging an offload application.
If you have an OpenMP* offload program, setting LIBOMPTARGET_PLUGIN_PROFILE=T[,usec] explicitly reports the time required to build the offload code (“ModuleBuild”), which you can compare to the overall execution time of your program.
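For example, a profiling run might look like the following (the application name ./omp_app is a placeholder):

    # Report plugin timings, including ModuleBuild, in microseconds
    LIBOMPTARGET_PLUGIN_PROFILE=T,usec ./omp_app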
Kernel setup time is more difficult to determine if you have a SYCL* offload program.
If Level Zero or OpenCL™ is your backend, you can derive kernel setup time from the Device Timing and Device Timeline returned by onetrace or ze_tracer.
If OpenCL™ is your backend, you may also be able to derive the information by setting the BuildLogging, KernelInfoLogging, CallLogging, CallLoggingElapsedTime, HostPerformanceTiming, HostPerformanceTimeLogging, or ChromeCallLogging flags when using the Intercept Layer for OpenCL™ Applications. You can also derive kernel setup time from the Device Timing and Device Timeline returned by onetrace or cl_tracer.
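As a sketch, the Device Timing and Device Timeline modes correspond to command-line flags of the tracers (the application name ./sycl_app is a placeholder; confirm the flag spellings against your tool version):

    # Per-kernel device timings plus a timeline of device activities;
    # ze_tracer and cl_tracer accept the same flags
    onetrace --device-timing --device-timeline ./sycl_app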
You can also use these tools to supplement the information returned by LIBOMPTARGET_PLUGIN_PROFILE=T.
For details on how Intel® VTune™ Profiler can analyze kernel setup time, see Enable Linux* Kernel Analysis.
Monitoring Buffer Creation, Sizes, and Copies
Understanding when buffers are created, how many buffers are created, and whether they are reused or constantly created and destroyed can be key to optimizing the performance of your offload application. This may not always be obvious when using a high-level programming language like OpenMP or SYCL, which can hide a lot of the buffer management from the user.
At a high level, you can track buffer-related activities using the LIBOMPTARGET_DEBUG and SYCL_PI_TRACE environment variables when running your program. LIBOMPTARGET_DEBUG gives you more information than SYCL_PI_TRACE: it reports the addresses and sizes of the buffers created. By contrast, SYCL_PI_TRACE reports only the API calls, with no information you can easily tie to the location or size of individual buffers.
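For example (application names are placeholders; the value 2 for SYCL_PI_TRACE is assumed to select API-call tracing):

    # OpenMP: verbose libomptarget output, including buffer addresses and sizes
    LIBOMPTARGET_DEBUG=1 ./omp_app

    # SYCL: trace plugin interface (PI) API calls
    SYCL_PI_TRACE=2 ./sycl_app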
At a lower level, if you are using Level Zero or OpenCL™ as your backend, the Call Logging mode of onetrace or ze_tracer will give you information on all API calls, including their arguments. This can be useful because, for example, a buffer-creation call (such as zeMemAllocDevice) shows the size of the resulting buffer being passed to and from the device. onetrace and ze_tracer also allow you to dump all the Level Zero device-side activities (including memory transfers) in Device Timeline mode. For each activity, you can get the append (to command list), submit (to queue), start, and end times.
If you are using OpenCL as your backend, setting the CallLogging, CallLoggingElapsedTime, and ChromeCallLogging flags when using the Intercept Layer for OpenCL™ Applications should give you similar information. The Call Logging mode of onetrace or cl_tracer will give you information on all OpenCL API calls, including their arguments. As was the case above, onetrace and cl_tracer also allow you to dump all the OpenCL device-side activities (including memory transfers) in Device Timeline mode.
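A sketch covering either backend (the application name is a placeholder):

    # Log each API call with its arguments (for example, the size passed to
    # zeMemAllocDevice) and dump append/submit/start/end times per activity
    onetrace --call-logging --device-timeline ./sycl_app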
Total Transfer Time
Comparing total data transfer time to kernel execution time can be important for determining whether it is profitable to offload a computation to a connected device.
If you have an OpenMP offload program, setting LIBOMPTARGET_PLUGIN_PROFILE=T[,usec] explicitly reports the time spent allocating data on the offload device (“DataAlloc”), transferring data from the device (“DataRead”), and transferring data to the device (“DataWrite”) (although only in aggregate).
Data transfer times can be more difficult to determine if you have a C++ program using SYCL.
If Level Zero or OpenCL™ is your backend, you can derive total data transfer time from the Device Timing and Device Timeline returned by onetrace or ze_tracer.
If OpenCL is your backend, you can use onetrace or cl_tracer, or alternatively you may be able to derive the information by setting the BuildLogging, KernelInfoLogging, CallLogging, CallLoggingElapsedTime, HostPerformanceTiming, HostPerformanceTimeLogging, or ChromeCallLogging flags when using the Intercept Layer for OpenCL Applications.
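With the Intercept Layer installed, each control can typically be enabled through an environment variable with the CLI_ prefix (the prefix is an assumption about the layer's configuration mechanism; the application name is a placeholder):

    # Log and time host API calls, including the transfer-related ones
    CLI_CallLogging=1 CLI_CallLoggingElapsedTime=1 CLI_ChromeCallLogging=1 ./ocl_app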
For details on how Intel® VTune™ Profiler can analyze data transfer time, see these sections of the Intel® VTune™ Profiler User Guide: GPU Offload Analysis, GPU Compute/Media Hotspots View, and Hotspots Report.
Kernel Execution Time
If you have an OpenMP offload program, setting LIBOMPTARGET_PLUGIN_PROFILE=T[,usec] explicitly reports the total execution time of every offloaded kernel (“Kernel#…”).
For programs using SYCL to offload kernels:
If Level Zero or OpenCL™ is your backend, the Device Timing mode of onetrace or ze_tracer will give you the device-side execution time for every kernel.
If OpenCL is your backend, you can use onetrace or cl_tracer, or alternatively you may be able to derive the information by setting the CallLoggingElapsedTime, DevicePerformanceTiming, DevicePerformanceTimeKernelInfoTracking, DevicePerformanceTimeLWSTracking, DevicePerformanceTimeGWSTracking, ChromePerformanceTiming, or ChromePerformanceTimingInStages flags when using the Intercept Layer for OpenCL™ Applications.
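For example (application names are placeholders; the CLI_ environment-variable form of the Intercept Layer controls is an assumption):

    # Tracers: device-side execution time for every kernel
    onetrace --device-timing ./sycl_app

    # Intercept Layer for OpenCL Applications: per-kernel device timing
    CLI_DevicePerformanceTiming=1 CLI_ChromePerformanceTiming=1 ./ocl_app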
For details on how Intel® VTune™ Profiler can analyze kernel execution time, see Accelerators Analysis Group.
When Device Kernels are Called and Threads are Created
On occasion, offload kernels are created and transferred to the device long before they actually start executing; execution usually begins only after all the data required by the kernel has also been transferred, along with control.
You can set a breakpoint in a device kernel using the Intel® Distribution for GDB* and a compatible GPU. From there, you can query kernel arguments, monitor thread creation and destruction, list the current threads and their current positions in the code (using “info threads”), and so on.
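A minimal session sketch (the binary name, source line, and variable name are placeholders):

    gdb-oneapi ./sycl_app
    (gdb) break 54               # a source line inside the device kernel body
    (gdb) run
    (gdb) info threads           # list host and device threads and where they stopped
    (gdb) print result           # inspect a kernel argument or local variable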