Troubleshoot Highly Parallel Applications
Overview
Running compute-intensive code entirely on the CPU can strain its resources. Moving that work to attached accelerators such as GPUs and FPGAs, which frees up CPU resources, is referred to as offloading. This workflow shows you the steps to take to troubleshoot applications that use OpenMP* or the SYCL* API with extensions to offload work.
What You Will Learn
This workflow provides a recommended troubleshooting path, along with documentation and resources for common problems. For hands-on learning, the workflow references five samples, all based on matrix multiply. Each sample includes source code with errors and source code that illustrates the solution, with step-by-step instructions.
Samples included:
- Guided Matrix Multiply Invalid Contexts
- Guided Matrix Multiply Exceptions
- Guided Matrix Multiply Race Conditions
- Guided Matrix Multiplication Bad Buffers
- Guided Matrix Multiplication Illegal Shared Local Memory (SLM) Size
Who This Is For
Software developers familiar with targeting attached accelerators (such as GPUs and FPGAs) using the Intel® oneAPI Base Toolkit and Intel® oneAPI HPC Toolkit software.
Configure Your oneAPI Environment
- Use the following guides:
- oneAPI Installation Guide
- Get Started Guides Linux* | Windows* | macOS*
- Install Intel® Toolkits and Intel® Graphics Compute Runtime in HPC Cluster Environment
- If deploying to a remote target, you may need the latest runtime versions.
- Install the latest driver updates for OpenCL™ platform and oneAPI Level Zero
- Use the Diagnostics Utility for Intel toolkits to check for common configuration problems. This is included in your oneAPI installation.
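The environment setup above can be sketched in a shell session. The install prefix below is the common Linux default and is an assumption; adjust it to your installation:

```shell
# Default oneAPI install prefix on Linux; an assumption, adjust as needed.
ONEAPI_ROOT="${ONEAPI_ROOT:-/opt/intel/oneapi}"
if [ -f "$ONEAPI_ROOT/setvars.sh" ]; then
  # setvars.sh configures PATH, LD_LIBRARY_PATH, and related variables
  # for all installed toolkit components.
  . "$ONEAPI_ROOT/setvars.sh"
  echo "oneAPI environment configured from $ONEAPI_ROOT"
else
  echo "setvars.sh not found under $ONEAPI_ROOT"
fi
```

After sourcing, tools such as the compiler driver and the Diagnostics Utility are on your PATH.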
Get Help
At any point in the following steps, you can also submit requests to the Online Service Center. Once you are signed in:
- Select Request Support, and then select Choose from a List.
- From the list, select Software, and then Development Software.
- Choose either Compiler or GPU Software Stack, and then include the output of errors in your request.
Step 1: Prepare the Application
Prepare and write your application:
- Start with a serial implementation of the algorithm that you can use to verify expected results.
- Identify areas of the code that might benefit from parallelism, and then implement them as parallel loops or parallel kernel invocations.
By default, SYCL applications use the oneAPI Level Zero runtime, which provides a low-level, direct-to-metal interface for the devices in a oneAPI platform.
Kernel-Based Application Development
Take advantage of additional parallelism on attached compute accelerators by implementing some of the parallelism using the kernel-based approach from SYCL. Test it during host-only execution using the OpenCL driver for the CPU. (Many issues are easier to debug on the CPU.)
Throughout this process, check the results of the parallel implementation against the serial implementation with various real-world datasets.
Tip If the code fails to build, the most common causes are compilation and linking failures. Compile with -save-temps -v to generate verbose output, which can show whether the failure is happening in one of the component tools, such as clang-offload-bundler, llvm-link, or llvm-spirv.
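A hedged sketch of that verbose build follows; the source file name matmul.cpp is hypothetical, and icpx is the Intel oneAPI DPC++/C++ compiler driver:

```shell
# Guard on the compiler being available; outside a configured oneAPI
# environment this reports how to proceed instead of failing.
if command -v icpx >/dev/null 2>&1; then
  # -fsycl enables SYCL offload; -save-temps -v keeps intermediates and
  # prints each tool invocation (clang-offload-bundler, llvm-link, ...).
  icpx -fsycl -save-temps -v matmul.cpp -o matmul || true
  msg="build attempted; inspect the verbose tool invocations above"
else
  msg="icpx not on PATH; source setvars.sh from your oneAPI install first"
fi
echo "$msg"
```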
Resources
- oneAPI Level Zero runtime
- Intel® 64 and IA-32 Architectures Software Developer Manuals
- Intel compilers and optimizations: Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference
- Intel oneAPI Programming Guide
- Intel-optimized libraries: Intel oneAPI Programming Guide: oneAPI Library Overview
- Intel oneAPI GPU Optimization Guide
Step 2: Resolve Build and Runtime Crashes
To debug failed attempts to run parallel code (kernels) on a specific device (CPU or GPU), do the following:
- Run the kernels on the CPU before trying on the GPU, because CPU debugging tends to be easier. Once your code is running correctly on the CPU, target the GPU using either the OpenCL or oneAPI Level Zero runtimes.
- If your code fails when it tries to run a kernel, but does not actually fail inside the kernel itself (such as in a library or driver you did not write), troubleshoot the problem at the runtime and driver level.
Build Your Application without Optimizations
Building without optimizations makes it possible to follow all local and passed variables and to get reliable line numbers, which makes it easier to find the root cause of issues like memory overruns or bad pointers.
Some applications only show problems when built with optimizations. To find the root cause of those issues, debug the optimized build only after you have fixed every problem that can be reproduced in a build without optimizations.
First: Run the Application on the CPU
If your program fails when it attempts to call a kernel, the problem may be the result of an error that was detected by the SYCL, OpenMP, or OpenCL runtimes.
To fix this error:
- Force your application to use the CPU using SYCL environment variables.
- By default, SYCL applications use the oneAPI Level Zero runtime. The SYCL environment variables also allow you to switch to the OpenCL runtime for testing.
- If you experience runtime crashes:
- To get a summary of how your program got where it is, conduct a backtrace. See Backtrace in the Intel® Distribution for GDB* User Guide.
- To generate trace reports, see Profiling Tools Interfaces for GPU (PTI for GPU):
- oneTrace*: A host and device tracing tool for the OpenCL runtime and oneAPI Level Zero back ends.
- zeTrace: A tracing and profiling tool for oneAPI Level Zero API calls.
- Use the Intel Distribution for GDB for application-level debugging.
- Repeat the previous steps until you can verify that your kernel starts to run.
- Once you are sure that your offload kernel code runs on the CPU, try to run it on the GPU.
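The CPU-first device selection described above can be forced with the ONEAPI_DEVICE_SELECTOR environment variable on current releases (older releases used SYCL_DEVICE_FILTER). The binary name ./matmul is hypothetical:

```shell
# Force SYCL to pick the OpenCL CPU device for the next run.
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
# ./matmul
# Once the CPU run is clean, retarget the GPU through either back end:
# ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./matmul
# ONEAPI_DEVICE_SELECTOR=opencl:gpu ./matmul
echo "device selector: $ONEAPI_DEVICE_SELECTOR"
```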
Resources
- Debug the DPC++ and OpenMP Offload Process.
- For instructions on how to use the Intel Distribution for GDB, see the Guided Matrix Multiply Exceptions sample available on GitHub*.
Tips for GPU Offload
- Start small. Go one kernel at a time.
- Run your kernel on only a few threads and build from there. Stepping through a few threads to find problems is much easier than stepping through thousands of threads.
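A minimal sketch of that few-threads workflow with the Intel Distribution for GDB follows. The kernel symbol matmul_kernel and the binary ./matmul are hypothetical; scheduler-locking is a standard GDB setting that keeps other threads stopped while you step:

```shell
# Write a small command file for gdb-oneapi, the Intel Distribution
# for GDB driver.
cat > gdb_cmds.txt <<'EOF'
break matmul_kernel
run
set scheduler-locking step
info threads
EOF
# gdb-oneapi -x gdb_cmds.txt ./matmul
echo "prepared $(wc -l < gdb_cmds.txt) debugger commands"
```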
Next: Run the Application on the GPU
If the kernel fails to run on the GPU, the problem is likely related to the OpenMP or SYCL runtime, the OpenCL Driver, or the oneAPI Level Zero driver.
To help triage your application, follow these steps.
- Force your application to use the GPU using SYCL environment variables.
- If you experience runtime crashes, try switching from Just in Time (JIT) compilation to Ahead of Time (AOT) Compilation.
- To switch between the default oneAPI Level Zero runtime and OpenCL runtime, use SYCL environment variables.
- If you continue to experience runtime crashes, generate trace reports using Profiling Tools Interfaces for GPU (PTI for GPU):
- oneTrace: A host and device tracing tool for the OpenCL runtime and oneAPI Level Zero back ends.
- zeTrace: A tracing and profiling tool for oneAPI Level Zero API calls.
- If your runtime stops responding:
- Use the Profiling Tools Interfaces for GPU (PTI for GPU) to identify which operation is not responding and to further localize investigation.
- To get a summary of the status of your program, conduct a backtrace. See Backtrace in the Intel Distribution for GDB User Guide.
Note On some GPUs, Ctrl+C may be used to recover the system when it stops responding.
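The JIT-to-AOT switch can be sketched as follows. The spir64_gen target and the device name passed to the back end depend on your hardware, so treat the exact values as assumptions and check the compiler reference; matmul.cpp is a hypothetical source file:

```shell
if command -v icpx >/dev/null 2>&1; then
  # AOT-compile device code for an Intel GPU instead of JIT-compiling it
  # at first kernel launch. "-device pvc" names a GPU architecture and is
  # only an example value; substitute your target.
  icpx -fsycl -fsycl-targets=spir64_gen \
       -Xsycl-target-backend "-device pvc" matmul.cpp -o matmul_aot || true
  msg="AOT build attempted"
else
  msg="icpx not on PATH; source setvars.sh first"
fi
echo "$msg"
```

AOT compilation moves device-code compilation errors from run time to build time, which often makes runtime crashes easier to localize.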
Resources
To see how to use oneTrace from the PTI for GPU set, see the Guided Matrix Multiply Invalid Contexts and Guided Matrix Multiplication Illegal Shared Local Memory (SLM) Size samples available on GitHub.
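A sketch of invoking the tracers named above: the -c flag requests host API call logging in the PTI for GPU tools, but treat flag names as assumptions and check each tool's help output; ./matmul is a hypothetical binary.

```shell
run_trace() {
  # $1: tracer binary (oneTrace or zeTrace per the text above); $2: app.
  if command -v "$1" >/dev/null 2>&1; then
    "$1" -c "$2"   # -c: log host API calls (assumption; verify with --help)
  else
    echo "$1 not on PATH; build PTI for GPU from source"
  fi
}
run_trace onetrace ./matmul
```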
Step 3: Resolve Application-Level Problems
With application kernels now running, program crashes or runtime problems are most likely caused by errors in your code.
- Focus on debugging kernel execution using the Intel Distribution for GDB, which is a model more closely aligned with traditional debug techniques.
- Compare your application results between CPU-only and accelerated implementations (where some of the program runs on an attached compute accelerator, like a GPU). If necessary, compare your results with the original application. If the results differ by more than the expected precision differences, there may be a problem in one of the implementations.
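Comparing CPU and accelerated results usually needs a tolerance rather than an exact diff, because floating-point operations may be reordered on the accelerator. A minimal sketch, assuming one value per line in each results file (the file names and sample values are placeholders standing in for real CPU and GPU runs):

```shell
# Stand-in result files; in practice these come from the two runs.
printf '1.0\n2.0\n3.0\n' > cpu.out
printf '1.0000001\n2.0\n3.0\n' > gpu.out
# Compare line by line with a relative tolerance of 1e-5 using awk.
awk 'NR==FNR { a[FNR]=$1; next }
     { d = $1 - a[FNR]; if (d < 0) d = -d;
       ref = a[FNR]; if (ref < 0) ref = -ref; if (ref < 1) ref = 1;
       if (d > 1e-5 * ref) { print "mismatch at line " FNR; bad = 1 } }
     END { exit bad }' cpu.out gpu.out && echo "results match within tolerance"
```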
Resources
- Intel Distribution for GDB: Linux | Windows
- Debug the Offload Process
Tips for GPU Debugging
As you scale up your problem sizes and thread count, continue to debug while monitoring your overall performance improvements and correctness.
Note As you continue to modify and expand your kernel-based code, you may encounter errors or problems that require you to revert to Step 2.
- Use the Intel Distribution for GDB for application-level debugging.
- See the Intel Distribution for GDB Get Started Guide: Linux | Windows
- For common strategies, see Debug the Offload Process in the Intel oneAPI Programming Guide.
- To learn more about debugging programs with multiple threads, see Chapter 4.10 of the Intel Distribution for GDB User Guide.
- If your results are incorrect:
- Verify that the results are still accurate on the CPU.
- Use Intel® Inspector or Valgrind* to find correctness problems such as bad pointers or buffer overruns in GPU-offloaded code. Running that code on the CPU OpenCL driver lets these tools catch issues that would otherwise surface only on the GPU.
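For instance, a Valgrind memcheck run of the CPU build might look like this; the binary name ./matmul is hypothetical, and --track-origins helps trace a bad pointer back to its allocation:

```shell
if command -v valgrind >/dev/null 2>&1; then
  # Force the OpenCL CPU device so memcheck can observe the offload code.
  ONEAPI_DEVICE_SELECTOR=opencl:cpu \
    valgrind --leak-check=full --track-origins=yes ./matmul || true
  note="memcheck run attempted"
else
  note="valgrind not installed; e.g. 'apt install valgrind' on Debian/Ubuntu"
fi
echo "$note"
```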
To learn how to use GDB stack traces to locate problems in your code, see the Guided Matrix Multiply Race Conditions and Guided Matrix Multiplication Bad Buffers samples available on GitHub.
To determine how well the resulting application is working, use Intel® VTune™ Profiler.
- To make sure that the application is spending its runtime in appropriate places, run real workloads.
- Unless it is compute intensive, the code that feeds your kernels should not take much time to run. As you increase the number of threads and available hardware, the application runtime should decrease.
- Look for unexpected time spent transferring or waiting for data, or look for kernels that are taking longer than they should due to atomics, memory contention, or other issues. Use Intel VTune Profiler plus Intel® Advisor to understand how well your kernels are using the offload device and overlapping work.
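A hedged sketch of collecting that data with Intel VTune Profiler: gpu-offload is the analysis type for host/device activity on current releases, but treat the exact option names as assumptions and check the VTune documentation; the result directory and ./matmul binary are placeholders.

```shell
if command -v vtune >/dev/null 2>&1; then
  # Collect host/device timing, data-transfer, and kernel activity.
  vtune -collect gpu-offload -result-dir r_offload -- ./matmul || true
  state="collection attempted; open r_offload in the VTune GUI"
else
  state="vtune not on PATH; source setvars.sh from your oneAPI install"
fi
echo "$state"
```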
Resources
Intel VTune Profiler enables you to measure and tune the performance of the entire application, not just the accelerated portion. This helps you find bottlenecks and opportunities for optimization throughout your entire application.
Roofline analysis with Intel Advisor can help show you how much optimization is available to you on different hardware configurations for each kernel. It can also show where in the pipeline to focus your attention to maximize throughput on the GPU.
The optimization advice from these two tools is complementary. Intel VTune Profiler gives you an overall assessment, and Intel Advisor identifies how much further you can improve performance in each kernel.
As you optimize your code, changes in your application may introduce errors and problems that require you to revert to an earlier step.