Optimize Your GPU Application with the Intel® oneAPI Base Toolkit
What You Will Learn
Use this step-by-step guide to learn graphics processing unit (GPU) optimization with Intel’s latest discrete and integrated GPUs and Intel® oneAPI software.
- The primary path focuses on SYCL*, a standard based on ISO C++ that incorporates Khronos SYCL and community extensions to simplify data-parallel programming.
- Alternatively, follow a path tailored to your needs using the tips and resources referenced in each section.
Compare the benefits of CPUs, GPUs, and FPGAs for different oneAPI compute workloads.
Who This Is For
Software developers who are interested in accelerating their code using Intel's latest integrated or discrete GPUs and SYCL standards for cross-architecture code reuse. Start from your own C++ or CUDA* code or use one of Intel’s many sample applications.
You will need access to an Intel® GPU and the Intel® oneAPI Base Toolkit software. You can use your local development system or the Intel® Tiber™ Developer Cloud, a development sandbox that gives you access to Intel GPUs and oneAPI software tools. To help you choose, review Step 1: Choose Your GPU Hardware Access.
The Workflow
Step 1: Choose Your GPU Hardware Access.
Step 2: Choose Your Sample Code.
Step 3: Assess Code for Offload Opportunities with Intel® Advisor.
Step 4: Offload and Optimize Code Using Intel® Compilers and Libraries.
Step 5: Evaluate Offload Efficiency with Intel Advisor.
Step 6: Review Overall Application Performance with Intel® VTune™ Profiler.
Step 1: Choose Your GPU Hardware Access
Whether you have the latest Intel® architecture or need cloud resources, your optimization strategy starts with hardware. To make the most of your GPU offload, you need to understand which hardware resources you are optimizing for.
Hardware and Software Options
- Use Intel hardware and software in the Intel Tiber Developer Cloud.
- Use your own graphics processor from Intel and install the free Intel oneAPI Base Toolkit.
Intel® Tiber™ Developer Cloud

Attend training or a hackathon to get access to the Intel Tiber Developer Cloud. See the full list of upcoming events and register to improve your skills: Events Calendar.

- Hardware: The environment enables you to optimize code on real Intel GPU hardware, such as the new Intel® Iris® Xᵉ MAX GPU. Intel® Xeon® Scalable processors and Intel® FPGAs are also available.
- Software: Intel software is preinstalled in the environment and ready to use, including the Intel oneAPI Base Toolkit and add-on toolkits for HPC, AI, IoT, and more.

Graphics Processor from Intel

Use your own development environment with graphics hardware from Intel and the free Intel oneAPI Base Toolkit.

- Hardware: Take advantage of Intel’s first discrete GPU with Intel® Iris® Xᵉ MAX graphics to explore offload and optimization strategies on the latest Intel hardware. Many Intel® platforms include integrated GPUs that can be used for basic GPU offload proofs of concept and pathfinding.
- Software: The Intel oneAPI Base Toolkit provides the core set of tools and libraries for developing high-performance applications across diverse architectures.

| | Intel Tiber Developer Cloud | Graphics Processor from Intel |
|---|---|---|
| Operating System | Host: Windows*, Linux*, macOS*; remote environment: Linux | Linux |
| Software | Intel oneAPI Base Toolkit (preinstalled) | Intel oneAPI Base Toolkit |
| GPU | GPUs available: Intel Iris Xᵉ MAX GPU nodes | Supported Intel GPUs |
| Language | SYCL | SYCL |
| Interface | Command-line interface (CLI) | Command-line interface (CLI) |
Step 2: Choose Your Sample Code
You can start from an Intel sample, your existing CUDA source code, or your own C++ application.
Use Intel oneAPI Sample Code
The Intel oneAPI Base Toolkit includes a large collection of samples that demonstrate a range of methods you can use for parallelism on the GPU using SYCL. Get started with SYCL and GPU optimization using the guided ISO3DFD sample. This compute-intensive sample performs a simulation of acoustic isotropic wave propagation in a 3D medium.
Follow the instructions in the sample README file to learn how to use SYCL to adapt CPU-based code to offload resource-intensive calculations to a GPU. Work through several sample optimization iterations using Intel Advisor. Download the sample from GitHub* to Intel Tiber Developer Cloud or your development environment.
Guided ISO3DFD GPU Optimization
View All oneAPI Samples on GitHub
Resources
- Use the command line to browse and download samples: Guide | Video
- Explore samples using Eclipse*: Guide | Video
- Use the Microsoft Visual Studio Code* extension to browse for samples: Guide
Use Your Own Code
Migrate Existing CUDA* Code
Migrate your existing CUDA code to a multiplatform program in SYCL. The Intel® DPC++ Compatibility Tool ports both CUDA language kernels and library API calls, migrating 80% to 90% of CUDA code automatically to architecture and vendor-portable code. Inline comments help you finish writing and tuning your code. Sample CUDA projects are also available to help you with the entire porting process.
- Migrate your CUDA code with the Intel DPC++ Compatibility Tool.
- Proceed to Step 3 to build your application and use Offload Modeling to evaluate your code for further offload opportunities.
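As a sketch, a typical migration run with the Intel DPC++ Compatibility Tool (`dpct`) looks like the following. The project layout, file names, and install path are hypothetical, and exact options may vary by tool version:

```shell
# Set up the oneAPI environment (default Linux install location assumed).
source /opt/intel/oneapi/setvars.sh

# Migrate CUDA sources under ./src into SYCL sources under ./sycl_src.
dpct --in-root=./src --out-root=./sycl_src ./src/vector_add.cu

# Review the generated .dp.cpp files, resolve any DPCT-prefixed warning
# comments left inline by the tool, then build with the SYCL compiler.
icpx -fsycl ./sycl_src/vector_add.dp.cpp -o vector_add
```

The inline DPCT comments mark places where the tool could not migrate code automatically; finishing those by hand is the remaining 10% to 20% of the port.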
Optimize Your Own C++ Projects
To use your own C++ code, set up your development environment and continue with this workflow. You can apply these optimization techniques directly to your existing projects.
- Copy your C++ application to the Intel Tiber Developer Cloud
or
- Use your development environment
Next, proceed to Step 3 to build your application, and then use Offload Modeling to evaluate your C++ code for further offload opportunities.
Step 3: Assess Code for Offload Opportunities with Intel® Advisor
Intel Advisor analyzes your code and helps you identify the best opportunities for GPU offloading. The Offload Modeling feature provides performance speedup projections, estimates offload overhead, and pinpoints performance bottlenecks. Offload Modeling enables you to improve ROI by modeling different hardware solutions to maximize your performance.
Run Offload Modeling Analysis
Intel Advisor measures the data movement in your functions, the memory access patterns, and the amount of computation to project how code will perform on Intel GPUs. The code regions with the highest potential benefit should be your first targets for offloading.
Run Offload Modeling following these steps:
- Build your sample application using appropriate environment variables. (This is required for SYCL, OpenMP*, and OpenCL™ applications.)
- Run the Offload Modeling Analysis.
- Review the results.
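The steps above can be sketched on the command line as follows, assuming a Linux shell, an application binary named `./myapp`, and a project directory of your choosing; exact flags may vary by Intel Advisor version:

```shell
# Set up the oneAPI environment so the advisor CLI is on PATH.
source /opt/intel/oneapi/setvars.sh

# Build the application with debug info so Advisor can map results to source.
icpx -fsycl -g -O2 myapp.cpp -o myapp

# Collect Offload Modeling data; results land in ./advi_results.
advisor --collect=offload --project-dir=./advi_results -- ./myapp
```

The collection produces an interactive HTML report under the project directory that ranks code regions by projected offload speedup.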
Tip Intel Advisor also offers a graphical user interface for creating projects and running analyses.
Step 4: Offload and Optimize with Intel Compilers and Libraries
Select the best optimization strategy to modify your code based on your application needs, advice from Intel Advisor, and available hardware. Documentation, samples, and training help you make design decisions to maximize performance.
Implement GPU Offload
Start by offloading the recommended code to your GPU device based on the results from Intel Advisor. Next, use a combination of techniques to develop your parallelism strategy.
Basic SYCL* Framework
To write a SYCL application:
- Select the device.
- Declare the device queue.
- Declare buffers.
- Submit the job.
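The four steps above can be sketched as a minimal SYCL 2020 program. This is an illustrative vector addition (the sizes and names are made up), compiled with the Intel oneAPI DPC++/C++ Compiler via `icpx -fsycl`:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  // 1. Select the device and 2. declare the device queue.
  sycl::queue q{sycl::gpu_selector_v};
  {
    // 3. Declare buffers that wrap the host data.
    sycl::buffer bufA{a}, bufB{b}, bufC{c};

    // 4. Submit the job: accessors declare how the kernel uses each buffer.
    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bufA, h, sycl::read_only};
      sycl::accessor B{bufB, h, sycl::read_only};
      sycl::accessor C{bufC, h, sycl::write_only, sycl::no_init};
      h.parallel_for(n, [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  } // Buffer destruction synchronizes and copies results back to the host.
}
```

The buffer/accessor model lets the runtime schedule data movement; an alternative is unified shared memory (USM), shown later under host-device concurrency.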
Develop Parallelism Strategy
oneAPI recommends a combination of these techniques to develop your parallelism strategy:
- Intel®-Optimized Libraries: Intel oneAPI Programming Guide: oneAPI Library Overview
- Intel Compilers and Optimizations: Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference
- Parallel Programming Language or API: Intel oneAPI Programming Guide: DPC++
Optimize GPU Offload
Your GPU optimization strategy may vary based on your application and hardware. Review these categories for tips and instructions. Complete instructions are available in the Optimization Guide.
Occupancy
- Tune the global and local size to have enough threads to keep the GPU busy and hide latency. For more information, see SYCL Thread Hierarchy and Mapping.
- Run multiple kernels concurrently if they are independent and a single kernel cannot fully use all execution units. For more information, see Run Multiple Kernels on the Device at the Same Time.
- Minimize tail effects.
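For illustration, one common way to apply the first and third tips is to round the global size up to a multiple of the chosen work-group size and guard the padded tail inside the kernel. This is a sketch with made-up sizes, compiled with `icpx -fsycl`:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  const size_t n = 1000;  // deliberately not a multiple of the local size
  std::vector<float> in(n, 1.0f), out(n, 0.0f);
  sycl::queue q{sycl::gpu_selector_v};
  {
    sycl::buffer bin{in}, bout{out};
    const size_t local = 256;  // pick a multiple of the hardware sub-group width
    const size_t global = ((n + local - 1) / local) * local;  // round up
    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bin, h, sycl::read_only};
      sycl::accessor B{bout, h, sycl::write_only, sycl::no_init};
      h.parallel_for(sycl::nd_range<1>{global, local},
                     [=](sycl::nd_item<1> it) {
        size_t i = it.get_global_id(0);
        if (i < n) B[i] = A[i] * 2.0f;  // guard the padded tail items
      });
    });
  }
}
```

Rounding up keeps every work-group fully populated, so only the last group has idle work items instead of launching a ragged final group.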
Calculate GPU Occupancy
- Determine the occupancy of Intel GPUs using the Intel GPU Occupancy Calculator on GitHub
Device Kernel Code
- Avoid register spills.
- Adjust the sub-group size. For more information, see Subgroups.
- Use shared local memory to eliminate redundant global memory access. For details, see Shared Local Memory.
- Apply hierarchical atomic optimizations to reduce global atomic memory updates. For more information, see Data Types for Atomic Operations.
- Minimize synchronizations between work items and threads. For details, see Synchronization Among Threads in a Kernel.
- Minimize code divergence. For more information, see Removing Conditional Checks.
- Use directives and attributes to help the compiler to better optimize kernel code. For details, see Restrict Directive.
- Include optimized library functions. For more information, see Efficiently Implementing Fourier Correlation Using oneAPI Math Kernel Library.
- Consider advanced compiler optimization techniques.
Memory
- Tune memory access patterns to improve locality and cache use. For more information, see Subgroups.
- Optimize the payload of memory transactions.
- Take advantage of memory block loads and stores. For details, see Subgroups and Reduction.
- Use shared memory to reduce redundant global memory access. For more information, see Shared Local Memory.
- Avoid shared memory bank conflicts. For details, see Shared Local Memory.
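As a sketch of the shared-local-memory tips above, the following work-group reduction loads each input element from global memory once into SLM, then reduces within the fast local tile. The sizes are illustrative; compile with `icpx -fsycl`:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024, WG = 256;
  std::vector<float> in(N, 1.0f);
  std::vector<float> partial(N / WG, 0.0f);  // one partial sum per work-group
  sycl::queue q{sycl::gpu_selector_v};
  {
    sycl::buffer bin{in}, bpart{partial};
    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bin, h, sycl::read_only};
      sycl::accessor P{bpart, h, sycl::write_only, sycl::no_init};
      // One tile of shared local memory per work-group.
      sycl::local_accessor<float, 1> tile{sycl::range<1>{WG}, h};
      h.parallel_for(sycl::nd_range<1>{N, WG}, [=](sycl::nd_item<1> it) {
        size_t lid = it.get_local_id(0);
        tile[lid] = A[it.get_global_id(0)];   // one global load per work item
        sycl::group_barrier(it.get_group());
        for (size_t s = WG / 2; s > 0; s /= 2) {  // tree reduction in SLM
          if (lid < s) tile[lid] += tile[lid + s];
          sycl::group_barrier(it.get_group());
        }
        if (lid == 0) P[it.get_group(0)] = tile[0];  // one global store per group
      });
    });
  }
}
```

All intermediate traffic stays in SLM; only N loads and N/WG stores touch global memory.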
Host-Device Data Transfer
- Choose the best memory allocation types and buffer access modes to minimize data transfer between the device and host. For more information, see Memory.
- Reduce moving data back and forth between the host and device.
- Eliminate unnecessary buffer creation and memory allocation. For details, see Avoid Declaring Buffers in a Loop.
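To illustrate the buffer-hoisting tip, the sketch below creates the buffer once outside an iteration loop. Declaring it inside the loop instead would force a host/device synchronization and copy on every iteration. Names and sizes are illustrative; compile with `icpx -fsycl`:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 4096;
  constexpr int steps = 100;
  std::vector<float> data(n, 0.0f);
  sycl::queue q{sycl::gpu_selector_v};
  {
    sycl::buffer buf{data};  // created ONCE, before the loop
    for (int iter = 0; iter < steps; ++iter) {
      q.submit([&](sycl::handler& h) {
        sycl::accessor acc{buf, h};  // read-write accessor, reused each step
        h.parallel_for(n, [=](sycl::id<1> i) { acc[i] += 1.0f; });
      });
    }
  } // data is copied back to the host once, when buf is destroyed
}
```

The runtime keeps the data resident on the device across all iterations, so the only host/device transfers are the initial copy in and the final copy out.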
Host-Device Concurrency
- Minimize host-device synchronizations to maximize parallel execution between host and device. For more information, see Asynchronous and Overlapping Data Transfers Between the Host and Device.
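A minimal sketch of host-device overlap, using unified shared memory and a SYCL event so the host synchronizes only when the result is needed (the work sizes and the host-side task are placeholders; compile with `icpx -fsycl`):

```cpp
#include <sycl/sycl.hpp>

void do_independent_host_work() { /* placeholder for host-side tasks */ }

int main() {
  constexpr size_t n = 1 << 20;
  sycl::queue q{sycl::gpu_selector_v};  // submits are asynchronous by default
  float* data = sycl::malloc_shared<float>(n, q);
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

  // Launch the kernel and keep its event instead of blocking immediately.
  auto e = q.parallel_for(n, [=](sycl::id<1> i) { data[i] *= 2.0f; });

  do_independent_host_work();  // runs on the CPU while the kernel executes

  e.wait();  // synchronize only at the point the result is actually needed
  sycl::free(data, q);
}
```

Calling `q.wait()` right after every submit would serialize host and device; deferring the wait lets the two proceed in parallel.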
Step 5: Evaluate Offload Efficiency with Intel Advisor
Once you have modified your application, return to Intel Advisor to help you measure the actual performance of offloaded code using the GPU Roofline Insights analysis. Intel Advisor uses benchmarks and hardware metric profiling to measure GPU kernel performance. It points out limitations and identifies areas of your code where further optimization will have the most payoff.
Run GPU Roofline Insights and Revise Offload Code
Evaluate GPU code to see how close the performance is to hardware maximums:
- Set up your environment to analyze GPU kernels.
- Run Roofline Analysis.
- Review results to evaluate throughput based on hardware models.
- If bottlenecks are identified, return to Step 4: Offload and Optimize, and then rewrite the code to address the issues.
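The steps above can be sketched on the command line as follows, again assuming a Linux shell and a binary named `./myapp`; exact flags may vary by Intel Advisor version:

```shell
source /opt/intel/oneapi/setvars.sh

# Collect GPU Roofline data (runs the app twice: survey + trip counts/FLOP).
advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myapp

# Export an interactive GPU roofline chart for review.
advisor --report=roofline --gpu --project-dir=./advi_results \
        --report-output=./roofline.html
```

Kernels plotted far below the memory or compute roofs are the candidates to take back to Step 4.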
Step 6: Review Overall Application Performance with Intel® VTune™ Profiler
After optimizing your GPU offload code, use Intel VTune Profiler to optimize overall application performance on all devices. Intel VTune Profiler offers helpful optimization guidance within the analysis results.
Tip Intel VTune Profiler also offers a graphical user interface for creating projects and running an analysis.
Create a Baseline Snapshot of Application Performance
Use Performance Snapshot to create an application performance baseline and identify focus areas for further analysis.
- Set up your system for GPU analysis.
- Launch the Intel VTune Profiler command-line interface.
- Run the Performance Snapshot analysis.
- View the results.
- On Intel Tiber Developer Cloud, view the Intel VTune Profiler summary report.
- On Intel Tiber Developer Cloud with Intel VTune Profiler installed locally: Copy the results to your local system, create a project, and import it into Intel VTune Profiler.
- Intel oneAPI Base Toolkit: View results in Intel VTune Profiler.
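The snapshot steps above can be sketched as follows (Linux shell and a binary named `./myapp` assumed; the result directory name is arbitrary):

```shell
source /opt/intel/oneapi/setvars.sh

# Collect a Performance Snapshot as the baseline.
vtune -collect performance-snapshot -result-dir ./vtune_ps -- ./myapp

# Print the summary, including suggested follow-up analyses.
vtune -report summary -result-dir ./vtune_ps
```

The summary flags whether the run looks CPU-bound or GPU-bound, which tells you which analysis to run next.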
Assess Application for CPU-bound or GPU-bound Issues
Start optimizing for the CPU by reviewing how much time is spent transferring data between the host and the device. Next, further optimize for the GPU by identifying areas of inefficient GPU usage.
- Run the GPU Offload analysis.
- Optimize CPU performance in your application.
- Run the GPU Compute/Media Hotspots analysis.
- Return to Step 4: Offload and Optimize to further optimize GPU performance in your application.
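The two GPU analyses above can be sketched as follows (same assumptions as the snapshot step: Linux shell, binary `./myapp`, arbitrary result directories):

```shell
# Host/device interaction: offload schedule, data transfer time.
vtune -collect gpu-offload -result-dir ./vtune_go -- ./myapp

# Kernel-level detail: execution-unit usage, memory stalls, occupancy.
vtune -collect gpu-hotspots -result-dir ./vtune_gh -- ./myapp

vtune -report summary -result-dir ./vtune_gh
```

Run `gpu-offload` first to find where time goes between host and device, then `gpu-hotspots` to drill into the kernels it identifies.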
Resources
Profile a SYCL Application Running on a GPU
Optimize Applications for Intel GPUs with Intel VTune Profiler