Optimize Your GPU Application with the Intel® oneAPI Base Toolkit
What You Will Learn
Use this step-by-step guide to learn graphics processing unit (GPU) optimization with Intel’s latest discrete and integrated GPUs and Intel® oneAPI software.
- The primary path focuses on SYCL*, a standard based on ISO C++ that incorporates Khronos SYCL and community extensions to simplify data-parallel programming.
- Alternatively, follow a path tailored to your needs using the tips and resources referenced in each section.
Compare the benefits of CPUs, GPUs, and FPGAs for different oneAPI compute workloads.
Who This Is For
Software developers who are interested in accelerating their code using Intel's latest integrated or discrete GPUs and SYCL standards for cross-architecture code reuse. Start from your own C++ or CUDA* code or use one of Intel’s many sample applications.
You will need access to an Intel® GPU and the Intel® oneAPI Base Toolkit software. You can use your local development system or the Intel® Tiber™ Developer Cloud, a development sandbox that gives you access to Intel GPUs and oneAPI software tools. To help you choose, review Step 1: Choose Your GPU Hardware Access.
The Workflow
Step 1: Choose Your GPU Hardware Access.
Step 2: Choose Your Sample Code.
Step 3: Assess Code for Offload Opportunities with Intel® Advisor.
Step 4: Offload and Optimize Code Using Intel® Compilers and Libraries.
Step 5: Evaluate Offload Efficiency with Intel Advisor.
Step 6: Review Overall Application Performance with Intel® VTune™ Profiler.
Step 1: Choose Your GPU Hardware Access
Whether you have the latest Intel® architecture or need cloud resources, your optimization strategy starts with hardware. To make the most of your GPU offload, you need to understand which hardware resources you are optimizing for.
Hardware and Software Options
- Use Intel hardware and software in the Intel Tiber Developer Cloud.
- Use your own graphics processor from Intel and install the free Intel oneAPI Base Toolkit.
Intel® Tiber™ Developer Cloud

Attend training or a hackathon to get access to the Intel Tiber Developer Cloud. See the full list of upcoming events and register to improve your skills: Events Calendar.

- Hardware: The environment enables you to optimize code on real Intel GPU hardware, such as the new Intel® Iris® Xᵉ MAX GPU. Intel® Xeon® Scalable processors and Intel® FPGAs are also available.
- Software: Intel software is preinstalled in the environment and ready to use, including the Intel oneAPI Base Toolkit and add-on toolkits for HPC, AI, IoT, and more.

Graphics Processor from Intel

Use your own development environment with graphics hardware from Intel and the free Intel oneAPI Base Toolkit.

- Hardware: Take advantage of Intel’s first discrete GPU with Intel® Iris® Xᵉ MAX graphics to explore offload and optimization strategies on the latest Intel hardware. Many Intel® platforms include integrated GPUs that can be used for basic GPU offload proofs of concept and pathfinding.
- Software: The Intel oneAPI Base Toolkit provides the core set of tools and libraries for developing high-performance applications across diverse architectures.

| | Intel Tiber Developer Cloud | Graphics Processor from Intel |
|---|---|---|
| Operating System | Host: Windows*, Linux*, macOS*; remote environment: Linux | Linux |
| Software | Intel oneAPI Base Toolkit (preinstalled) | Intel oneAPI Base Toolkit |
| GPU | GPUs available: Intel Iris Xᵉ MAX GPU nodes | Supported Intel GPUs |
| Language | SYCL | SYCL |
| Interface | Command-line interface (CLI) | Command-line interface (CLI) |
Step 2: Choose Your Sample Code
You can start from an Intel sample, your existing CUDA source code, or your own C++ application.
Use Intel oneAPI Sample Code
The Intel oneAPI Base Toolkit includes a large collection of samples that demonstrate a range of methods you can use for parallelism on the GPU using SYCL. Get started with SYCL and GPU optimization using the guided ISO3DFD sample. This compute-intensive sample performs a simulation of acoustic isotropic wave propagation in a 3D medium.
Follow the instructions in the sample README file to learn how to use SYCL to adapt CPU-based code to offload resource-intensive calculations to a GPU. Work through several sample optimization iterations using Intel Advisor. Download the sample from GitHub* to Intel Tiber Developer Cloud or your development environment.
Guided ISO3DFD GPU Optimization
View All oneAPI Samples on GitHub
Resources
- Use the command line to browse and download samples: Guide | Video
- Explore samples using Eclipse*: Guide | Video
- Use the Microsoft Visual Studio Code* extension to browse for samples: Guide
Use Your Own Code
Migrate Existing CUDA* Code
Migrate your existing CUDA code to a multiplatform program in SYCL. The Intel® DPC++ Compatibility Tool ports both CUDA language kernels and library API calls, migrating 80% to 90% of CUDA code automatically to architecture and vendor-portable code. Inline comments help you finish writing and tuning your code. Sample CUDA projects are also available to help you with the entire porting process.
- Migrate your CUDA code with the Intel DPC++ Compatibility Tool.
- Proceed to Step 3 to build your application and use Offload Modeling to evaluate your code for further offload opportunities.
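As a sketch, a typical migration run with the Intel DPC++ Compatibility Tool (`dpct`) looks like the following. The project layout, file names, and install path are hypothetical, and exact options may vary by tool version:

```shell
# Set up the oneAPI environment (default Linux install location assumed).
source /opt/intel/oneapi/setvars.sh

# Migrate CUDA sources under ./src into SYCL sources under ./sycl_src.
dpct --in-root=./src --out-root=./sycl_src ./src/vector_add.cu

# Review the generated .dp.cpp files, resolve any DPCT-prefixed warning
# comments left inline by the tool, then build with the SYCL compiler.
icpx -fsycl ./sycl_src/vector_add.dp.cpp -o vector_add
```

The inline DPCT comments mark places where the tool could not migrate code automatically; finishing those by hand is the remaining 10% to 20% of the port.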
Optimize Your Own C++ Projects
To use your own C++ code, set up your development environment and continue with this workflow. You can apply these optimization techniques directly to your existing projects.
- Copy your C++ application to the Intel Tiber Developer Cloud
or
- Use your development environment
Next, proceed to Step 3 to build your application, and then use Offload Modeling to evaluate your C++ code for further offload opportunities.
Step 3: Assess Code for Offload Opportunities with Intel® Advisor
Intel Advisor analyzes your code and helps you identify the best opportunities for GPU offloading. The Offload Modeling feature provides performance speedup projections, estimates offload overhead, and pinpoints performance bottlenecks. Offload Modeling enables you to improve ROI by modeling different hardware solutions to maximize your performance.
Run Offload Modeling Analysis
Intel Advisor measures the data movement in your functions, the memory access patterns, and the amount of computation to project how code will perform on Intel GPUs. The code regions with the highest potential benefit should be your first targets for offloading.
Run Offload Modeling following these steps:
- Build your sample application using appropriate environment variables. (This is required for SYCL, OpenMP*, and OpenCL™ applications.)
- Run the Offload Modeling Analysis.
- Review the results.
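The steps above can be sketched on the command line as follows, assuming a Linux shell, an application binary named `./myapp`, and a project directory of your choosing; exact flags may vary by Intel Advisor version:

```shell
# Set up the oneAPI environment so the advisor CLI is on PATH.
source /opt/intel/oneapi/setvars.sh

# Build the application with debug info so Advisor can map results to source.
icpx -fsycl -g -O2 myapp.cpp -o myapp

# Collect Offload Modeling data; results land in ./advi_results.
advisor --collect=offload --project-dir=./advi_results -- ./myapp
```

The collection produces an interactive HTML report under the project directory that ranks code regions by projected offload speedup.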
Tip Intel Advisor also offers a graphical user interface for creating projects and running analyses.
Step 4: Offload and Optimize with Intel Compilers and Libraries
Select the best optimization strategy to modify your code based on your application needs, advice from Intel Advisor, and available hardware. Documentation, samples, and training help you make design decisions to maximize performance.
Implement GPU Offload
Start by offloading the recommended code to your GPU device based on the results from Intel Advisor. Next, use a combination of techniques to develop your parallelism strategy.
Basic SYCL* Framework
To write a SYCL application:
- Select the device.
- Declare the device queue.
- Declare buffers.
- Submit the job.
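The four steps above can be sketched as a minimal SYCL 2020 program. This is an illustrative vector addition (the sizes and names are made up), compiled with the Intel oneAPI DPC++/C++ Compiler via `icpx -fsycl`:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  // 1. Select the device and 2. declare the device queue.
  sycl::queue q{sycl::gpu_selector_v};
  {
    // 3. Declare buffers that wrap the host data.
    sycl::buffer bufA{a}, bufB{b}, bufC{c};

    // 4. Submit the job: accessors declare how the kernel uses each buffer.
    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bufA, h, sycl::read_only};
      sycl::accessor B{bufB, h, sycl::read_only};
      sycl::accessor C{bufC, h, sycl::write_only, sycl::no_init};
      h.parallel_for(n, [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  } // Buffer destruction synchronizes and copies results back to the host.
}
```

The buffer/accessor model lets the runtime schedule data movement; an alternative is unified shared memory (USM), shown later under host-device concurrency.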
Develop Parallelism Strategy
oneAPI recommends a combination of these techniques to develop your parallelism strategy:
- Intel®-Optimized Libraries: Intel oneAPI Programming Guide: oneAPI Library Overview
- Intel Compilers and Optimizations: Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference
- Parallel Programming Language or API: Intel oneAPI Programming Guide: DPC++
Optimize GPU Offload
Your GPU optimization strategy may vary based on your application and hardware. Review these categories for tips and instructions. Complete instructions are available in the Optimization Guide.
Occupancy
- Tune the global and local size to have enough threads to keep the GPU busy and hide latency. For more information, see SYCL Thread Hierarchy and Mapping.
- Run multiple kernels concurrently if they are independent and a single kernel cannot fully use all execution units. For more information, see Run Multiple Kernels on the Device at the Same Time.
- Minimize tail effects.
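For illustration, one common way to apply the first and third tips is to round the global size up to a multiple of the chosen work-group size and guard the padded tail inside the kernel. This is a sketch with made-up sizes, compiled with `icpx -fsycl`:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  const size_t n = 1000;  // deliberately not a multiple of the local size
  std::vector<float> in(n, 1.0f), out(n, 0.0f);
  sycl::queue q{sycl::gpu_selector_v};
  {
    sycl::buffer bin{in}, bout{out};
    const size_t local = 256;  // pick a multiple of the hardware sub-group width
    const size_t global = ((n + local - 1) / local) * local;  // round up
    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bin, h, sycl::read_only};
      sycl::accessor B{bout, h, sycl::write_only, sycl::no_init};
      h.parallel_for(sycl::nd_range<1>{global, local},
                     [=](sycl::nd_item<1> it) {
        size_t i = it.get_global_id(0);
        if (i < n) B[i] = A[i] * 2.0f;  // guard the padded tail items
      });
    });
  }
}
```

Rounding up keeps every work-group fully populated, so only the last group has idle work items instead of launching a ragged final group.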
Calculate GPU Occupancy
- Determine the occupancy of Intel GPUs using the Intel GPU Occupancy Calculator on GitHub
Device Kernel Code
- Avoid register spills.
- Adjust the sub-group size. For more information, see Subgroups.
- Use shared local memory to eliminate redundant global memory access. For details, see Shared Local Memory.
- Apply hierarchical atomic optimizations to reduce global atomic memory updates. For more information, see Data Types for Atomic Operations.
- Minimize synchronizations between work items and threads. For details, see Synchronization Among Threads in a Kernel.
- Minimize code divergence. For more information, see Removing Conditional Checks.
- Use directives and attributes to help the compiler to better optimize kernel code. For details, see Restrict Directive.
- Include optimized library functions. For more information, see Efficiently Implementing Fourier Correlation Using oneAPI Math Kernel Library.
- Consider advanced compiler optimization techniques.
Memory
- Tune memory access patterns to improve locality and cache use. For more information, see Subgroups.
- Optimize the payload of memory transactions.
- Take advantage of memory block loads and stores. For details, see Subgroups and Reduction.
- Use shared memory to reduce redundant global memory access. For more information, see Shared Local Memory.
- Avoid shared memory bank conflicts. For details, see Shared Local Memory.
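As a sketch of the shared-local-memory tips above, the following work-group reduction loads each input element from global memory once into SLM, then reduces within the fast local tile. The sizes are illustrative; compile with `icpx -fsycl`:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024, WG = 256;
  std::vector<float> in(N, 1.0f);
  std::vector<float> partial(N / WG, 0.0f);  // one partial sum per work-group
  sycl::queue q{sycl::gpu_selector_v};
  {
    sycl::buffer bin{in}, bpart{partial};
    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bin, h, sycl::read_only};
      sycl::accessor P{bpart, h, sycl::write_only, sycl::no_init};
      // One tile of shared local memory per work-group.
      sycl::local_accessor<float, 1> tile{sycl::range<1>{WG}, h};
      h.parallel_for(sycl::nd_range<1>{N, WG}, [=](sycl::nd_item<1> it) {
        size_t lid = it.get_local_id(0);
        tile[lid] = A[it.get_global_id(0)];   // one global load per work item
        sycl::group_barrier(it.get_group());
        for (size_t s = WG / 2; s > 0; s /= 2) {  // tree reduction in SLM
          if (lid < s) tile[lid] += tile[lid + s];
          sycl::group_barrier(it.get_group());
        }
        if (lid == 0) P[it.get_group(0)] = tile[0];  // one global store per group
      });
    });
  }
}
```

All intermediate traffic stays in SLM; only N loads and N/WG stores touch global memory.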
Host-Device Data Transfer
- Choose the best memory allocation types and buffer access modes to minimize data transfer between the device and host. For more information, see Memory.
- Reduce moving data back and forth between the host and device.
- Eliminate unnecessary buffer creation and memory allocation. For details, see Avoid Declaring Buffers in a Loop.
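To illustrate the buffer-hoisting tip, the sketch below creates the buffer once outside an iteration loop. Declaring it inside the loop instead would force a host/device synchronization and copy on every iteration. Names and sizes are illustrative; compile with `icpx -fsycl`:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 4096;
  constexpr int steps = 100;
  std::vector<float> data(n, 0.0f);
  sycl::queue q{sycl::gpu_selector_v};
  {
    sycl::buffer buf{data};  // created ONCE, before the loop
    for (int iter = 0; iter < steps; ++iter) {
      q.submit([&](sycl::handler& h) {
        sycl::accessor acc{buf, h};  // read-write accessor, reused each step
        h.parallel_for(n, [=](sycl::id<1> i) { acc[i] += 1.0f; });
      });
    }
  } // data is copied back to the host once, when buf is destroyed
}
```

The runtime keeps the data resident on the device across all iterations, so the only host/device transfers are the initial copy in and the final copy out.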
Host-Device Concurrency
- Minimize host-device synchronizations to maximize parallel execution between host and device. For more information, see Asynchronous and Overlapping Data Transfers Between the Host and Device.
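A minimal sketch of host-device overlap, using unified shared memory and a SYCL event so the host synchronizes only when the result is needed (the work sizes and the host-side task are placeholders; compile with `icpx -fsycl`):

```cpp
#include <sycl/sycl.hpp>

void do_independent_host_work() { /* placeholder for host-side tasks */ }

int main() {
  constexpr size_t n = 1 << 20;
  sycl::queue q{sycl::gpu_selector_v};  // submits are asynchronous by default
  float* data = sycl::malloc_shared<float>(n, q);
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

  // Launch the kernel and keep its event instead of blocking immediately.
  auto e = q.parallel_for(n, [=](sycl::id<1> i) { data[i] *= 2.0f; });

  do_independent_host_work();  // runs on the CPU while the kernel executes

  e.wait();  // synchronize only at the point the result is actually needed
  sycl::free(data, q);
}
```

Calling `q.wait()` right after every submit would serialize host and device; deferring the wait lets the two proceed in parallel.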
Step 5: Evaluate Offload Efficiency with Intel Advisor
Once you have modified your application, return to Intel Advisor to help you measure the actual performance of offloaded code using the GPU Roofline Insights analysis. Intel Advisor uses benchmarks and hardware metric profiling to measure GPU kernel performance. It points out limitations and identifies areas of your code where further optimization will have the most payoff.
Run GPU Roofline Insights and Revise Offload Code
Evaluate GPU code to see how close the performance is to hardware maximums:
- Set up your environment to analyze GPU kernels.
- Run Roofline Analysis.
- Review results to evaluate throughput based on hardware models.
- If bottlenecks are identified, return to Step 4: Offload and Optimize, and then rewrite the code to address the issues.
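The steps above can be sketched on the command line as follows, again assuming a Linux shell and a binary named `./myapp`; exact flags may vary by Intel Advisor version:

```shell
source /opt/intel/oneapi/setvars.sh

# Collect GPU Roofline data (runs the app twice: survey + trip counts/FLOP).
advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myapp

# Export an interactive GPU roofline chart for review.
advisor --report=roofline --gpu --project-dir=./advi_results \
        --report-output=./roofline.html
```

Kernels plotted far below the memory or compute roofs are the candidates to take back to Step 4.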
Step 6: Review Overall Application Performance with Intel® VTune™ Profiler
After optimizing your GPU offload code, use Intel VTune Profiler to optimize overall application performance on all devices. Intel VTune Profiler offers helpful optimization guidance within the analysis results.
Tip Intel VTune Profiler also offers a graphical user interface for creating projects and running an analysis.
Create a Baseline Snapshot of Application Performance
Use Performance Snapshot to create an application performance baseline and identify focus areas for further analysis.
- Set up your system for GPU analysis.
- Launch the Intel VTune Profiler command-line interface.
- Run the Performance Snapshot analysis.
- View the results.
- On Intel Tiber Developer Cloud, view the Intel VTune Profiler summary report.
- On Intel Tiber Developer Cloud with Intel VTune Profiler installed locally: Copy the results to your local system, create a project, and import it into Intel VTune Profiler.
- Intel oneAPI Base Toolkit: View results in Intel VTune Profiler.
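The snapshot steps above can be sketched as follows (Linux shell and a binary named `./myapp` assumed; the result directory name is arbitrary):

```shell
source /opt/intel/oneapi/setvars.sh

# Collect a Performance Snapshot as the baseline.
vtune -collect performance-snapshot -result-dir ./vtune_ps -- ./myapp

# Print the summary, including suggested follow-up analyses.
vtune -report summary -result-dir ./vtune_ps
```

The summary flags whether the run looks CPU-bound or GPU-bound, which tells you which analysis to run next.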
Assess Application for CPU-bound or GPU-bound Issues
Start optimizing for the CPU by reviewing how much time is spent transferring data between the host and the device. Next, further optimize for the GPU by identifying areas of inefficient GPU usage.
- Run the GPU Offload analysis.
- Optimize CPU performance in your application.
- Run the GPU Compute/Media Hotspots analysis.
- Return to Step 4: Offload and Optimize to further optimize GPU performance in your application.
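The two GPU analyses above can be sketched as follows (same assumptions as the snapshot step: Linux shell, binary `./myapp`, arbitrary result directories):

```shell
# Host/device interaction: offload schedule, data transfer time.
vtune -collect gpu-offload -result-dir ./vtune_go -- ./myapp

# Kernel-level detail: execution-unit usage, memory stalls, occupancy.
vtune -collect gpu-hotspots -result-dir ./vtune_gh -- ./myapp

vtune -report summary -result-dir ./vtune_gh
```

Run `gpu-offload` first to find where time goes between host and device, then `gpu-hotspots` to drill into the kernels it identifies.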
Resources
Profile a SYCL Application Running on a GPU
Optimize Applications for Intel GPUs with Intel VTune Profiler