Terminology
In this chapter, OpenMP and SYCL terminology is used interchangeably to describe the partitioning of iterations of an offloaded parallel loop.
As described in the “SYCL Thread Hierarchy and Mapping” chapter, the iterations of a parallel loop (execution range) offloaded onto the GPU are divided into work-groups, sub-groups, and work-items. The ND-range represents the total execution range, which is divided into work-groups of equal size. A work-group is a 1-, 2-, or 3-dimensional set of work-items. Each work-group can be divided into sub-groups. A sub-group represents a short range of consecutive work-items that are processed together as a SIMD vector.
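To make the hierarchy concrete, the following is a minimal SYCL sketch of an offloaded parallel loop. The sizes are illustrative assumptions: an ND-range of 1024 work-items partitioned into work-groups of 64, with the sub-group provided by the hardware.

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;

  // Total execution range (ND-range) of 1024 work-items,
  // partitioned into work-groups of 64 work-items each.
  // These sizes are illustrative only.
  sycl::nd_range<1> range{1024, 64};

  q.parallel_for(range, [=](sycl::nd_item<1> item) {
    size_t global_id = item.get_global_id(0);   // work-item index in the ND-range
    size_t group_id  = item.get_group(0);       // work-group index
    auto   sg        = item.get_sub_group();    // sub-group (SIMD chunk / warp analogue)
    size_t lane      = sg.get_local_id()[0];    // lane within the sub-group
    (void)global_id; (void)group_id; (void)lane;
  }).wait();

  return 0;
}
```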
The following table shows how SYCL concepts map to OpenMP and CUDA concepts.
| SYCL | OpenMP | CUDA |
|---|---|---|
| Work-item | OpenMP thread or SIMD lane | CUDA thread |
| Work-group | Team | Thread block |
| Work-group size | Team size | Thread block size |
| Number of work-groups | Number of teams | Number of thread blocks |
| Sub-group | SIMD chunk (simdlen = 8, 16, 32) | Warp (size = 32) |
| Maximum number of work-items per work-group | Thread limit | Maximum number of CUDA threads per thread block |
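For comparison, the following is a hedged sketch of the same partitioning expressed with OpenMP offload clauses. The clause values (32 teams, thread limit of 64, SIMD length of 16) are illustrative assumptions, not recommendations.

```cpp
#include <omp.h>

void vec_add(const float *a, const float *b, float *c, int n) {
  // Clauses correspond to the table above:
  //   num_teams(32)    -> number of work-groups / thread blocks
  //   thread_limit(64) -> maximum work-items per work-group
  //   simdlen(16)      -> sub-group / SIMD-chunk width
  #pragma omp target teams distribute parallel for simd \
      num_teams(32) thread_limit(64) simdlen(16) \
      map(to: a[0:n], b[0:n]) map(from: c[0:n])
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
}
```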