SYCL* Thread and Memory Hierarchy
Thread Hierarchy
The SYCL* execution model exposes an abstract view of GPU execution. The SYCL thread hierarchy consists of a 1-, 2-, or 3-dimensional grid of work-items. These work-items are grouped into equal-sized groups called work-groups. Work-items in a work-group are further divided into equal-sized vector groups called sub-groups.
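The following is a minimal sketch (not taken from this guide) of how the three levels of the hierarchy appear in a kernel launched with an explicit nd_range. The global size (1024), work-group size (64), and required sub-group size (16) are illustrative values only; supported sub-group sizes vary by device.

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;                      // default device selection
  constexpr size_t global = 1024;     // total work-items in the 1-D grid
  constexpr size_t local  = 64;       // work-items per work-group

  q.submit([&](sycl::handler &h) {
    h.parallel_for(
        sycl::nd_range<1>{global, local},
        [=](sycl::nd_item<1> item) [[sycl::reqd_sub_group_size(16)]] {
          // Position of this work-item at each level of the hierarchy.
          size_t gid = item.get_global_id(0);     // within the global grid
          size_t wg  = item.get_group(0);         // which work-group
          size_t lid = item.get_local_id(0);      // within the work-group
          sycl::sub_group sg = item.get_sub_group();
          size_t lane = sg.get_local_id()[0];     // lane within the sub-group
          (void)gid; (void)wg; (void)lid; (void)lane;
        });
  }).wait();
  return 0;
}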
To learn more about how this hierarchy works with a GPU, or with a CPU with Intel® UHD Graphics, see SYCL* Thread Mapping and GPU Occupancy in the oneAPI GPU Optimization Guide.
Memory Hierarchy
The General Purpose GPU (GPGPU) compute model consists of a host connected to one or more compute devices. Each compute device consists of many GPU Vector Engines (VEs), also known as Execution Units (EUs) or Xe Vector Engines (XVEs). The compute devices may also include caches, shared local memory (SLM), high-bandwidth memory (HBM), and so on. Applications are built as a combination of host software (written with the host framework) and kernels that the host submits to run on the VEs, with a predefined decoupling point between the two.
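As a minimal sketch (not taken from this guide), the host can query the memory-related properties of a compute device through standard SYCL 2020 device info descriptors; the mapping of max_compute_units to VEs below is an assumption and depends on how the runtime reports the device.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;                      // host selects a compute device
  sycl::device dev = q.get_device();

  std::cout << "Device: "
            << dev.get_info<sycl::info::device::name>() << "\n";
  // Roughly corresponds to the parallel compute units (e.g. VEs) the
  // runtime exposes; the exact meaning is implementation-defined.
  std::cout << "Max compute units: "
            << dev.get_info<sycl::info::device::max_compute_units>() << "\n";
  std::cout << "Global memory (bytes): "
            << dev.get_info<sycl::info::device::global_mem_size>() << "\n";
  std::cout << "Global memory cache (bytes): "
            << dev.get_info<sycl::info::device::global_mem_cache_size>() << "\n";
  std::cout << "Local memory / SLM per work-group (bytes): "
            << dev.get_info<sycl::info::device::local_mem_size>() << "\n";
  return 0;
}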
To learn more about memory hierarchy within the General Purpose GPU (GPGPU) compute model, see Execution Model Overview in the oneAPI GPU Optimization Guide.