Contiguous Memory Accesses

Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853

Date 7/13/2023

Version

Public

Contiguous Memory Accesses

The Intel® oneAPI DPC++/C++ Compiler attempts to dynamically coalesce accesses to adjacent memory locations to improve global memory efficiency. Coalescing is effective when consecutive work items access consecutive memory locations in a given load or store operation. The same holds in a single_task invocation when consecutive loop iterations access consecutive memory locations.

Consider the following code example:

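q.submit([&](handler &cgh) {
  // Two read-only inputs and one write-only output (prior contents discarded).
  accessor a(a_buf, cgh, read_only);
  accessor b(b_buf, cgh, read_only);
  accessor c(c_buf, cgh, write_only, no_init);
  cgh.parallel_for<class SimpleVadd>(N, [=](id<1> ID) {
    // Every accessor is indexed directly by the work-item global ID.
    c[ID] = a[ID] + b[ID];
  });
});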

The load operations from the accessors a and b use an index that is a direct function of the work-item global ID. Because the accessor index is based on the work-item global ID, the Intel® oneAPI DPC++/C++ Compiler can ensure contiguous load operations. These load operations retrieve the data sequentially from the input arrays and send the read data to the pipeline as required. Contiguous store operations then write the results that exit the computation pipeline to sequential locations in global memory.
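The index property described above can be checked on the host with a small sketch. This is plain C++, not SYCL, and it says nothing about how the compiler performs its analysis; the helper is_contiguous and the three index patterns are hypothetical names introduced only for illustration. It tests whether consecutive work-item IDs (or loop iterations) map to consecutive element indices.

```cpp
#include <cstddef>
#include <functional>

// Returns true if index(i + 1) == index(i) + 1 for every i in [0, n - 1),
// that is, consecutive IDs (or iterations) touch consecutive elements.
bool is_contiguous(const std::function<std::size_t(std::size_t)> &index,
                   std::size_t n) {
  for (std::size_t i = 0; i + 1 < n; ++i) {
    if (index(i + 1) != index(i) + 1) return false;
  }
  return true;
}

// Hypothetical index patterns, named for this example only.
std::size_t unit_stride(std::size_t i) { return i; }             // like a[ID]
std::size_t offset_unit_stride(std::size_t i) { return i + 7; }  // like a[ID + 7]
std::size_t stride_two(std::size_t i) { return 2 * i; }          // like a[2 * ID]
```

Under this check, unit_stride and offset_unit_stride are contiguous, while stride_two is not because it skips every other element.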

The following figure illustrates an example of the contiguous memory access optimization:

Figure: Contiguous Memory Access

Contiguous load and store operations improve memory access efficiency because they lead to increased access speeds and reduced hardware resource needs. Data travels into and out of the computational portion of the pipeline concurrently, which allows computation to overlap with memory accesses. Where possible, index accesses to arrays in global memory with work-item IDs to maximize memory bandwidth efficiency.
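A rough host-side model can suggest why contiguity helps bandwidth efficiency. The sketch below is plain C++ and a simplification, not a description of the actual memory system: the function bursts_touched is hypothetical, and the 64-byte burst size in the usage note is an assumed value. It counts how many distinct aligned bursts a sequence of accesses touches, assuming the element size evenly divides the burst size.

```cpp
#include <cstddef>
#include <set>

// Counts the distinct aligned bursts of burst_bytes touched when n accesses
// of elem_bytes each are issued at element index i * stride, i = 0..n-1.
// Assumes elem_bytes evenly divides burst_bytes, so no element straddles
// a burst boundary.
std::size_t bursts_touched(std::size_t n, std::size_t stride,
                           std::size_t elem_bytes, std::size_t burst_bytes) {
  std::set<std::size_t> bursts;
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t byte_addr = i * stride * elem_bytes;
    bursts.insert(byte_addr / burst_bytes);  // index of the enclosing burst
  }
  return bursts.size();
}
```

With 1024 four-byte elements and an assumed 64-byte burst, a stride of 1 touches 64 bursts, a stride of 2 touches 128, and a stride of 16 touches 1024, one burst per element. In this model, the contiguous pattern moves the same useful data with the fewest bursts.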
