Minimize the Memory Dependencies for Loop Pipelining

Developer Guide

Intel oneAPI DPC++/C++ Compiler Handbook for Intel FPGAs

ID 785441

Date 5/08/2024

Version

Public

The Intel® oneAPI DPC++/C++ Compiler ensures that memory accesses from the same thread respect the program order. When you compile an NDRange kernel, use barriers to synchronize memory accesses across threads in the same workgroup.

Loop dependencies might introduce bottlenecks for single work-item kernels because of the latency associated with the memory accesses. The Intel® oneAPI DPC++/C++ Compiler defers a memory operation until the memory operation it depends on completes, which can increase the loop initiation interval (II). The compiler reports these memory dependencies in the optimization report.
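As an illustration, the following minimal sketch shows a loop whose iterations depend on each other through memory. It is written as a plain C++ function that stands in for the body of a single work-item kernel. The function name `prefix_sum_in_place` and the use of `std::vector` are assumptions made for this example and do not come from the handbook.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the loop inside a single work-item kernel.
// Iteration i loads data[i - 1], the element that iteration i - 1 stored.
// This is a true read-after-write dependency carried through memory, so
// the load in one iteration cannot be issued before the previous store
// has completed.
void prefix_sum_in_place(std::vector<int>& data) {
  for (std::size_t i = 1; i < data.size(); ++i) {
    data[i] = data[i] + data[i - 1];
  }
}
```

A dependency of this kind is real and must be enforced. The guidance that follows concerns dependencies the compiler assumes only because it cannot prove their absence.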

To minimize the impact of memory dependencies for loop pipelining:

- Ensure that the Intel® oneAPI DPC++/C++ Compiler does not assume false dependencies. When the static memory dependence analysis fails to prove that a dependency does not exist, the compiler assumes that a dependency exists and modifies the kernel execution to enforce it. The impact of the dependency enforcement is lower if the memory system is stall-free.
  - Write-after-read operations with data dependency on a load-store unit can take just two clock cycles (II=2). Other stall-free scenarios can take up to seven clock cycles.
  - The Intel® oneAPI DPC++/C++ Compiler can fully resolve the read-after-write (control dependency) operation.
- Override the static memory dependence analysis by adding the line [[intel::ivdep]] before the loop in your kernel code if you are sure that it carries no dependencies. For more information, refer to ivdep Attribute.
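The following is a minimal sketch of where the `[[intel::ivdep]]` attribute goes, assuming a loop that updates memory through an index table. The function `scale_indexed`, its parameters, and the `FPGA_IVDEP` macro are hypothetical names introduced for this example. The macro applies the attribute only when `SYCL_LANGUAGE_VERSION` is defined (the macro the SYCL specification reserves for conforming compilers), so the same source also builds as ordinary C++.

```cpp
#include <cstddef>

// Apply the attribute only under a SYCL compiler; other compilers see a
// plain loop. FPGA_IVDEP is a hypothetical helper macro for this sketch.
#if defined(SYCL_LANGUAGE_VERSION)
#define FPGA_IVDEP [[intel::ivdep]]
#else
#define FPGA_IVDEP
#endif

// Hypothetical stand-in for the loop inside a single work-item kernel.
// Static analysis cannot see the contents of idx, so it has to assume
// that two iterations might access the same element of data.
// Precondition (assumed, guaranteed by the caller): idx holds no repeated
// values, so no iteration touches an element written by another one.
void scale_indexed(float* data, const int* idx, std::size_t n, float factor) {
  FPGA_IVDEP
  for (std::size_t i = 0; i < n; ++i) {
    data[idx[i]] = data[idx[i]] * factor;
  }
}
```

The attribute is a promise from the programmer, not a check. If `idx` could contain repeated values, the loop would carry a real dependency, and overriding the analysis might produce incorrect results in hardware.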
