Measure Kernel Performance

Developer Guide

Intel oneAPI FPGA Handbook

ID 785441

Date 2/07/2024


The Profiler instruments and connects performance counters in a daisy chain throughout the pipeline generated for the kernel program. The host then reads data collected by these counters. For example, in PCI Express® (PCIe®)-based systems, the host reads the Profiler data over the PCIe interface.

Consider the following SYCL example code:


// Vector Add Kernel
h.single_task<VectorAdd>([=]() {
  for (int i = 0; i < kSize; ++i) {
    r[i] = a[i] + b[i];
  }
});

Figure 1 shows the pipeline generated for this kernel with the performance counters connected in a daisy chain. On PCI Express® (PCIe®)-based systems, the host reads the counter data through the PCIe control register access (CRA) or control and status register (CSR) port.

Figure 1. Intel® FPGA Dynamic Profiler for DPC++: Performance Counters Instrumentation

Kernels that perform many pipe or memory accesses might stall frequently while they wait for memory transfers to complete. The dynamic profiler collects performance metrics such as stall, occupancy, idle, and bandwidth data at each instrumented memory and pipe access in the pipeline to help you identify the operations that create stalls.
