Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/13/2021
Public

5. Profiling Your Kernel to Identify Performance Bottlenecks

The Intel® FPGA dynamic profiler for OpenCL™ uses performance counters to collect kernel performance data while the design executes. You can view this data in the Intel® VTune Profiler.
Tip: For information about the Intel FPGA dynamic profiler for DPC++, refer to the Intel® FPGA Dynamic Profiler for DPC++ section in the Intel® oneAPI DPC++ FPGA Optimization Guide.

Consider the following OpenCL kernel program:

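/* Element-wise vector addition: each work-item handles one index. */
__kernel void add (__global int * a,
                   __global int * b,
                   __global int * c)
{
    /* Index of this work-item in dimension 0 of the NDRange. */
    int gid = get_global_id(0);
    /* Two global memory loads (a, b) and one global memory store (c). */
    c[gid] = a[gid]+b[gid];
}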

As shown in the figure below, the Profiler instruments the pipeline generated for the kernel program with performance counters and connects these counters in a daisy chain. The host then reads the data that the counters collect. For example, in PCI Express® (PCIe®)-based systems, the host reads the data through the PCIe control register access (CRA) or control and status register (CSR) port.

Figure 71. Intel® FPGA Dynamic Profiler for OpenCL™: Performance Counters Instrumentation
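To make the daisy-chain idea concrete, the following C sketch models it in software. It is only a toy illustration, not the hardware the Profiler generates and not an SDK API: the names counter_chain, chain_shift_out, and chain_read_all are hypothetical, the chain length of three assumes one counter for each of the two loads and the one store in the add kernel, and the shift-based readout is an assumption about how a chain with a single host-facing port might be drained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one counter per memory operation in the add kernel
   (load a[gid], load b[gid], store c[gid]). Not the actual hardware layout. */
#define CHAIN_LEN 3

typedef struct {
    /* counters[0] is assumed to sit closest to the host-facing port. */
    uint64_t counters[CHAIN_LEN];
} counter_chain;

/* Shift the chain one position toward the port: return the value at the
   head, move every other value one step closer, and zero-fill the tail. */
static uint64_t chain_shift_out(counter_chain *chain)
{
    assert(chain != NULL);
    uint64_t head = chain->counters[0];
    for (size_t i = 0; i + 1 < CHAIN_LEN; ++i) {
        chain->counters[i] = chain->counters[i + 1];
    }
    chain->counters[CHAIN_LEN - 1] = 0;
    return head;
}

/* Drain every counter through the single port, in chain order. */
static void chain_read_all(counter_chain *chain, uint64_t out[CHAIN_LEN])
{
    assert(chain != NULL && out != NULL);
    for (size_t i = 0; i < CHAIN_LEN; ++i) {
        out[i] = chain_shift_out(chain);
    }
}
```

In this model, calling chain_read_all on a populated chain returns the counter values in order from the port end to the far end and leaves the chain zeroed. The point of the design is that a single read port is enough to reach every counter, at the cost of reading them sequentially.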

Work-item execution stalls might occur at various stages of an Intel® FPGA SDK for OpenCL™ pipeline. Applications with a large number of memory accesses (load and store operations) might stall frequently while they wait for memory transfers to complete. The Profiler helps identify the load and store operations or channel accesses that cause the majority of stalls within a kernel pipeline.
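As a rough sketch of how per-access counter data can point to the dominant source of stalls, the C code below computes a stall percentage for each memory or channel access and picks the highest one. The access_counters record, its field names, and both functions are hypothetical and do not reflect the Profiler's data format; the sketch assumes each access reports a stalled-cycle count and a total-cycle count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-access record; field names are illustrative only. */
typedef struct {
    const char *name;        /* label such as "load a[gid]" */
    uint64_t stall_cycles;   /* cycles in which this access stalled the pipeline */
    uint64_t total_cycles;   /* cycles over which the counter was sampled */
} access_counters;

/* Share of sampled cycles spent stalled, as a percentage.
   Returns 0 when no cycles were sampled, to avoid dividing by zero. */
static double stall_percent(const access_counters *c)
{
    assert(c != NULL);
    if (c->total_cycles == 0) {
        return 0.0;
    }
    return 100.0 * (double)c->stall_cycles / (double)c->total_cycles;
}

/* Index of the access with the highest stall percentage.
   On ties, the earliest entry wins. */
static size_t worst_stalling_access(const access_counters *items, size_t count)
{
    assert(items != NULL && count > 0);
    size_t worst = 0;
    for (size_t i = 1; i < count; ++i) {
        if (stall_percent(&items[i]) > stall_percent(&items[worst])) {
            worst = i;
        }
    }
    return worst;
}
```

For the add kernel, such a table would hold three entries: the loads from a and b and the store to c. The entry that worst_stalling_access selects is the first candidate to examine when optimizing memory access.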