Developer Guide

Intel® oneAPI DPC++/C++ Compiler Handbook for FPGAs

ID 785441
Date 6/24/2024
Public

Prepinning Memory

When optimizing kernel memory accesses, consider how data is transferred from the host to the device. For designs in which the data transfer takes longer than the compute, the transfer time can become the bottleneck. On devices that support transfer rates greater than PCIe Gen3 x8, prepinning the host memory before the transfer allows the data to move at a higher bandwidth.
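As a rough illustration of the bottleneck check described above, the following C++ sketch estimates the transfer time for a buffer and compares it against a compute time you have measured. The function and constant names are hypothetical and are not part of any oneAPI API. The 12 GB/s value is only the approximate half-duplex figure this page quotes for prepinned transfers on a PCIe Gen3 x16 board, so substitute a rate measured on your own system.

```cpp
#include <cstddef>

// Assumption: approximate half-duplex rate for a prepinned transfer on a
// PCIe Gen3 x16 board, as quoted on this page. Replace with a measured value.
constexpr double kPrepinnedHalfDuplexGBps = 12.0;

// Estimated time in seconds to move `bytes` at `gigabytes_per_second`,
// using 1 GB = 1e9 bytes.
constexpr double EstimateTransferSeconds(std::size_t bytes,
                                         double gigabytes_per_second) {
  return static_cast<double>(bytes) / (gigabytes_per_second * 1e9);
}

// Returns true when the estimated transfer time exceeds the measured
// kernel compute time, that is, when the transfer is the likely bottleneck.
constexpr bool TransferIsBottleneck(std::size_t bytes,
                                    double gigabytes_per_second,
                                    double compute_seconds) {
  return EstimateTransferSeconds(bytes, gigabytes_per_second) > compute_seconds;
}
```

For instance, a 1.2 GB buffer at 12 GB/s takes about 0.1 s to transfer, so a kernel that finishes its compute in 0.05 s would be limited by the transfer rather than by the compute.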

For example, the following code snippet shows how to copy the prepinned memory to the device global memory when using a board with a PCIe Gen3 x16 transfer rate. With prepinning, the memory transfer rate reaches approximately 12 GB/s in half-duplex and 21 GB/s in full-duplex.

intel::fpga_selector device_selector;
auto device_queue = queue(device_selector);
int* data = malloc_host<int>(1024, device_queue);
… // initialize the data
int* data_device = malloc_device<int>(1024, device_queue);
// queue::copy() takes the source pointer first, then the destination.
device_queue.copy<int>(data, data_device, 1024);

RESTRICTION:
  • Most BSPs implement the Unified Shared Memory (USM) call malloc_host() using prepinned memory. Hence, prepinned memory is available only on devices that support USM host allocations.
  • SYCL USM host allocations are only supported by some BSPs. Check with your BSP vendor to see if they support SYCL USM host allocations.

    The OFS Intel® oneAPI Accelerator Support Package (ASP) supports USM. For details, refer to https://github.com/OFS/oneapi-asp/releases.

Pinned memory is a scarce resource on the system (limited by the physical RAM available on your system), so carefully consider which buffers you want to pin to avoid exceeding the system limit. In addition, pinning itself is an expensive operation, so for optimal performance, ensure that the creation of pinned buffers takes place outside the main compute loop.