Intel® FPGA SDK for OpenCL™ Standard Edition: Programming Guide

ID 683342
Date 4/22/2019
Public
Document Table of Contents

6.7. Allocating Shared Memory for OpenCL Kernels Targeting SoCs

Intel® recommends that OpenCL™ kernels that run on Intel® SoCs access shared memory instead of the FPGA DDR memory. FPGA DDR memory is accessible to kernels with very high bandwidths. However, read and write operations from the ARM® CPU to FPGA DDR memory are very slow because they do not use direct memory access (DMA). Reserve FPGA DDR memory only for passing temporary data between kernels or within a single kernel for testing purposes.
Note:
  • Mark the shared buffers between kernels as volatile to ensure that buffer modification by one kernel is visible to the other kernel.
  • To access shared memory, you only need to modify the host code. Modifications to the kernel code are unnecessary.
  • You cannot use the library function malloc or the operator new to allocate physically shared memory. Also, the CL_MEM_USE_HOST_PTR flag does not work with shared memory.

    In DDR memory, shared memory must be physically contiguous. The FPGA cannot consume virtually contiguous memory without a scatter-gather direct memory access (SG-DMA) controller core. The malloc function and the new operator are for accessing memory that is virtually contiguous.

  • CPU caching is disabled for the shared memory.
  • When you use shared memory, one copy of the data is used for both the host and the kernel. When this memory is used, OpenCL memory calls are done as zero-copy transfers for buffer reads, buffer writers, maps, and unmaps.
The ARM CPU and the FPGA can access the shared memory simultaneously. You do not need to include the clEnqueueReadBuffer and clEnqueueWriteBuffer calls in your host code to make data visible to either the FPGA or the CPU.
  • To allocate and access shared memory, structure your host code in a similar manner as the following example:
    cl_mem src = clCreateBuffer(…, CL_MEM_ALLOC_HOST_PTR, size, …);
    int *src_ptr  = (int*)clEnqueueMapBuffer  (…, src, size, …);
    *src_ptr = input_value; //host writes to ptr directly
    clSetKernelArg (…, src);
    clEnqueueNDRangeKernel(…);
    clFinish();
    printf (“Result = %d\n”, *dst_ptr); //result is available immediately
    clEnqueueUnmapMemObject(…, src, src_ptr, …);
    clReleaseMemObject(src); // actually frees physical memory
    
    You can include the CONFIG_CMA_SIZE_MBYTES kernel configuration option to control the maximum total amount of shared memory available for allocation. In practice, the total amount of allocated shared memory is smaller than the value of CONFIG_CMA_SIZE_MBYTES.
    Important:
    1. If your target board has multiple DDR memory banks, the clCreateBuffer(..., CL_MEM_READ_WRITE, ...) function allocates memory to the nonshared DDR memory banks. However, if the FPGA has access to a single DDR bank that is shared memory, then clCreateBuffer(..., CL_MEM_READ_WRITE, ...) allocates to shared memory, similar to using the CL_MEM_ALLOC_HOST_PTR flag.
    2. The shared memory that you request with the clCreateBuffer(..., CL_MEM_ALLOC_HOST_PTR, size, ...) function is allocated in the Linux OpenCL kernel driver, and it relies on the contiguous memory allocator (CMA) feature of the Linux kernel. For detailed information on enabling and configuring the CMA, refer to the Recompiling the Linux Kernel and the OpenCL Linux Kernel Driver section of the Intel® FPGA SDK for OpenCL™ Cyclone V SoC Development Kit Reference Platform Porting Guide.
  • To transfer data from shared hard processor system (HPS) DDR to FPGA DDR efficiently, include a kernel that performs the memcpy function, as shown below.
    __attribute__((num_simd_work_items(8)))
    mem_stream(__global uint * src, __global uint * dst)
    {
        size_t gid = get_global_id(0);
        dst[gid] = src[gid];
    }
    
    Attention: Allocate the src pointer in the HPS DDR as shared memory using the CL_MEM_ALLOC_HOST_PTR flag.
  • If the host allocates constant memory to shared HPS DDR system and then modifies it after kernel execution, the modifications might not take effect. As a result, subsequent kernel executions might use outdated data. To prevent kernel execution from using outdated constant memory, perform one of the following tasks:
    1. Do not modify constant memory after its initialization.
    2. Create multiple constant memory buffers if you require multiple __constant data sets.
    3. If available, allocate constant memory to the FPGA DDR on your accelerator board.