Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 7/13/2023
Public

Pipe and Atomic Fence

ATTENTION:

This topic assumes that you already have an understanding of the atomic_fence function described in the SYCL specification. If you are new to it, then before you proceed, read about the atomic_fence function in the Khronos* SYCL Specification.

When kernels run in parallel, you might want multiple kernels to access a shared memory collaboratively. SYCL* provides the atomic_fence function as a synchronization construct for reasoning about the order of memory instructions that access the shared memory. When paired with synchronization through an atomic object, the atomic_fence function controls the reordering of memory load and store operations, subject to the associated memory order and memory scope. Pipe read and write operations behave as if they are SYCL relaxed atomic load and store operations. When paired with atomic_fence functions to establish a synchronizes-with relationship, pipe operations can guarantee side-effect visibility in memory, as defined by the SYCL memory model. For additional information about the atomic_fence function, refer to the Khronos* SYCL Specification.

CAUTION:

The current atomic_fence function for FPGA uses an overly conservative implementation and is still preliminary.

  • The implementation guarantees only functional correctness, not maximum performance, because the atomic_fence function currently enforces more memory ordering than required. If you do not use the atomic_fence function with a correct memory_order parameter, your program might behave unexpectedly once a future release implements memory ordering precisely.
  • The implementation does not support the memory_scope::system constraint. The broadest scope supported for FPGA is the memory_scope::device constraint.

Example Code for Using the atomic_fence Function and Blocking Inter-Kernel Pipes

The following code sample shows how to use the atomic_fence function with a blocking inter-kernel pipe to synchronize a producer's stores to shared device memory with a consumer's loads from it:

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>  // provides ext::intel::pipe
using namespace sycl;
using my_pipe = ext::intel::pipe<class some_pipe, int>;
constexpr int READY = 1;
 
int produce_data(int data);
int consume_data(int data);
 
event Producer(queue& q, int* shared_ptr, size_t size) {
  return q.submit([&](handler& h) {
    h.single_task<class ProducerKernel>([=]() [[intel::kernel_args_restrict]] {
      // create a device pointer to explicitly inform the compiler the
      // pointer resides in the device's address space
      device_ptr<int> shared_ptr_d(shared_ptr);
 
      // produce data
      for (size_t i = 0; i < size; i++) {
        shared_ptr_d[i] = produce_data(i);
      }
      // use atomic_fence to ensure memory ordering
      atomic_fence(memory_order::seq_cst, memory_scope::device);
      // notify the consumer to start data processing
      my_pipe::write(READY);
    });
  });
}
 
event Consumer(queue& q, int* shared_ptr, size_t size, int* output_ptr) {
  return q.submit([&](handler& h) {
    h.single_task<class ConsumerKernel>([=]() [[intel::kernel_args_restrict]] {
      // create device pointers to explicitly inform the compiler that these
      // pointers reside in the device's address space
      device_ptr<int> shared_ptr_d(shared_ptr);
      device_ptr<int> out_ptr_d(output_ptr);
 
      // block on the pipe read until notified by the producer
      int ready = my_pipe::read();
 
      // use atomic_fence to ensure memory ordering
      atomic_fence(memory_order::seq_cst, memory_scope::device);
 
      // consume data and write to output memory address
      for (size_t i = 0; i < size; i++) {
        out_ptr_d[i] = consume_data(shared_ptr_d[i]);
      }
    });
  });
}

In the example above, the consumer loads data that the producer stores. To prevent the consumer from loading the shared device memory before the producer finishes storing to it, a blocking pipe synchronizes the two kernels. The consumer's pipe read does not return until it receives the READY value written by the producer. The atomic_fence functions in the producer and consumer prevent the shared memory reads and writes from being reordered with the pipe operations. They also form a release-acquire ordering, which ensures that by the time the consumer's pipe read returns, the producer's writes to the shared device memory are visible to the consumer.
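
The same fence-and-signal pattern can be tried on a host CPU with the standard C++ memory model, on which the SYCL memory model is based. The following is a minimal sketch, not FPGA code: it assumes std::thread in place of the two kernels, a relaxed std::atomic<int> in place of the pipe, and std::atomic_thread_fence in place of atomic_fence. The run_handoff function and the bodies of produce_data and consume_data are hypothetical and exist only for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the guide's produce_data and consume_data.
int produce_data(int i) { return i * 2; }
int consume_data(int v) { return v + 1; }

constexpr int READY = 1;

// Runs one producer/consumer handoff and returns the consumer's output.
std::vector<int> run_handoff(std::size_t size) {
  std::vector<int> shared(size);  // plays the role of the shared device memory
  std::vector<int> output(size);
  std::atomic<int> signal{0};     // plays the role of the pipe (relaxed ops only)

  std::thread producer([&] {
    for (std::size_t i = 0; i < size; i++) {
      shared[i] = produce_data(static_cast<int>(i));
    }
    // Keep the stores above from being reordered after the signal.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // Relaxed store, analogous to the pipe write.
    signal.store(READY, std::memory_order_relaxed);
  });

  std::thread consumer([&] {
    // Spin until the signal arrives; a blocking pipe read would wait here.
    while (signal.load(std::memory_order_relaxed) != READY) {
      std::this_thread::yield();
    }
    // Keep the loads below from being reordered before the signal.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    for (std::size_t i = 0; i < size; i++) {
      output[i] = consume_data(shared[i]);
    }
  });

  producer.join();
  consumer.join();
  return output;
}
```

Calling run_handoff(4) returns the four consumed values. The relaxed signal alone orders nothing around it; the two fences together with the signal establish the synchronizes-with relationship, so the consumer never reads an element of shared before the producer has stored it.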

NOTE:

The shared device memory is created using a USM device allocation, which allows the two kernels to run in parallel even though both access the shared device memory simultaneously.