Visible to Intel only — GUID: GUID-79248497-CA18-4DAA-8792-EB9DDAF181D5
Visible to Intel only — GUID: GUID-79248497-CA18-4DAA-8792-EB9DDAF181D5
Local Barriers vs Global Atomics
Atomics allow multiple work-items in the kernel to work on shared resources. Barriers allow synchronization among the work-items in a work-group. It is possible to achieve the functionality of global atomics through judicious use of kernel launches and local barriers. Depending on the architecture and the amount of data involved, one or the other can have better performance.
In the following example, we try to sum a relatively small number of elements in a vector. This task is can be achieved in different ways. The first kernel shown below does this using only one work-item which walks through all elements of the vector and sums them up.
q.submit([&](auto &h) {
sycl::accessor buf_acc(buf, h, sycl::read_only);
sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
h.parallel_for(data_size, [=](auto index) {
int glob_id = index[0];
if (glob_id == 0) {
int sum = 0;
for (size_t i = 0; i < N; i++)
sum += buf_acc[i];
sum_acc[0] = sum;
}
});
});
In the kernel shown below, the same problem is solved using global atomics, where every work-item updates a global variable with the value it needs to accumulate. Although there is a lot of parallelism here, the contention on the global variable is quite high and in most cases its performance will not be very good.
q.submit([&](auto &h) {
sycl::accessor buf_acc(buf, h, sycl::read_only);
sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
h.parallel_for(data_size, [=](auto index) {
size_t glob_id = index[0];
auto v = sycl::atomic_ref<int, sycl::memory_order::relaxed,
sycl::memory_scope::device,
sycl::access::address_space::global_space>(
sum_acc[0]);
v.fetch_add(buf_acc[glob_id]);
});
});
In the following kernel, every work-item is responsible for accumulating multiple elements of the vector. This accumulation is done in parallel and then updated into an array that is shared among all work-items of the work-group. At this point all work-items of the work-group do a tree reduction using barriers to synchronize among themselves to reduce intermediate results in shared memory to the final result. This kernel explicitly created exactly one work-group and distributes the responsibility of all elements in the vector to the work-items in the work-group. Although it is not using the full capability of the machine in terms of the number of threads, sometimes this amount of parallelism is enough for small problem sizes.
Timer timer;
q.submit([&](auto &h) {
sycl::accessor buf_acc(buf, h, sycl::read_only);
sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
sycl::local_accessor<int, 1> scratch(work_group_size, h);
h.parallel_for(sycl::nd_range<1>{work_group_size, work_group_size},
[=](sycl::nd_item<1> item) {
size_t loc_id = item.get_local_id(0);
int sum = 0;
for (size_t i = loc_id; i < data_size;
i += num_work_items)
sum += buf_acc[i];
scratch[loc_id] = sum;
for (size_t i = work_group_size / 2; i > 0; i >>= 1) {
item.barrier(sycl::access::fence_space::local_space);
if (loc_id < i)
scratch[loc_id] += scratch[loc_id + i];
}
if (loc_id == 0)
sum_acc[0] = scratch[0];
});
});
The performance of these three kernels varies quite a bit among various platforms, and developers need to pick the technique that suits their application and hardware.