Data Parallelism in C++ using SYCL*
Open, multivendor, multiarchitecture support for productive data-parallel programming in C++ is provided by standard C++ with support for SYCL. SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that lets code for heterogeneous processors be written in standard ISO C++, with the host and kernel code for an application contained in the same source file. The DPC++ open source project is adding SYCL support to the LLVM C++ compiler.
Simple Sample Code
The best way to introduce SYCL is through an example. Since SYCL is based on modern C++, this example uses several features that have been added to C++ in recent years, such as lambda functions and uniform initialization. Even if developers are not familiar with these features, their semantics will become clear from the context of the example. Once developers gain some experience with SYCL, these newer C++ features become second nature.
The following application sets each element of an array to the value of its index, so that a[0] = 0, a[1] = 1, etc.
#include <CL/sycl.hpp>
#include <iostream>

constexpr int num=16;
using namespace sycl;

int main() {
  auto r = range{num};
  buffer<int> a{r};

  queue{}.submit([&](handler& h) {
    accessor out{a, h};
    h.parallel_for(r, [=](item<1> idx) {
      out[idx] = idx;
    });
  });

  host_accessor result{a};
  for (int i=0; i<num; ++i)
    std::cout << result[i] << "\n";
}
The first thing to notice is that there is just one source file: both the host code and the offloaded accelerator code are combined in a single source file. The second thing to notice is that the syntax is standard C++: there aren’t any new keywords or pragmas used to express the parallelism. Instead, the parallelism is expressed through C++ classes. For example, the buffer class on line 9 represents data that will be offloaded to the device, and the queue class on line 11 represents a connection from the host to the accelerator.
The logic of the example works as follows. Lines 8 and 9 create a buffer of 16 int elements, which have no initial value. This buffer acts like an array. Line 11 constructs a queue, which is a connection to an accelerator device. This simple example asks the SYCL runtime to choose a default accelerator device, but a more robust application would probably examine the topology of the system and choose a particular accelerator. Once the queue is created, the example calls the submit() member function to submit work to the accelerator. The parameter to this submit() function is a lambda function, which executes immediately on the host. The lambda function does two things. First, it creates an accessor on line 12, which can write elements in the buffer. Second, it calls the parallel_for() function on line 13 to execute code on the accelerator.
The call to parallel_for() takes two parameters. One parameter is a lambda function, and the other is the range object “r” that represents the number of elements in the buffer. SYCL arranges for this lambda to be called on the accelerator once for each index in that range, i.e. once for each element of the buffer. The lambda simply assigns a value to the buffer element by using the out accessor that was created on line 12. In this simple example, there are no dependencies between the invocations of the lambda, so the program is free to execute them in parallel in whatever way is most efficient for this accelerator.
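As a rough mental model, and not a description of how any SYCL runtime is implemented, the effect of this parallel_for() call can be imitated on the host in plain C++. The sketch below uses no SYCL at all; the function name fill_with_index and the reversed iteration order are illustrative assumptions. It walks the indices backwards to underline that the result does not depend on the order of the invocations.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only model of the kernel in the sample above.
// No SYCL is involved; this only illustrates "one call per index".
std::vector<int> fill_with_index(std::size_t n) {
    std::vector<int> a(n);  // stands in for buffer<int> a{r}

    // Stands in for the kernel lambda: each call writes only its own element.
    auto kernel = [&a](std::size_t idx) { a[idx] = static_cast<int>(idx); };

    // Stands in for h.parallel_for(r, kernel). The order is deliberately
    // reversed: because the calls are independent, any order (or a fully
    // parallel schedule) produces the same contents.
    for (std::size_t i = n; i-- > 0;) {
        kernel(i);
    }
    return a;
}
```

Calling fill_with_index(16) produces the values 0 through 15, the same contents the sample writes into its buffer.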
After calling parallel_for(), the host part of the code continues running without waiting for the work to complete on the accelerator. However, the next thing the host does is to create a host_accessor on line 18, which reads the elements of the buffer. The SYCL runtime knows this buffer is written by the accelerator, so the host_accessor constructor (line 18) blocks until the work submitted by the parallel_for() is complete. Once the accelerator work completes, the host code continues past line 18, and it uses the result accessor to read values from the buffer.
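The pattern of a non-blocking submit followed by a blocking read can be compared, loosely, to launching a task with std::async and later waiting on its std::future. The sketch below is plain C++ rather than SYCL, and the name submit_then_read is hypothetical. It mirrors only the ordering guarantee described above, not the way a SYCL runtime actually tracks dependencies between buffers and accessors.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical host-only analogy for "submit, then read through a
// host_accessor". No SYCL is involved.
std::vector<int> submit_then_read(std::size_t n) {
    std::vector<int> a(n);  // stands in for the buffer

    // Like submit(): start the work on another thread and return at once.
    std::future<void> done = std::async(std::launch::async, [&a] {
        for (std::size_t i = 0; i < a.size(); ++i) {
            a[i] = static_cast<int>(i);
        }
    });

    // The host could do unrelated work here, as long as it does not touch a.

    // Like constructing the host_accessor: wait until the writer has finished
    // before the host reads the data.
    done.get();
    return a;
}
```

In SYCL the wait is implicit in the host_accessor constructor; in this analogy it has to be written out as done.get(). On some older Linux toolchains, code that uses std::async must also be linked with the threading library.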
Additional Resources
This introduction to SYCL is not meant to be a complete tutorial. Rather, it just gives you a flavor of the language. There are many more features to learn, including features that allow you to take advantage of common accelerator hardware such as local memory, barriers, and SIMD. There are also features that let you submit work to many accelerator devices at once, allowing a single application to run work in parallel on many devices simultaneously.
The following resources are useful for learning and mastering SYCL using a DPC++ compiler:
Explore SYCL with Samples from Intel provides an overview and links to simple sample applications available from GitHub*.
The DPC++ Foundations Code Sample Walk-Through is a detailed examination of the Vector Add sample code, the DPC++ equivalent to a basic Hello World application.
The oneapi.com site includes a Language Guide and API Reference with descriptions of classes and their interfaces. It also provides details on the four programming models - platform model, execution model, memory model, and kernel programming model.
The DPC++ Essentials training course is a guided learning path for SYCL using Jupyter* Notebooks on Intel® DevCloud.
Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL is a comprehensive book that introduces and explains key programming concepts and language details about SYCL.