Using Multiple Heterogeneous Devices
Most accelerators reside in a server that itself contains a significant amount of compute resources. For instance, a typical server can have up to eight sockets, with each socket containing more than 50 cores. SYCL provides the ability to treat the CPUs and the accelerators uniformly and to distribute work among them. It is the programmer's responsibility to ensure a balanced distribution of work among the heterogeneous compute resources in the platform.
Overlapping Compute on Various Devices
SYCL provides access to different kinds of devices through the abstraction of device selectors. A queue can be created for each device, and kernels can be submitted to these queues for execution. All kernel submissions in SYCL are non-blocking: once a kernel is submitted to a queue for execution, the host does not wait for it to finish unless waiting on the queue is explicitly requested. This allows the host to do some work itself, or to initiate work on other devices, while the kernel is executing on the accelerator.
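The following sketch (not part of the samples below) illustrates this non-blocking behavior. It assumes data is a USM device-accessible allocation, and do_host_work is a placeholder for independent host-side work.

#include <sycl/sycl.hpp>

void do_host_work() { /* placeholder for independent host-side work */ }

void overlap_with_host(sycl::queue &q, float *data, size_t n) {
  // The submission returns immediately; the kernel runs asynchronously.
  auto e = q.parallel_for(sycl::range<1>{n},
                          [=](sycl::id<1> i) { data[i] *= 2.0f; });

  // The host is free to do its own work, or to submit kernels to other
  // queues, while the kernel above executes on the device.
  do_host_work();

  // Synchronize only when the kernel's result is actually needed.
  e.wait();
}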
The host CPU can itself be treated as an accelerator, and SYCL can submit kernels to it for execution. This is completely independent of, and orthogonal to, the work done by the host in creating and submitting the kernels. The kernels submitted to the CPU device are managed by the underlying operating system as another process and use the same OpenCL/Level Zero runtime mechanisms to exchange information with the host.
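As a minimal sketch, queues bound to the CPU and GPU devices can be created with the SYCL 2020 selector variables; the device names printed here depend on the platform.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // One queue per device: kernels submitted to q_cpu run on the CPU device,
  // kernels submitted to q_gpu run on the GPU device.
  sycl::queue q_cpu{sycl::cpu_selector_v};
  sycl::queue q_gpu{sycl::gpu_selector_v};

  std::cout << "CPU queue bound to: "
            << q_cpu.get_device().get_info<sycl::info::device::name>() << "\n";
  std::cout << "GPU queue bound to: "
            << q_gpu.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}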
The following example shows a simple vector add operation that works on a single GPU device.
size_t VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
                  IntArray &sum, int iter) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    auto e = q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add1 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
} // end VectorAdd1
In the following kernel, the input vector is split into two halves and the computation is done on two different accelerators (one CPU and one GPU) that can execute concurrently. Care must be taken to ensure that the kernels, in addition to being submitted, are actually launched on the devices in order to get this parallelism. The time at which a kernel is actually launched can be substantially later than when it was submitted by the host: the implementation decides when to launch the kernels based on heuristics intended to optimize metrics such as utilization, throughput, and latency. For instance, with the OpenCL backend, on certain platforms an explicit clFlush must be issued on the queue (as shown in the commented-out code near the end of VectorAdd2) to launch the kernels on the accelerators.
size_t VectorAdd2(sycl::queue &q1, sycl::queue &q2, const IntArray &a,
                  const IntArray &b, IntArray &sum, int iter) {
  sycl::range num_items{a.size() / 2};

  auto start = std::chrono::steady_clock::now();
  {
    // First half of the input goes to q1
    sycl::buffer a1_buf(a.data(), num_items);
    sycl::buffer b1_buf(b.data(), num_items);
    sycl::buffer sum1_buf(sum.data(), num_items);

    // Second half of the input goes to q2
    sycl::buffer a2_buf(a.data() + a.size() / 2, num_items);
    sycl::buffer b2_buf(b.data() + a.size() / 2, num_items);
    sycl::buffer sum2_buf(sum.data() + a.size() / 2, num_items);

    for (int i = 0; i < iter; i++) {
      q1.submit([&](auto &h) {
        // Input accessors
        sycl::accessor a_acc(a1_buf, h, sycl::read_only);
        sycl::accessor b_acc(b1_buf, h, sycl::read_only);
        // Output accessor
        sycl::accessor sum_acc(sum1_buf, h, sycl::write_only, sycl::no_init);

        h.parallel_for(num_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
      // do the work on host (q2 is expected to be the CPU queue)
      q2.submit([&](auto &h) {
        // Input accessors
        sycl::accessor a_acc(a2_buf, h, sycl::read_only);
        sycl::accessor b_acc(b2_buf, h, sycl::read_only);
        // Output accessor
        sycl::accessor sum_acc(sum2_buf, h, sycl::write_only, sycl::no_init);

        h.parallel_for(num_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
    }
    // On some platforms this explicit flush of queues is needed
    // to ensure the overlap in execution between the CPU and GPU
    // cl_command_queue cq = q1.get();
    // clFlush(cq);
    // cq = q2.get();
    // clFlush(cq);
  }
  q1.wait();
  q2.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add2 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
} // end VectorAdd2
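A hypothetical driver for this function (not part of the sample above) could look as follows; array_size and the IntArray alias are assumptions standing in for the definitions that appear elsewhere in the full sample.

#include <sycl/sycl.hpp>
#include <array>

constexpr size_t array_size = 1 << 20;         // assumed size, for illustration only
using IntArray = std::array<int, array_size>;  // assumed alias used by the sample

int main() {
  sycl::queue q_cpu{sycl::cpu_selector_v};     // CPU device
  sycl::queue q_gpu{sycl::gpu_selector_v};     // GPU device

  static IntArray a, b, sum;                   // static: keep large arrays off the stack
  a.fill(1);
  b.fill(2);

  // First half of the work goes to the GPU queue, second half to the CPU queue.
  VectorAdd2(q_gpu, q_cpu, a, b, sum, /*iter=*/16);
  return 0;
}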
Comparing the running times of the two versions shows that the application runs almost twice as fast as before, since it now has more hardware resources dedicated to solving the problem. To achieve good load balance, you will have to split the work in proportion to the capabilities of the accelerators, instead of distributing it evenly as was done in the example above.
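As a sketch of such a proportional split, the fraction below is a placeholder that would in practice be measured or tuned for the actual devices; SYCL itself does not prescribe how the work should be divided.

#include <cstddef>
#include <utility>

// Split a total index space between the GPU and the CPU according to an
// assumed fraction (e.g. the GPU handles 75% of the elements).
std::pair<std::size_t, std::size_t> split_work(std::size_t total,
                                               float gpu_fraction = 0.75f) {
  std::size_t gpu_count = static_cast<std::size_t>(total * gpu_fraction);
  std::size_t cpu_count = total - gpu_count;
  return {gpu_count, cpu_count};
}

In a VectorAdd2-style kernel, the first gpu_count elements would then populate the buffers submitted to the GPU queue and the remaining cpu_count elements the buffers submitted to the CPU queue.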