Avoid moving data back and forth between host and device
The cost of moving data between host and device is high, especially with discrete accelerators, so it is very important to avoid such transfers wherever possible. Sometimes, however, data computed by a kernel on the accelerator must be brought back to the host, operated on there, and then sent back to the device for further processing. In that situation we pay for a device-to-host transfer and then another host-to-device transfer.
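To make the round trip concrete, here is a minimal sketch of the pattern using SYCL 2020 unified shared memory with explicit memcpy calls. This is an illustration only; the function name and kernel bodies are placeholders, and the buffer-and-accessor examples below are what this section actually measures.

#include <sycl/sycl.hpp>
#include <vector>

// Sketch: produce data on the device, round-trip it through the host,
// then consume it on the device again. This is the pattern to avoid.
void round_trip(sycl::queue &q, size_t n) {
  int *d_data = sycl::malloc_device<int>(n, q);
  std::vector<int> h_data(n);

  // kernel1 produces data on the device
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
     d_data[i[0]] = static_cast<int>(i[0]);
   }).wait();

  // Device-to-host transfer, host-side processing, host-to-device transfer
  q.memcpy(h_data.data(), d_data, n * sizeof(int)).wait();
  for (auto &v : h_data)
    v = (v > 10) ? 1 : 0;
  q.memcpy(d_data, h_data.data(), n * sizeof(int)).wait();

  // kernel2 would consume d_data on the device here ...
  sycl::free(d_data, q);
}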
Consider the following example, where one kernel produces data through some operation (in this case, vector addition) into a new vector. This vector is then transformed by applying a function to each value, and the result is fed as input to another kernel for additional computation. This pattern is common in domains with iterative algorithms, where the output of one computation is the input to the next. A classic example is machine learning models, which are structured as layers of computation where the output of one layer is the input to the next.
double myFunc1(sycl::queue &q, AlignedVector<int> &a, AlignedVector<int> &b,
               AlignedVector<int> &c, AlignedVector<int> &d,
               AlignedVector<int> &res, int iter) {
  sycl::range num_items{a.size()};
  VectorAllocator<int> alloc;
  AlignedVector<int> sum(a.size(), alloc);

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer c_buf(c, props);
  sycl::buffer d_buf(d, props);
  sycl::buffer res_buf(res, props);
  sycl::buffer sum_buf(sum.data(), num_items, props);

  Timer timer;
  for (int i = 0; i < iter; i++) {
    // kernel1
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto id) { sum_acc[id] = a_acc[id] + b_acc[id]; });
    });
    // Clip the intermediate results on the host
    {
      sycl::host_accessor h_acc(sum_buf);
      for (size_t j = 0; j < a.size(); j++)
        if (h_acc[j] > 10)
          h_acc[j] = 1;
        else
          h_acc[j] = 0;
    }
    // kernel2
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor sum_acc(sum_buf, h, sycl::read_only);
      sycl::accessor c_acc(c_buf, h, sycl::read_only);
      sycl::accessor d_acc(d_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor res_acc(res_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items, [=](auto id) {
        res_acc[id] = sum_acc[id] * c_acc[id] + d_acc[id];
      });
    });
    q.wait();
  }
  double elapsed = timer.Elapsed() / iter;
  return elapsed;
} // end myFunc1
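Note what the host_accessor in the middle of the loop costs: constructing it blocks until kernel1 has finished and the contents of sum_buf have been copied back to the host, and once it goes out of scope the runtime must transfer the updated data to the device again before kernel2 can run. Every iteration therefore pays a device-to-host and a host-to-device transfer of the entire vector.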
Instead of bringing the data to the host, applying the function, and sending it back to the device for the second kernel, you can create a kernel that executes the function on the device itself. This avoids the round trip of data from device to host. The technique is shown in the example below, which is functionally the same as the previous code. A third kernel, kernel3, now operates on the intermediate data in sum_buf between kernel1 and kernel2.
double myFunc2(sycl::queue &q, AlignedVector<int> &a, AlignedVector<int> &b,
               AlignedVector<int> &c, AlignedVector<int> &d,
               AlignedVector<int> &res, int iter) {
  sycl::range num_items{a.size()};
  VectorAllocator<int> alloc;
  AlignedVector<int> sum(a.size(), alloc);

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer c_buf(c, props);
  sycl::buffer d_buf(d, props);
  sycl::buffer res_buf(res, props);
  sycl::buffer sum_buf(sum.data(), num_items, props);

  Timer timer;
  for (int i = 0; i < iter; i++) {
    // kernel1
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto id) { sum_acc[id] = a_acc[id] + b_acc[id]; });
    });
    // kernel3: clip the intermediate results on the device instead of the host
    q.submit([&](auto &h) {
      sycl::accessor sum_acc(sum_buf, h, sycl::read_write);

      h.parallel_for(num_items, [=](auto id) {
        if (sum_acc[id] > 10)
          sum_acc[id] = 1;
        else
          sum_acc[id] = 0;
      });
    });
    // kernel2
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor sum_acc(sum_buf, h, sycl::read_only);
      sycl::accessor c_acc(c_buf, h, sycl::read_only);
      sycl::accessor d_acc(d_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor res_acc(res_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items, [=](auto id) {
        res_acc[id] = sum_acc[id] * c_acc[id] + d_acc[id];
      });
    });
    q.wait();
  }
  double elapsed = timer.Elapsed() / iter;
  return elapsed;
} // end myFunc2
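Note that no host synchronization is needed inside the loop until the final q.wait(): the accessors declare how each kernel uses sum_buf, so the SYCL runtime orders kernel1, kernel3, and kernel2 automatically and keeps the intermediate data resident on the device throughout.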
There are other ways to optimize this example. For instance, the clipping operation in kernel3 can be merged into the computation of kernel1, as shown below. This is kernel fusion, and it has the added advantage of not launching a third kernel. The SYCL compiler cannot perform this kind of optimization. In some domains, such as machine learning, graph compilers operate on the ML models and fuse operations, which has the same effect.
double myFunc3(sycl::queue &q, AlignedVector<int> &a, AlignedVector<int> &b,
               AlignedVector<int> &c, AlignedVector<int> &d,
               AlignedVector<int> &res, int iter) {
  sycl::range num_items{a.size()};
  VectorAllocator<int> alloc;
  AlignedVector<int> sum(a.size(), alloc);

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer c_buf(c, props);
  sycl::buffer d_buf(d, props);
  sycl::buffer res_buf(res, props);
  sycl::buffer sum_buf(sum.data(), num_items, props);

  Timer timer;
  for (int i = 0; i < iter; i++) {
    // kernel1 with the clipping of kernel3 fused in
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items, [=](auto id) {
        int t = a_acc[id] + b_acc[id];
        if (t > 10)
          sum_acc[id] = 1;
        else
          sum_acc[id] = 0;
      });
    });
    // kernel2
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor sum_acc(sum_buf, h, sycl::read_only);
      sycl::accessor c_acc(c_buf, h, sycl::read_only);
      sycl::accessor d_acc(d_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor res_acc(res_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items, [=](auto id) {
        res_acc[id] = sum_acc[id] * c_acc[id] + d_acc[id];
      });
    });
    q.wait();
  }
  double elapsed = timer.Elapsed() / iter;
  return elapsed;
} // end myFunc3
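Besides avoiding a third kernel launch, fusing the clipping into kernel1 removes one full read and one full write of sum_buf from global memory per iteration, since each clipped value is now computed in a register before being stored.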
We can take this kernel fusion one step further and fuse kernel1 and kernel2, as shown in the code below. This gives very good performance: it avoids the intermediate sum_buf entirely, saving memory in addition to eliminating another kernel launch. Most of the performance benefit in this case comes from improved locality of memory references.
double myFunc4(sycl::queue &q, AlignedVector<int> &a, AlignedVector<int> &b,
               AlignedVector<int> &c, AlignedVector<int> &d,
               AlignedVector<int> &res, int iter) {
  sycl::range num_items{a.size()};

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer c_buf(c, props);
  sycl::buffer d_buf(d, props);
  sycl::buffer res_buf(res, props);

  Timer timer;
  for (int i = 0; i < iter; i++) {
    // Single fused kernel: no intermediate sum buffer is needed
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor c_acc(c_buf, h, sycl::read_only);
      sycl::accessor d_acc(d_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor res_acc(res_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items, [=](auto id) {
        int t = a_acc[id] + b_acc[id];
        // Clipped sum of 1 gives c + d; clipped sum of 0 gives d
        if (t > 10)
          res_acc[id] = c_acc[id] + d_acc[id];
        else
          res_acc[id] = d_acc[id];
      });
    });
    q.wait();
  }
  double elapsed = timer.Elapsed() / iter;
  return elapsed;
} // end myFunc4
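A minimal driver for comparing the four variants might look like the sketch below. It assumes the AlignedVector, VectorAllocator, and Timer helpers from the surrounding sample code (with Elapsed() taken to return seconds); the vector size, iteration count, and initialization values are arbitrary choices for illustration.

#include <sycl/sycl.hpp>
#include <iostream>

// AlignedVector, VectorAllocator, Timer, and myFunc1..myFunc4 are assumed
// to be defined as in the sample code above.

int main() {
  constexpr size_t N = 1 << 24;
  constexpr int iter = 16;

  sycl::queue q;
  VectorAllocator<int> alloc;
  AlignedVector<int> a(N, alloc), b(N, alloc), c(N, alloc), d(N, alloc),
      res(N, alloc);
  for (size_t i = 0; i < N; i++) {
    a[i] = static_cast<int>(i % 20); // values on both sides of the
    b[i] = static_cast<int>(i % 7);  // clipping threshold of 10
    c[i] = 2;
    d[i] = 3;
  }

  std::cout << "host round trip:     " << myFunc1(q, a, b, c, d, res, iter) << "\n";
  std::cout << "device-side kernel3: " << myFunc2(q, a, b, c, d, res, iter) << "\n";
  std::cout << "kernel3 fused:       " << myFunc3(q, a, b, c, d, res, iter) << "\n";
  std::cout << "fully fused:         " << myFunc4(q, a, b, c, d, res, iter) << "\n";
  return 0;
}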