Optimizing Memory Movement Between Host and Accelerator
Buffers can be created with properties that control how they are allocated. One such property is use_host_ptr, which informs the runtime that, if possible, the host memory should be used directly by the buffer instead of a copy. This avoids copying the contents of the buffer back and forth between host memory and buffer memory, potentially saving time during buffer creation and destruction.
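For illustration only, a minimal sketch of creating a buffer with this property might look like the following; the host vector and buffer names here are placeholders and not part of the example program shown later.
#include <sycl/sycl.hpp>
#include <vector>

void CreateHostBackedBuffer() {
  std::vector<int> host_data(1024, 0);  // placeholder host allocation
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
  // Ask the runtime to back the buffer with host_data's memory directly,
  // rather than allocating separate buffer storage and copying into it.
  sycl::buffer<int, 1> host_backed_buf(host_data.data(),
                                       sycl::range<1>(host_data.size()), props);
}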
In another case, when the GPU and CPU share memory, copies can be avoided by sharing pages. For page sharing to be possible, however, the allocated memory must satisfy certain requirements, such as being aligned on a page boundary. For discrete devices the benefit may not be realized, because any memory operation by the accelerator has to go across PCIe or some other interface that is slower than the accelerator's own memory.
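The examples below pass data in an AlignedVector<int> type that is not defined in this section. A minimal sketch of what such a type might look like, assuming a std::vector with an allocator that aligns every allocation to a 4 KiB page boundary, is shown here; the allocator name and the page size are assumptions for illustration.
#include <cstddef>
#include <new>
#include <vector>

// Assumed allocator: over-aligns every allocation to a 4 KiB page boundary
// so that the runtime has the option of sharing pages with the device.
template <typename T>
struct AlignedAllocator {
  using value_type = T;
  static constexpr std::align_val_t kPageAlignment{4096};

  AlignedAllocator() = default;
  template <typename U>
  AlignedAllocator(const AlignedAllocator<U> &) noexcept {}

  T *allocate(std::size_t n) {
    return static_cast<T *>(::operator new(n * sizeof(T), kPageAlignment));
  }
  void deallocate(T *p, std::size_t) noexcept {
    ::operator delete(p, kPageAlignment);
  }
};

template <typename T, typename U>
bool operator==(const AlignedAllocator<T> &, const AlignedAllocator<U> &) {
  return true;
}
template <typename T, typename U>
bool operator!=(const AlignedAllocator<T> &, const AlignedAllocator<U> &) {
  return false;
}

template <typename T>
using AlignedVector = std::vector<T, AlignedAllocator<T>>;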
The following code shows how to print the memory addresses on the host, inside the buffer, and on the accelerator device inside the kernel.
int VectorAdd0(sycl::queue &q, AlignedVector<int> &a, AlignedVector<int> &b,
               AlignedVector<int> &sum, int iter) {
  sycl::range num_items{a.size()};
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  for (int i = 0; i < iter; i++) {
    sycl::buffer a_buf(a, props);
    sycl::buffer b_buf(b, props);
    sycl::buffer sum_buf(sum.data(), num_items, props);
    {
      sycl::host_accessor a_host_acc(a_buf);
      std::cout << "add0: buff memory address =" << a_host_acc.get_pointer()
                << "\n";
      std::cout << "add0: address of vector a = " << a.data() << "\n";
    }
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
      sycl::stream out(1024 * 1024, 1 * 128, h);

      h.parallel_for(num_items, [=](auto i) {
        if (i[0] == 0)
          out << "add0: dev addr = " << a_acc.get_pointer() << "\n";
        sum_acc[i] = a_acc[i] + b_acc[i];
      });
    });
  }
  q.wait();
  return (0);
}
When this program is run on an integrated GPU device with the use_host_ptr property set, the addresses of all three (on the host, in the buffer, and on the accelerator) are the same. For discrete GPU devices, the buffer and device addresses will be different. Also note that in the function signature of VectorAdd0, none of the incoming arguments are declared const. If they are declared const, the vectors are copied during buffer creation and new memory is allocated instead of reusing the memory of the host vectors. The code snippet below demonstrates this: when it is executed, the addresses of the incoming vectors differ from both the memory backing the buffer and the memory on the accelerator device.
int VectorAdd1(sycl::queue &q, const AlignedVector<int> &a,
               const AlignedVector<int> &b, AlignedVector<int> &sum, int iter) {
  sycl::range num_items{a.size()};
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  for (int i = 0; i < iter; i++) {
    sycl::buffer a_buf(a, props);
    sycl::buffer b_buf(b, props);
    sycl::buffer sum_buf(sum.data(), num_items, props);
    {
      sycl::host_accessor a_host_acc(a_buf);
      std::cout << "add1: buff memory address =" << a_host_acc.get_pointer()
                << "\n";
      std::cout << "add1: address of vector aa = " << a.data() << "\n";
    }
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
      sycl::stream out(16 * 1024, 16 * 1024, h);

      h.parallel_for(num_items, [=](auto i) {
        if (i[0] == 0)
          out << "add1: dev addr = " << a_acc.get_pointer() << "\n";
        sum_acc[i] = a_acc[i] + b_acc[i];
      });
    });
  }
  q.wait();
  return (0);
}
The kernel VectorAdd2 below will not incur the cost of copying memory contents from the buffer to the accelerator device, because the use_host_ptr property is set when the buffers are created and the buffers are aligned on a page boundary for an integrated GPU device. If the memory pointed to by a buffer is not aligned on a page boundary, new page-aligned memory is allocated and the contents of the buffer are copied into it. This new memory is then shared with the accelerator, either by copying the contents from the host buffer to the device (for accelerators that do not share any memory with the host) or through the page tables, which avoids a physical copy of the memory available on the device (for accelerators that share memory).
int VectorAdd2(sycl::queue &q, AlignedVector<int> &a, AlignedVector<int> &b,
               AlignedVector<int> &sum, int iter) {
  sycl::range num_items{a.size()};
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    sycl::buffer a_buf(a, props);
    sycl::buffer b_buf(b, props);
    sycl::buffer sum_buf(sum.data(), num_items, props);

    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add2 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}
The kernel VectorAdd3 below will incur the cost of copying memory contents between the host and the buffer, and also from the buffer to the accelerator.
int VectorAdd3(sycl::queue &q, const AlignedVector<int> &a,
               const AlignedVector<int> &b, AlignedVector<int> &sum, int iter) {
  sycl::range num_items{a.size()};

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    sycl::buffer a_buf(a);
    sycl::buffer b_buf(b);
    sycl::buffer sum_buf(sum.data(), num_items);

    auto e = q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add3 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}
Take care to avoid unnecessary copies both when creating buffers and when passing memory from the buffers to kernels. Even when the accelerator shares memory with the host, a few additional conditions, such as page alignment and non-const arguments, must be satisfied to avoid these extra copies.
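As a usage sketch, assuming the AlignedVector type shown earlier and the kernels above, the difference can be observed by running the two variants on the same page-aligned vectors and comparing the reported times; the vector length and iteration count here are illustrative values.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  constexpr size_t num_elements = 1 << 20;  // illustrative vector length
  constexpr int iterations = 16;            // illustrative repeat count

  sycl::queue q;
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";

  // Page-aligned host vectors, so use_host_ptr can avoid extra copies
  // on devices that share memory with the host.
  AlignedVector<int> a(num_elements, 1), b(num_elements, 2),
      sum(num_elements, 0);

  VectorAdd2(q, a, b, sum, iterations);  // use_host_ptr, non-const arguments
  VectorAdd3(q, a, b, sum, iterations);  // default buffers, const arguments
  return 0;
}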