Visible to Intel only — GUID: GUID-D3A3DE97-52BA-4C2D-B293-F628351DB984
Visible to Intel only — GUID: GUID-D3A3DE97-52BA-4C2D-B293-F628351DB984
Optimizing Data Transfers
Introduction
The previous section discussed minimizing data transfers between host and device. This section discusses ways to optimize data transfers when there is a need to move data between host and device.
When moving data between host and device, data transfer rate is maximized when both source and destination are in Unified Shared Memory (USM).
In SYCL, data transfers between host and device may be explicit via the use of the SYCL memcpy function, or implicit via the use of SYCL buffers and accessors.
Case 1: Data transfers using buffers
In the case of SYCL buffers constructed with a pointer to pre-allocated system memory (i.e., memory allocated using malloc or new), the SYCL runtime has the capability to automatically promote system memory to host USM at buffer creation, and to release it at buffer destruction. To enable this capability, a buffer has to be created with a hostptr, for example:
int* hostptr = (int*)malloc(4*sizeof(int));
buffer<int, 1> Buffer(hostptr,4);
And then the environment variable SYCL_USM_HOSTPTR_IMPORT=1 should be set at runtime.
Case 2: Data transfers using SYCL data movement APIs
When the host data allocation is under user control, the user may use USM functions such as malloc_host to allocate host USM memory instead of system memory (i.e., memory allocated using malloc or new), if the allocated pointer is going to be used in a data transfer API.
If the source code where the allocation is made is not available, or source code changes are prohibited for portability reasons, then system memory can be promoted (imported) to host USM before the first data transfer, and released from host USM after all data transfers are completed. A set of APIs are provided for explicit import/release.
Experimental SYCL Import/Release APIs
The following SYCL import and release APIs give the programmer explicit control over the range and duration of imports. The APIs are analogous to the malloc_host APIs:
void* sycl::prepare_for_device_copy
(void* ptr, size_t numBytes, sycl::context& syclContext)
void* sycl::prepare_for_device_copy
(void* ptr, size_t numBytes, sycl::queue& syclQueue)
void sycl::release_from_device_copy
(void* ptr, sycl::context& syclContext)
void sycl::release_from_device_copy
(void* ptr, sycl::queue& syclQueue)
The APIs are simple to use, but the onus is on the user to ensure correctness and safety with respect to the lifetime ranges.
Note that if numBytes is not the same as the size of the malloc’ed memory block (for example, if the malloc’ed memory is 1024 bytes, but numBytes is 512 or 2048), the guidance here would be to use a size for the import that matches the data transfer size since that much memory is expected to be allocated. If the true allocation is bigger, then importing less than the true allocation has no ill effects. Importing more than the true allocation is a user error (as is the transfer).
Note:
The import/release APIs are experimental and are subject to change.
SYCL Example
The following example, bench_l0import_sycl.cpp, measures the rate of data transfer between host and device. The data transfer size can be varied. The program prints device-to-host and host-to-device data transfer rate, measured in Gigabytes per second. Switches passed to the program are used to control data transfer direction, whether or not the SYCL import feature is used, and a range of transfer sizes. The switches are listed in the example source.
In the program, the import is done once at the beginning, and the release is done once at the end, using the following calls:
sycl::ext::oneapi::experimental::prepare_for_device_copy(
hostp, transfer_upper_limit, dq);
sycl::ext::oneapi::experimental::release_from_device_copy(hostp, dq);
In between the above calls, there is a loop that repeatedly does memcpy involving the imported pointer hostp.
#include <math.h>
#include <stdlib.h>
#include <chrono>
#include <time.h>
#include <unistd.h>
#include <sycl/sycl.hpp>
using namespace sycl;
static const char *usage_str =
"\n ze_bandwidth [OPTIONS]"
"\n"
"\n OPTIONS:"
"\n -t, string selectively run a particular test:"
"\n h2d or H2D run only Host-to-Device tests"
"\n d2h or D2H run only Device-to-Host tests "
"\n [default: both]"
"\n -q minimal output"
"\n [default: disabled]"
"\n -v enable verificaton"
"\n [default: disabled]"
"\n -i set number of iterations per transfer"
"\n [default: 500]"
"\n -s select only one transfer size (bytes) "
"\n -sb select beginning transfer size (bytes)"
"\n [default: 1]"
"\n -se select ending transfer size (bytes)"
"\n [default: 2^28]"
"\n -l use SYCL prepare_for_device_copy/release_from_device_copy APIs"
"\n [default: disabled]"
"\n -h, --help display help message"
"\n";
static uint32_t sanitize_ulong(char *in) {
unsigned long temp = strtoul(in, NULL, 0);
if (ERANGE == errno) {
fprintf(stderr, "%s out of range of type ulong\n", in);
} else if (temp > UINT32_MAX) {
fprintf(stderr, "%ld greater than UINT32_MAX\n", temp);
} else {
return static_cast<uint32_t>(temp);
}
return 0;
}
size_t transfer_lower_limit = 1;
size_t transfer_upper_limit = (1 << 28);
bool verify = false;
bool run_host2dev = true;
bool run_dev2host = true;
bool verbose = true;
bool l0import = false;
uint32_t ntimes = 500;
// kernel latency
int main(int argc, char **argv) {
for (int i = 1; i < argc; i++) {
if ((strcmp(argv[i], "-h") == 0) || (strcmp(argv[i], "--help") == 0)) {
std::cout << usage_str;
exit(0);
} else if (strcmp(argv[i], "-q") == 0) {
verbose = false;
} else if (strcmp(argv[i], "-v") == 0) {
verify = true;
} else if (strcmp(argv[i], "-l") == 0) {
l0import = true;
} else if (strcmp(argv[i], "-i") == 0) {
if ((i + 1) < argc) {
ntimes = sanitize_ulong(argv[i + 1]);
i++;
}
} else if (strcmp(argv[i], "-s") == 0) {
if ((i + 1) < argc) {
transfer_lower_limit = sanitize_ulong(argv[i + 1]);
transfer_upper_limit = transfer_lower_limit;
i++;
}
} else if (strcmp(argv[i], "-sb") == 0) {
if ((i + 1) < argc) {
transfer_lower_limit = sanitize_ulong(argv[i + 1]);
i++;
}
} else if (strcmp(argv[i], "-se") == 0) {
if ((i + 1) < argc) {
transfer_upper_limit = sanitize_ulong(argv[i + 1]);
i++;
}
} else if ((strcmp(argv[i], "-t") == 0)) {
run_host2dev = false;
run_dev2host = false;
if ((i + 1) >= argc) {
std::cout << usage_str;
exit(-1);
}
if ((strcmp(argv[i + 1], "h2d") == 0) ||
(strcmp(argv[i + 1], "H2D") == 0)) {
run_host2dev = true;
i++;
} else if ((strcmp(argv[i + 1], "d2h") == 0) ||
(strcmp(argv[i + 1], "D2H") == 0)) {
run_dev2host = true;
i++;
} else {
std::cout << usage_str;
exit(-1);
}
} else {
std::cout << usage_str;
exit(-1);
}
}
queue dq;
device dev = dq.get_device();
size_t max_compute_units = dev.get_info<info::device::max_compute_units>();
auto BE = dq.get_device()
.template get_info<sycl::info::device::opencl_c_version>()
.empty()
? "L0"
: "OpenCL";
if (verbose)
std::cout << "Device name " << dev.get_info<info::device::name>() << " "
<< "max_compute units"
<< " " << max_compute_units << ", Backend " << BE << "\n";
void *hostp;
posix_memalign(&hostp, 4096, transfer_upper_limit);
memset(hostp, 1, transfer_upper_limit);
if (l0import) {
if (verbose)
std::cout << "Doing L0 Import\n";
sycl::ext::oneapi::experimental::prepare_for_device_copy(
hostp, transfer_upper_limit, dq);
}
void *destp =
malloc_device<char>(transfer_upper_limit, dq.get_device(), dq.get_context());
dq.submit([&](handler &cgh) { cgh.memset(destp, 2, transfer_upper_limit); });
dq.wait();
if (run_host2dev) {
if (!verbose)
printf("SYCL USM API (%s)\n", BE);
for (size_t s = transfer_lower_limit; s <= transfer_upper_limit; s <<= 1) {
auto start_time = std::chrono::steady_clock::now();
for (int i = 0; i < ntimes; ++i) {
dq.submit([&](handler &cgh) { cgh.memcpy(destp, hostp, s); });
dq.wait();
}
auto end_time = std::chrono::steady_clock::now();
std::chrono::duration<double> seconds = end_time - start_time;
if (verbose)
printf("HosttoDevice: %8lu bytes, %7.3f ms, %8.3g GB/s\n", s,
1000 * seconds.count() / ntimes,
1e-9 * s / (seconds.count() / ntimes));
else
printf("%10.6f\n", 1e-9 * s / (seconds.count() / ntimes));
}
}
if (run_dev2host) {
if (!verbose)
printf("SYCL USM API (%s)\n", BE);
for (size_t s = transfer_lower_limit; s <= transfer_upper_limit; s <<= 1) {
auto start_time = std::chrono::steady_clock::now();
for (int i = 0; i < ntimes; ++i) {
dq.submit([&](handler &cgh) { cgh.memcpy(hostp, destp, s); });
dq.wait();
}
auto end_time = std::chrono::steady_clock::now();
std::chrono::duration<double> seconds = end_time - start_time;
if (verbose)
printf("DeviceToHost: %8lu bytes, %7.3f ms, %8.3g GB/s\n", s,
seconds.count(), 1e-9 * s / (seconds.count() / ntimes));
else
printf("%10.6f\n", 1e-9 * s / (seconds.count() / ntimes));
}
}
if (l0import)
sycl::ext::oneapi::experimental::release_from_device_copy(hostp, dq);
free(hostp);
free(destp, dq.get_context());
}
Compilation command:
$ icpx -fsycl bench_l0import_sycl.cpp
Example run command:
$ a.out -l -t h2d -s 256000000
The above run command specifies doing the import (-l), host-to-device (-t h2d), and transferring 256 million bytes (-s 256000000).
Date Transfer Rate Measurements
We use the program, bench_l0import_sycl.cpp, to measure the rate of data transfer between host and device (with and without import/release). The transfer rate for a data size of 256 million bytes when running on the particular GPU used (1-stack only) was as follows.
Run Command |
Transfer Direction |
|
|
---|---|---|---|
a.out -t h2d -s 256000000 |
host to device |
No |
26.9 |
a.out -l -t h2d -s 256000000 |
host to device |
Yes |
45.4 |
a.out -t d2h -s 256000000 |
device to host |
No |
33.2 |
a.out -l -t d2h -s 256000000 |
device to host |
Yes |
48.0 |
Performance Recommendations
For optimal data transfer between host and device, we recommend the following approaches:
Allocate data that will be the source or destination of data transfer between host and device in host USM memory. This can be done using the API malloc_host in SYCL. This way, no extra copies are necessary and the data transfer rate is optimal.
If the above option (1) is not feasible, then use explicit import/ release APIs. In SYCL, use the prepare_for_device_copy and release_from_device_copy APIs listed above.