Visible to Intel only — GUID: GUID-6BAD4C6A-52BF-4DBE-ABF6-417BFC0A311D
Visible to Intel only — GUID: GUID-6BAD4C6A-52BF-4DBE-ABF6-417BFC0A311D
Memory Allocation
This section looks at various ways of allocating memory, and the types of allocations that are supported. A pointer on the host has the same size as a pointer on the device.
Host allocations are owned by the host and are intended to be allocated out of system memory. Host allocations are accessible by the host and all supported devices. Therefore, the same pointer to a host allocation may be used on the host and all supported devices. Host allocations are not expected to migrate between system memory and device-local memory. When a pointer to a host allocation is accessed on a device, data is typically sent over a bus, such as PCI-Express, that connects the device to the host.
Device allocations are owned by a specific device and are intended to be allocated out of device-local memory. Storage allocated can be read from and written to on that device, but is not directly accessible from the host or any other supported devices.
Shared allocations are accessible by the host and all supported devices. So the same pointer to a shared allocation may be used on the host and all supported devices, like in a host allocation. Shared allocations, however, are not owned by any particular device, but are intended to migrate between the host and one or more devices. This means that accesses on a device, after the migration has occurred, happen from much faster device-local memory instead of remotely accessing system memory though the higher-latency bus connection.
Shared-system allocations are a sub-class of shared allocations, where the memory is allocated by a system allocator (such as malloc or new) rather than by an allocation API (such as the OpenMP memory allocation API). Shared-system allocations have no associated device; they are inherently cross-device. Like other shared allocations, Shared-system allocations are intended to migrate between the host and supported devices, and the same pointer to a shared-system allocation may be used on the host and all supported devices.
Note:
Currently, shared-system allocations are not supported on Intel® Data Center GPU Max Series systems. However, shared allocations where memory is allocated by an allocation API are supported.
The following table summarizes the characteristics of the various types of memory allocation.
Type of allocation |
Initial location |
Accessible on host? |
Accessible on device? |
---|---|---|---|
Host |
Host |
Yes |
Yes |
Device |
Device |
No |
Yes |
Shared |
Host, Device, or Unspecified |
Yes |
Yes |
Shared-System |
Host |
Yes |
Yes |
Host allocations offer wide accessibility (can be accessed directly from the host and all supported devices), but have potentially high per-access costs because data is typically sent over a bus such as PCI Express*.
Shared allocations also offer wide accessibility, but the per-access costs are potentially lower than host allocations, because data is migrated to the accessing device.
Device allocations have access limitations (cannot be accessed directly from the host or other supported devices), but offer higher performance because accesses are to device-local memory.
OpenMP Runtime Routines for Memory Allocation
Intel compilers support a number of OpenMP runtime routines for performing memory allocations. These routines are shown in the table below.
OpenMP memory allocation routine |
Intel extension? |
Type of allocation |
---|---|---|
omp_target_alloc |
No |
Device |
omp_target_alloc_device |
Yes |
Device |
omp_target_alloc_host |
Yes |
Host |
omp_target_alloc_shared |
Yes |
Shared |
Note that the three routines omp_target_alloc_device, omp_target_alloc_host, and omp_target_alloc_shared are Intel extensions to the OpenMP specification.
The following examples use the above OpenMP memory allocation routines. Compare those to the ones using map clauses.
For more information about memory allocation, see:
The SYCL part of this guide
Using the map Clause
The first example uses map clauses to allocate memory on a device and copy data between the host and the device.
In the following example, arrays A, B, and C are allocated in system memory by calling the C/C++ standard library routine, malloc.
The target construct on line 58 is the main kernel that computes the values of array C on the device. The map(tofrom: C[0:length) clause is specified on this target construct since the values of C need to be transferred from the host to the device before the computation, and from the device to the host at the end of the computation. The map(to: A[0:length], B[0:length]) is specified for arrays``A`` and B since the values of these arrays need to be transferred from the host to the device, and the device only reads these values. Under the covers, the map clauses cause storage for the arrays to be allocated on the device and data to be copied from the host to the device, and vice versa.
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <math.h> #include <omp.h> #define iterations 100 #define length 64*1024*1024 int main(void) { size_t bytes = length*sizeof(double); double * __restrict A; double * __restrict B; double * __restrict C; double scalar = 3.0; double nstream_time = 0.0; // Allocate arrays on the host using plain malloc() A = (double *) malloc(bytes); if (A == NULL){ printf(" ERROR: Cannot allocate space for A using plain malloc().\n"); exit(1); } B = (double *) malloc(bytes); if (B == NULL){ printf(" ERROR: Cannot allocate space for B using plain malloc().\n"); exit(1); } C = (double *) malloc(bytes); if (C == NULL){ printf(" ERROR: Cannot allocate space for C using plain malloc().\n"); exit(1); } // Initialize the arrays #pragma omp parallel for for (size_t i=0; i<length; i++) { A[i] = 2.0; B[i] = 2.0; C[i] = 0.0; } // Perform the computation nstream_time = omp_get_wtime(); for (int iter = 0; iter<iterations; iter++) { #pragma omp target teams distribute parallel for \ map(to: A[0:length], B[0:length]) \ map(tofrom: C[0:length]) for (size_t i=0; i<length; i++) { C[i] += A[i] + scalar * B[i]; } } nstream_time = omp_get_wtime() - nstream_time; // Validate and output results double ar = 2.0; double br = 2.0; double cr = 0.0; for (int iter = 0; iter<iterations; iter++) { for (int i=0; i<length; i++) { cr += ar + scalar * br; } } double asum = 0.0; #pragma omp parallel for reduction(+:asum) for (size_t i=0; i<length; i++) { asum += fabs(C[i]); } free(A); free(B); free(C); double epsilon=1.e-8; if (fabs(cr - asum)/asum > epsilon) { printf("Failed Validation on output array\n" " Expected checksum: %lf\n" " Observed checksum: %lf\n" "ERROR: solution did not validate\n", cr, asum); return 1; } else { printf("Solution validates\n"); double avgtime = nstream_time/iterations; printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime); } return 0; }
Compilation command:
icpx -fiopenmp -fopenmp-targets=spir64 test_target_map.cpp
Run command:
OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0 LIBOMPTARGET_DEBUG=1 ./a.out
The map clauses on the target construct inside the iterations loop cause data (values of A, B, C) to be transferred from the host to the device at the beginning of each target region, and cause data (values of C) to be transferred from the device to the host at the end of each target region. These data transfers incur a significant performance overhead. A better approach using map clauses would be to put the whole iterations loop inside a target data construct with the map clauses. This causes the transfers to occur once at the beginning of the iterations loop, and another time at the end of the iterations loop. The modified example using target data and map clauses is shown below.
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <math.h> #include <omp.h> #define iterations 100 #define length 64*1024*1024 int main(void) { size_t bytes = length*sizeof(double); double * __restrict A; double * __restrict B; double * __restrict C; double scalar = 3.0; double nstream_time = 0.0; // Allocate arrays on the host using plain malloc() A = (double *) malloc(bytes); if (A == NULL){ printf(" ERROR: Cannot allocate space for A using plain malloc().\n"); exit(1); } B = (double *) malloc(bytes); if (B == NULL){ printf(" ERROR: Cannot allocate space for B using plain malloc().\n"); exit(1); } C = (double *) malloc(bytes); if (C == NULL){ printf(" ERROR: Cannot allocate space for C using plain malloc().\n"); exit(1); } // Initialize the arrays #pragma omp parallel for for (size_t i=0; i<length; i++) { A[i] = 2.0; B[i] = 2.0; C[i] = 0.0; } // Perform the computation nstream_time = omp_get_wtime(); #pragma omp target data map(to: A[0:length], B[0:length]) \ map(tofrom: C[0:length]) { for (int iter = 0; iter<iterations; iter++) { #pragma omp target teams distribute parallel for for (size_t i=0; i<length; i++) { C[i] += A[i] + scalar * B[i]; } } } nstream_time = omp_get_wtime() - nstream_time; // Validate and output results double ar = 2.0; double br = 2.0; double cr = 0.0; for (int iter = 0; iter<iterations; iter++) { for (int i=0; i<length; i++) { cr += ar + scalar * br; } } double asum = 0.0; #pragma omp parallel for reduction(+:asum) for (size_t i=0; i<length; i++) { asum += fabs(C[i]); } free(A); free(B); free(C); double epsilon=1.e-8; if (fabs(cr - asum)/asum > epsilon) { printf("Failed Validation on output array\n" " Expected checksum: %lf\n" " Observed checksum: %lf\n" "ERROR: solution did not validate\n", cr, asum); return 1; } else { printf("Solution validates\n"); double avgtime = nstream_time/iterations; printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime); } return 0; }
omp_target_alloc
Next, the example above is modified to use device allocations instead of map clauses. Storage for arrays A, B, and C is directly allocated on the device by calling the OpenMP runtime routine omp_target_alloc. The routine takes two arguments: the number of bytes to allocate on the device, and the number of the device on which to allocate the storage. The routine returns a device pointer that references the device address of the storage allocated on the device. If the call to omp_target_alloc returns NULL, then this indicates that the allocation was not successful.
To access the allocated memory in a target construct, the device pointer returned by a call to omp_target_alloc is listed in an is_device_ptr clause on the target construct. This ensures that there is no data transfer before and after kernel execution since the kernel operates on data that is already on the device.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C on the device.
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <math.h> #include <omp.h> #define iterations 100 #define length 64*1024*1024 int main(void) { int device_id = omp_get_default_device(); size_t bytes = length*sizeof(double); double * __restrict A; double * __restrict B; double * __restrict C; double scalar = 3.0; double nstream_time = 0.0; // Allocate arrays in device memory A = (double *) omp_target_alloc(bytes, device_id); if (A == NULL){ printf(" ERROR: Cannot allocate space for A using omp_target_alloc().\n"); exit(1); } B = (double *) omp_target_alloc(bytes, device_id); if (B == NULL){ printf(" ERROR: Cannot allocate space for B using omp_target_alloc().\n"); exit(1); } C = (double *) omp_target_alloc(bytes, device_id); if (C == NULL){ printf(" ERROR: Cannot allocate space for C using omp_target_alloc().\n"); exit(1); } // Initialize the arrays #pragma omp target teams distribute parallel for \ is_device_ptr(A,B,C) for (size_t i=0; i<length; i++) { A[i] = 2.0; B[i] = 2.0; C[i] = 0.0; } // Perform the computation 'iterations' number of times nstream_time = omp_get_wtime(); for (int iter = 0; iter<iterations; iter++) { #pragma omp target teams distribute parallel for \ is_device_ptr(A,B,C) for (size_t i=0; i<length; i++) { C[i] += A[i] + scalar * B[i]; } } nstream_time = omp_get_wtime() - nstream_time; // Validate and output results double ar = 2.0; double br = 2.0; double cr = 0.0; for (int iter = 0; iter<iterations; iter++) { for (int i=0; i<length; i++) { cr += ar + scalar * br; } } double asum = 0.0; #pragma omp target teams distribute parallel for reduction(+:asum) \ map(tofrom: asum) is_device_ptr(C) for (size_t i=0; i<length; i++) { asum += fabs(C[i]); } omp_target_free(A, device_id); omp_target_free(B, device_id); omp_target_free(C, device_id); double epsilon=1.e-8; if (fabs(cr - asum)/asum > epsilon) { printf("Failed Validation on output array\n" " Expected checksum: %lf\n" " Observed checksum: %lf\n" "ERROR: solution did not validate\n", cr, asum); return 1; } else { printf("Solution validates\n"); double avgtime = nstream_time/iterations; printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime); } return 0; }
Notes:
When calling omp_target_alloc, the device number specified must be one of the supported devices, other than the host device. This will be the device on which storage will be allocated.
Since the arrays A, B, and C are not accessible from the host, the initialization of the arrays, kernel execution, and summation of elements of C all need to be done inside OpenMP target regions.
A device allocation can only be accessed by the device specified in the omp_target_alloc call, but may be copied to memory allocated on the host or other devices by calling omp_target_memcpy.
omp_target_alloc_device
The Intel extension omp_target_alloc_device is similar to omp_target_alloc. It is also called with two arguments: the number of bytes to allocate on the device, and the number of the device on which to allocate the storage. The routine returns a device pointer that references the device address of the storage allocated on the device. If the call to omp_target_alloc_device returns NULL, then this indicates that the allocation was not successful.
The above omp_target_alloc example can be rewritten using omp_target_alloc_device by simply replacing the call to omp_target_alloc with a call to omp_targer_alloc_device as shown below.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C on the device.
// Allocate arrays in device memory A = (double *) omp_target_alloc_device(bytes, device_id); if (A == NULL){ printf(" ERROR: Cannot allocate space for A using omp_target_alloc_device().\n"); exit(1); } B = (double *) omp_target_alloc_device(bytes, device_id); if (B == NULL){ printf(" ERROR: Cannot allocate space for B using omp_target_alloc_device().\n"); exit(1); } C = (double *) omp_target_alloc_device(bytes, device_id); if (C == NULL){ printf(" ERROR: Cannot allocate space for C using omp_target_alloc_device().\n"); exit(1); }
Note:
All of the above Notes that apply to omp_target_alloc also apply to omp_target_alloc_device.
omp_target_alloc_host
The above example can also be rewritten by doing a host allocation for A, B, and C. This allows the memory to be accessible to the host and all supported devices.
In the following modified example, the omp_target_alloc_host runtime routine (an Intel extension) is called to allocate storage for each of the arrays A, B, and C. The routine takes two arguments: the number of bytes to allocate, and a device number. The device number must be one of the supported devices, other than the host device. The routine returns a pointer to a storage location in host memory. If the call to omp_target_alloc_host returns NULL, this indicates that the allocation was not successful.
Note the directive requires unified_address is specified at the top of the program. This requires that the implementation guarantee that all devices accessible through OpenMP API routines and directives use a unified address space. In this address space, a pointer will always refer to the same location in memory from all devices, and the is_device_ptr clause is not necessary to obtain device addresses from device pointers for use inside target regions. When using Intel compilers, the requires unified_address directive is actually not needed, since unified address space is guaranteed by default. However, for portability the code includes the directive.
The pointer returned by a call to omp_target_alloc_host can be used to access the storage from the host and all supported devices. No map clauses and no is_device_ptr clauses are needed on a target construct to access the memory from a device since a unified address space is used.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C.
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <math.h> #include <omp.h> #pragma omp requires unified_address #define iterations 100 #define length 64*1024*1024 int main(void) { int device_id = omp_get_default_device(); size_t bytes = length*sizeof(double); double * __restrict A; double * __restrict B; double * __restrict C; double scalar = 3.0; double nstream_time = 0.0; // Allocate arrays in host memory A = (double *) omp_target_alloc_host(bytes, device_id); if (A == NULL){ printf(" ERROR: Cannot allocate space for A using omp_target_alloc_host().\n"); exit(1); } B = (double *) omp_target_alloc_host(bytes, device_id); if (B == NULL){ printf(" ERROR: Cannot allocate space for B using omp_target_alloc_host().\n"); exit(1); } C = (double *) omp_target_alloc_host(bytes, device_id); if (C == NULL){ printf(" ERROR: Cannot allocate space for C using omp_target_alloc_host().\n"); exit(1); } // Initialize the arrays #pragma omp parallel for for (size_t i=0; i<length; i++) { A[i] = 2.0; B[i] = 2.0; C[i] = 0.0; } // Perform the computation nstream_time = omp_get_wtime(); for (int iter = 0; iter<iterations; iter++) { #pragma omp target teams distribute parallel for for (size_t i=0; i<length; i++) { C[i] += A[i] + scalar * B[i]; } } nstream_time = omp_get_wtime() - nstream_time; // Validate and output results double ar = 2.0; double br = 2.0; double cr = 0.0; for (int iter = 0; iter<iterations; iter++) { for (int i=0; i<length; i++) { cr += ar + scalar * br; } } double asum = 0.0; #pragma omp parallel for reduction(+:asum) for (size_t i=0; i<length; i++) { asum += fabs(C[i]); } omp_target_free(A, device_id); omp_target_free(B, device_id); omp_target_free(C, device_id); double epsilon=1.e-8; if (fabs(cr - asum)/asum > epsilon) { printf("Failed Validation on output array\n" " Expected checksum: %lf\n" " Observed checksum: %lf\n" "ERROR: solution did not validate\n", cr, asum); return 1; } else { printf("Solution validates\n"); double avgtime = nstream_time/iterations; printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime); } return 0; }
Notes:
When calling omp_target_alloc_host, the device number specified must be one of the supported devices, other than the host device.
Since the arrays A, B, and C are accessible from the host and device, the initialization of the arrays and summation of elements of C may be done either on the host (outside of a target construct) or on the device (inside a target construct).
Intel® Data Center GPU Max Series does not support atomic operations (or algorithms that use atomic operations, such as some reductions) on host allocations (i.e., memory allocated via omp_target_alloc_host). Use atomic operations on memory allocated via omp_target_alloc_device, instead.
omp_target_alloc_shared
The above example is modified so that shared allocations are used instead of host allocations. The omp_target_alloc_shared runtime routine is called to allocate storage for each of arrays A, B, and C. The routine takes two arguments: the number of bytes to allocate on the device, and a device number. The device number must be one of the supported devices, other than the host device. The routine returns a pointer to a storage location in shared memory. If the call to omp_target_alloc_shared returns NULL, then this indicates that the allocation was not successful.
Note the requires unified_address directive is specified at the top of the program, for portability.
The pointer returned by a call to omp_target_alloc_shared can be used to access the storage from the host and all supported devices. No map clauses and no is_device_ptr clauses are needed on a target construct to access the memory from a device since a unified address space is used.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C.
// Allocate arrays in shared memory A = (double *) omp_target_alloc_shared(bytes, device_id); if (A == NULL){ printf(" ERROR: Cannot allocate space for A using omp_target_alloc_shared().\n"); exit(1); } B = (double *) omp_target_alloc_shared(bytes, device_id); if (B == NULL){ printf(" ERROR: Cannot allocate space for B using omp_target_alloc_shared().\n"); exit(1); } C = (double *) omp_target_alloc_shared(bytes, device_id); if (C == NULL){ printf(" ERROR: Cannot allocate space for C using omp_target_alloc_shared().\n"); exit(1); }
Notes:
When calling omp_target_alloc_shared, the device number specified must be one of the supported devices, other than the host device.
Since the arrays are accessible from the host and device, the initialization and verification may be done either on the host or on the device (inside a target construct).
Concurrent access from host and device to memory allocated via omp_target_alloc_shared is not supported.
omp_target_memcpy
The following example shows how the runtime routine omp_target_memcpy may be used to copy memory from host to device, and from device to host. First arrays h_A, h_B, and h_C are allocated in system memory using plain malloc, and then initialized. Corresponding arrays d_A, d_B, and d_C are allocated on the device using omp_target_alloc.
Before the start of the target construct on line 104, the values in h_A, h_B, and h_C are copied to d_A, d_B, and d_C by calling omp_target_memcpy. After the target region, new d_C values computed on the device are copied to h_C by calling omp_target_memcpy.
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <math.h> #include <omp.h> #define iterations 100 #define length 64*1024*1024 int main(void) { int device_id = omp_get_default_device(); int host_id = omp_get_initial_device(); size_t bytes = length*sizeof(double); double * __restrict h_A; double * __restrict h_B; double * __restrict h_C; double * __restrict d_A; double * __restrict d_B; double * __restrict d_C; double scalar = 3.0; double nstream_time = 0.0; // Allocate arrays h_A, h_B, and h_C on the host using plain malloc() h_A = (double *) malloc(bytes); if (h_A == NULL){ printf(" ERROR: Cannot allocate space for h_A using plain malloc().\n"); exit(1); } h_B = (double *) malloc(bytes); if (h_B == NULL){ printf(" ERROR: Cannot allocate space for h_B using plain malloc().\n"); exit(1); } h_C = (double *) malloc(bytes); if (h_C == NULL){ printf(" ERROR: Cannot allocate space for h_C using plain malloc().\n"); exit(1); } // Allocate arrays d_A, d_B, and d_C on the device using omp_target_alloc() d_A = (double *) omp_target_alloc(bytes, device_id); if (d_A == NULL){ printf(" ERROR: Cannot allocate space for d_A using omp_target_alloc().\n"); exit(1); } d_B = (double *) omp_target_alloc(bytes, device_id); if (d_B == NULL){ printf(" ERROR: Cannot allocate space for d_B using omp_target_alloc().\n"); exit(1); } d_C = (double *) omp_target_alloc(bytes, device_id); if (d_C == NULL){ printf(" ERROR: Cannot allocate space for d_C using omp_target_alloc().\n"); exit(1); } // Initialize the arrays on the host #pragma omp parallel for for (size_t i=0; i<length; i++) { h_A[i] = 2.0; h_B[i] = 2.0; h_C[i] = 0.0; } // Call omp_target_memcpy() to copy values from host to device int rc = 0; rc = omp_target_memcpy(d_A, h_A, bytes, 0, 0, device_id, host_id); if (rc) { printf("ERROR: omp_target_memcpy(A) returned %d\n", rc); exit(1); } rc = omp_target_memcpy(d_B, h_B, bytes, 0, 0, device_id, host_id); if (rc) { printf("ERROR: omp_target_memcpy(B) returned %d\n", rc); exit(1); } rc = omp_target_memcpy(d_C, h_C, bytes, 0, 0, device_id, host_id); if (rc) { printf("ERROR: omp_target_memcpy(C) returned %d\n", rc); exit(1); } // Perform the computation nstream_time = omp_get_wtime(); for (int iter = 0; iter<iterations; iter++) { #pragma omp target teams distribute parallel for \ is_device_ptr(d_A,d_B,d_C) for (size_t i=0; i<length; i++) { d_C[i] += d_A[i] + scalar * d_B[i]; } } nstream_time = omp_get_wtime() - nstream_time; // Call omp_target_memcpy() to copy values from device to host rc = omp_target_memcpy(h_C, d_C, bytes, 0, 0, host_id, device_id); if (rc) { printf("ERROR: omp_target_memcpy(A) returned %d\n", rc); exit(1); } // Validate and output results double ar = 2.0; double br = 2.0; double cr = 0.0; for (int iter = 0; iter<iterations; iter++) { for (int i=0; i<length; i++) { cr += ar + scalar * br; } } double asum = 0.0; #pragma omp parallel for reduction(+:asum) for (size_t i=0; i<length; i++) { asum += fabs(h_C[i]); } free(h_A); free(h_B); free(h_C); omp_target_free(d_A, device_id); omp_target_free(d_B, device_id); omp_target_free(d_C, device_id); double epsilon=1.e-8; if (fabs(cr - asum)/asum > epsilon) { printf("Failed Validation on output array\n" " Expected checksum: %lf\n" " Observed checksum: %lf\n" "ERROR: solution did not validate\n", cr, asum); return 1; } else { printf("Solution validates\n"); double avgtime = nstream_time/iterations; printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime); } return 0; }
Performance Considerations
In the above examples (using the map clause, omp_target_alloc, omp_target_alloc_device, omp_target_alloc_host, omp_target_alloc_shared, omp_target_memcpy), the main kernel is the target construct that computes the values of array C. To get more accurate timings, this target construct is enclosed in a loop, so the offload happens iterations number of times (where iterations = 100). The average kernel time is computed by dividing the total time taken by the iterations loop by 100.
// Perform the computation 'iterations' number of times nstream_time = omp_get_wtime(); for (int iter = 0; iter<iterations; iter++) { #pragma omp target teams distribute parallel for \ is_device_ptr(A,B,C) for (size_t i=0; i<length; i++) { C[i] += A[i] + scalar * B[i]; } } nstream_time = omp_get_wtime() - nstream_time;
LIBOMPTARGET_DEBUG=1 output shows that all the above examples have the same ND_range partitioning.
Target LEVEL0 RTL --> Allocated a device memory 0xff00000020200000 Libomptarget --> omp_target_alloc returns device ptr 0xff00000020200000 Libomptarget --> Call to omp_target_alloc for device 0 requesting 536870912 bytes Libomptarget --> Call to omp_get_num_devices returning 1 Libomptarget --> Call to omp_get_initial_device returning 1 Libomptarget --> Checking whether device 0 is ready. Libomptarget --> Is the device 0 (local ID 0) initialized? 1 Libomptarget --> Device 0 is ready to use.
The following table shows the average times taken by the kernel in the various versions when running on the particular GPU used (1-stack only).
Version |
Time (seconds) |
---|---|
map |
0.183604 |
map + target data |
0.012757 |
omp_target_alloc |
0.002501 |
omp_target_alloc_device |
0.002499 |
omp_target_alloc_host |
0.074412 |
omp_target_alloc_shared |
0.012491 |
omp_target_memcpy |
0.011072 |
The above performance numbers show that the map version is the slowest version (0.183604 seconds). This is because of the data transfers that occur at the beginning and end of each kernel launch. The main kernel is launched 100 times. At the beginning of each kernel launch, storage for arrays A, B and C is allocated on the device, and the values of these arrays are copied from the host to the device. At the end of the kernel, the values of array C are copied from the device to the host. Putting the whole iterations loop inside a target data construct with map clauses reduced the runtime to 0.012757 seconds, because the transfers occur once at the launch of the first kernel in the iterations loop, and again after the last kernel in that loop.
The omp_target_alloc and omp_target_alloc_device versions have the best performance (0.002501 and 0.002499 seconds, respectively). In these versions, storage for A, B, and C is allocated directly in device memory, so accesses on the device happen from device-local memory. This is a useful model for applications that use scratch arrays on the device side. These arrays never need to be accessed on the host. In such cases, the recommendation is to allocate the scratch arrays on the device and not worry about data transfers, as illustrated in this example.
The omp_target_alloc_shared version also performs well, but is somewhat slower (0.012491 seconds). In this version, storage for A, B, and C is allocated in shared memory. So the data can migrate between the host and the device. There is the overhead of migration but, after migration, accesses on the device happen from much faster device-local memory. In this version, the initialization of the arrays happens on the host. At the first kernel launch, the arrays are migrated to the device, and the kernels access the arrays locally on the device. Finally, before the host performs the reduction computation, the entire C array is migrated back to the host.
The omp_target_alloc_host version (0.074412 seconds) takes almost 6x more time than the omp_target_alloc_shared version. This is because data allocated in host memory does not migrate from the host to the device. When the kernel tries to access the data, the data is typically sent over a bus, such as PCI Express, that connects the device to the host. This is slower than accessing local device memory. If the device accesses only a small part of an array infrequently, then that array may be allocated in host memory using omp_target_alloc_host. However, if the array is accessed frequently on the device side, then it should be kept in device memory. Keeping the data in host memory and accessing it over the PCI will degrade performance.
Finally, a note regarding data transfers: The amount of data transferred in the map version can be seen in LIBOMPTARGET_DEBUG=1 output by grepping for "Libomptarget --> Moving". Notice that each launch of the main kernel yields the following data transfers:
$ grep "Libomptarget --> Moving" test_target_map.debug Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a5fc8b010) -> (tgt:0xff00000000200000) Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a9fc8d010) -> (tgt:0xff00000020200000) Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a7fc8c010) -> (tgt:0xff00000040200000) Libomptarget --> Moving 536870912 bytes (tgt:0xff00000000200000) -> (hst:0x00007f1a5fc8b010)
On the other hand, data transfers in the omp_target_alloc_... versions are handled by a lower layer of the runtime system. So grepping for "Libomptarget --> Moving" in LIBOMPTARGET_DEBUG=1 output for these versions will not show the data transfers that took place.
Fortran Examples
The Fortran version of the example using target data and map clauses is shown below.
program main use iso_fortran_env use omp_lib implicit none integer, parameter :: iterations=100 integer, parameter :: length=64*1024*1024 real(kind=REAL64), parameter :: epsilon=1.D-8 real(kind=REAL64), allocatable :: A(:) real(kind=REAL64), allocatable :: B(:) real(kind=REAL64), allocatable :: C(:) real(kind=REAL64) :: scalar=3.0 real(kind=REAL64) :: ar, br, cr, asum real(kind=REAL64) :: nstream_time, avgtime integer :: err, i, iter ! ! Allocate arrays on the host using plain allocate allocate( A(length), stat=err ) if (err .ne. 0) then print *, "Allocation of A returned ", err stop 1 endif allocate( B(length), stat=err ) if (err .ne. 0) then print *, "Allocation of B returned ", err stop 1 endif allocate( C(length), stat=err ) if (err .ne. 0) then print *, "Allocation of C returned ", err stop 1 endif ! ! Initialize the arrays !$omp parallel do do i = 1, length A(i) = 2.0 B(i) = 2.0 C(i) = 0.0 end do ! ! Perform the computation nstream_time = omp_get_wtime() !$omp target data map(to: A, B) map(tofrom: C) do iter = 1, iterations !$omp target teams distribute parallel do do i = 1, length C(i) = C(i) + A(i) + scalar * B(i) end do end do !$omp end target data nstream_time = omp_get_wtime() - nstream_time ! ! Validate and output results ar = 2.0 br = 2.0 cr = 0.0 do iter = 1, iterations do i = 1, length cr = cr + ar + scalar * br end do end do asum = 0.0 !$omp parallel do reduction(+:asum) do i = 1, length asum = asum + abs(C(i)) end do if (abs(cr - asum)/asum > epsilon) then write(*,110) "Failed Validation on output array: Expected =", cr, ", Observed =", asum else avgtime = nstream_time/iterations write(*,120) "Solution validates: Checksum =", asum, ", Avg time (s) =", avgtime endif 110 format (A, F20.6, A, F20.6) 120 format (A, F20.6, A, F10.6) deallocate(A) deallocate(B) deallocate(C) end program main
The Fortran version of the example using omp_target_alloc_device is shown below. In this example, allocate directives, with the allocator omp_target_device_mem_alloc, are used to allocate arrays A, B, and C on the device. The use_device_addr(A, B, C) clause is used on the target data directive (line 37) to indicate that the arrays have device addresses, and these addresses should be used in the target region.
use iso_fortran_env use omp_lib implicit none integer, parameter :: iterations=100 integer, parameter :: length=64*1024*1024 real(kind=REAL64), parameter :: epsilon=1.D-8 real(kind=REAL64), allocatable :: A(:) real(kind=REAL64), allocatable :: B(:) real(kind=REAL64), allocatable :: C(:) real(kind=REAL64) :: scalar=3.0 real(kind=REAL64) :: ar, br, cr, asum real(kind=REAL64) :: nstream_time, avgtime integer :: i, iter ! ! Allocate arrays in device memory !$omp allocators allocate(allocator(omp_target_device_mem_alloc): A) allocate(A(length)) !$omp allocators allocate(allocator(omp_target_device_mem_alloc): B) allocate(B(length)) !$omp allocators allocate(allocator(omp_target_device_mem_alloc): C) allocate(C(length)) ! ! Begin target data !$omp target data use_device_addr(A, B, C) ! ! Initialize the arrays !$omp target teams distribute parallel do do i = 1, length A(i) = 2.0 B(i) = 2.0 C(i) = 0.0 end do ! ! Perform the computation nstream_time = omp_get_wtime() do iter = 1, iterations !$omp target teams distribute parallel do do i = 1, length C(i) = C(i) + A(i) + scalar * B(i) end do end do nstream_time = omp_get_wtime() - nstream_time ! ! Validate and output results ar = 2.0 br = 2.0 cr = 0.0 do iter = 1, iterations do i = 1, length cr = cr + ar + scalar * br end do end do asum = 0.0 !$omp target teams distribute parallel do reduction(+:asum) & !$omp map(tofrom: asum) do i = 1, length asum = asum + abs(C(i)) end do ! ! End target data !$omp end target data if (abs(cr - asum)/asum > epsilon) then write(*,110) "Failed Validation on output array: Expected =", cr, ", Observed =", asum else avgtime = nstream_time/iterations write(*,120) "Solution validates: Checksum =", asum, ", Avg time (s) =", avgtime endif 110 format (A, F20.6, A, F20.6) 120 format (A, F20.6, A, F10.6) deallocate(A) deallocate(B) deallocate(C) end program main