Work Scheduling and Memory Distribution
The root-device driver uses deterministic heuristics to distribute work-groups and memory pages to all stacks when implicit scaling is used. These heuristics are described in the next two sections.
Memory Coloring
Any allocation in SYCL or OpenMP that corresponds to a shared or device allocation is colored across all stacks, meaning that the allocation is divided into number-of-stacks chunks that are distributed round-robin between the stacks. Consider this root-device allocation:
OpenMP:
int *a = (int*)omp_target_alloc( sizeof(int)*N, device_id );
SYCL:
int *a = sycl::malloc_device<int>(N, q);
For a 2-stack root-device, the first half (elements a[0] to a[N/2-1]) is physically allocated on stack-0, and the remaining half (elements a[N/2] to a[N-1]) is located on stack-1. In the future, we will introduce memory allocation APIs that allow user-defined memory coloring.
Note:
Memory coloring described above is applied at page size granularity.
An allocation containing three pages has two pages resident on stack-0.
Allocations smaller than or equal to the page size are resident on stack-0 only.
Using a memory pool that is based on a single allocation breaks the memory coloring logic. It is recommended that applications create one allocation per object so that each object's data is distributed across all stacks.
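To make the page-granularity coloring concrete, the sketch below computes which stack owns a given byte of a colored allocation on a 2-stack root-device. The page size of 64 KiB, the helper name stack_of, and the assignment of the extra page to stack-0 are illustrative assumptions rather than driver behavior guarantees; the real page size is implementation-defined.
// Sketch only: models the page-granularity coloring described above.
#include <cstddef>
#include <cstdio>

constexpr std::size_t kPageSize  = 64 * 1024; // assumed page size; the real value is implementation-defined
constexpr int         kNumStacks = 2;         // 2-stack root-device

// Returns the stack assumed to own the page containing byte_offset of an
// allocation of alloc_bytes bytes: pages are split into contiguous
// per-stack chunks, with the extra page (if any) going to stack-0.
int stack_of(std::size_t byte_offset, std::size_t alloc_bytes) {
  std::size_t pages = (alloc_bytes + kPageSize - 1) / kPageSize;
  std::size_t pages_per_stack = (pages + kNumStacks - 1) / kNumStacks;
  return static_cast<int>((byte_offset / kPageSize) / pages_per_stack);
}

int main() {
  // An allocation containing three pages: two pages on stack-0, one on stack-1.
  const std::size_t three_pages = 3 * kPageSize;
  for (int p = 0; p < 3; ++p)
    std::printf("page %d -> stack-%d\n", p, stack_of(p * kPageSize, three_pages));
  // An allocation smaller than or equal to the page size is resident on stack-0 only.
  std::printf("single-page allocation -> stack-%d\n", stack_of(0, 256));
  return 0;
}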
Static Partitioning
Scheduling of work-groups to stacks is deterministic and is referred to as static partitioning. The partitioning follows a simple rule: the slowest moving dimension is divided into number-of-stacks chunks that are distributed round-robin between the stacks. Consider a 1-dimensional kernel launch on the root-device:
OpenMP:
#pragma omp target teams distribute parallel for simd
for (int i = 0; i < N; ++i) {
  // ...
}
SYCL:
q.parallel_for(N, [=](auto i) {
  // ...
});
Since there is only a single dimension, it is automatically the slowest moving dimension and is partitioned between stacks by the driver. For a 2-stack root-device, iterations 0 to N/2-1 are scheduled to stack-0, and the remaining iterations, N/2 to N-1, are executed on stack-1.
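The same split can be modeled on the host. The sketch below maps a 1D iteration index to the stack assumed to execute it on a 2-stack root-device; the helper name and the choice of assigning any remainder to stack-0 are assumptions for illustration, and the actual split happens at work-group granularity.
// Sketch only: models the contiguous split of the 1-D iteration space.
#include <cstdio>

constexpr int kNumStacks = 2; // 2-stack root-device

// Returns the stack assumed to execute iteration i of a launch of size N,
// with the slowest (only) dimension split into contiguous chunks and any
// remainder assigned to stack-0.
int stack_of_iteration(int i, int N) {
  int chunk = (N + kNumStacks - 1) / kNumStacks;
  return i / chunk;
}

int main() {
  const int N = 1024;
  std::printf("i = 0       -> stack-%d\n", stack_of_iteration(0, N));         // stack-0
  std::printf("i = N/2 - 1 -> stack-%d\n", stack_of_iteration(N / 2 - 1, N)); // stack-0
  std::printf("i = N/2     -> stack-%d\n", stack_of_iteration(N / 2, N));     // stack-1
  std::printf("i = N - 1   -> stack-%d\n", stack_of_iteration(N - 1, N));     // stack-1
  return 0;
}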
For OpenMP, the slowest moving dimension is the outermost loop when the collapse clause is used. For SYCL, the slowest moving dimension is the first element of the global range. For example, consider this 3D kernel launch:
OpenMP:
#pragma omp target teams distribute parallel for simd collapse(3)
for (int z = 0; z < nz; ++z) {
  for (int y = 0; y < ny; ++y) {
    for (int x = 0; x < nx; ++x) {
      // ...
    }
  }
}
SYCL:
range<3> global{nz, ny, nx};
range<3> local{1, 1, 16};
cgh.parallel_for(nd_range<3>(global, local), [=](nd_item<3> item) {
  // ...
});
The slowest dimension is z and is partitioned between stacks, i.e., for a 2-stack root-device, all iterations from z=0 to z=nz/2-1 are executed on stack-0, and the remaining iterations, from z=nz/2 to z=nz-1, are scheduled to stack-1.
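A minimal sketch of the resulting z-slices, assuming a local z size of 1 (so the number of work-groups in z equals nz) and that nz divides evenly between the two stacks:
// Sketch only: prints the z-slice each stack executes for the 3-D launch above.
#include <cstdio>

int main() {
  const int nz = 512, num_stacks = 2;
  const int chunk = nz / num_stacks; // assumes nz is divisible by num_stacks
  for (int s = 0; s < num_stacks; ++s)
    std::printf("stack-%d executes z in [%d, %d)\n", s, s * chunk, (s + 1) * chunk);
  return 0;
}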
In case the slowest moving dimension cannot be divided evenly between stacks and the remainder creates a load imbalance larger than 5%, the driver partitions the next dimension if that leads to less load imbalance. This impacts only kernels with an odd dimension of 19 or smaller. Examples for different kernel launches are shown in the table below (assuming a local range of {1,1,16}):
| nz  | ny  | nx  | Partitioned Dimension |
|-----|-----|-----|-----------------------|
| 512 | 512 | 512 | z |
| 21  | 512 | 512 | z |
| 19  | 512 | 512 | y |
| 18  | 512 | 512 | z |
| 19  | 19  | 512 | x |
In case of a multi-dimensional local range in SYCL, the partitioned dimension can change. For example, for global range {38,512,512} with local range {2,1,8} the driver would partition the y-dimension, while for local range {1,1,16} it would partition the z-dimension. OpenMP can only have a 1-dimensional local range, which is created from the innermost loop and thus does not impact the static partitioning heuristics. OpenMP kernels created with a collapse level larger than 3 correspond to a 1-dimensional kernel with all for loops linearized. The linearized loop is partitioned following the 1D kernel launch heuristics.
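The dimension-selection heuristic and the effect of the local range can be modeled with a short host-side sketch. The function below is hypothetical rather than a driver API; it assumes 2 stacks, a 5% threshold, and defines the imbalance of a dimension as the difference between the largest and smallest per-stack chunk divided by the total number of work-groups in that dimension. Under these assumptions it reproduces the table above and the {38,512,512} example.
// Sketch only: models the dimension-selection heuristic described above.
#include <array>
#include <cstdio>

constexpr int    kNumStacks = 2;    // 2-stack root-device
constexpr double kThreshold = 0.05; // 5% imbalance threshold

// Assumed imbalance metric: (largest chunk - smallest chunk) / work-group count.
double imbalance(int wgs) {
  int largest  = (wgs + kNumStacks - 1) / kNumStacks;
  int smallest = wgs / kNumStacks;
  return static_cast<double>(largest - smallest) / static_cast<double>(wgs);
}

// Ranges are ordered slowest to fastest, {z, y, x}. Returns 0 for z, 1 for y, 2 for x.
int pick_partitioned_dim(const std::array<int, 3>& global, const std::array<int, 3>& local) {
  std::array<int, 3> wgs;
  for (int d = 0; d < 3; ++d) wgs[d] = global[d] / local[d]; // work-groups per dimension
  int best = 0; // start with the slowest moving dimension
  for (int d = 1; d < 3; ++d)
    // Move to a faster dimension only if the current choice exceeds the
    // threshold and the faster dimension is strictly less imbalanced.
    if (imbalance(wgs[best]) > kThreshold && imbalance(wgs[d]) < imbalance(wgs[best]))
      best = d;
  return best;
}

int main() {
  const char dims[] = {'z', 'y', 'x'};
  const std::array<std::array<int, 3>, 5> table = {{
      {512, 512, 512}, {21, 512, 512}, {19, 512, 512}, {18, 512, 512}, {19, 19, 512}}};
  for (const auto& g : table) // reproduces the table above with local range {1,1,16}
    std::printf("{%d,%d,%d} -> %c\n", g[0], g[1], g[2], dims[pick_partitioned_dim(g, {1, 1, 16})]);
  // The local range changes the work-group counts and can change the partitioned dimension:
  std::printf("{38,512,512}, local {2,1,8}  -> %c\n", dims[pick_partitioned_dim({38, 512, 512}, {2, 1, 8})]);
  std::printf("{38,512,512}, local {1,1,16} -> %c\n", dims[pick_partitioned_dim({38, 512, 512}, {1, 1, 16})]);
  return 0;
}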
Note:
Static partitioning happens at work-group granularity.
This implies that all work-items in a work-group are scheduled to the same stack.
A kernel with a single work-group is resident on stack-0 only.