Work Scheduling and Memory Distribution
The root device driver uses deterministic heuristics to distribute work-groups and memory pages to all stacks when implicit scaling is used. These heuristics are described in the next two sections.
Memory Coloring
Any allocation in SYCL/OpenMP that corresponds to a shared or device allocation is colored across all stacks, meaning that the allocation is divided into number-of-stacks chunks that are distributed round-robin between the stacks. Consider this root device allocation:
OpenMP:
int *a = (int *)omp_target_alloc(sizeof(int) * N, device_id);
SYCL:
int *a = sycl::malloc_device<int>(N, q);
For a 2-stack root device, the first half (elements a[0] to a[N/2-1]) is physically allocated on stack 0. The remaining half (elements a[N/2] to a[N-1]) is located on stack 1. In the future, we will introduce memory allocation APIs that allow user-defined memory coloring.
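As an illustrative sketch only (a hypothetical helper, not a driver API; it ignores the page granularity described in the note below), the stack holding element a[i] of this allocation on a 2-stack root device can be computed as:
int stack_for_element(size_t i, size_t N)  // hypothetical, for illustration
{
    return (i < N / 2) ? 0 : 1;  // first half on stack 0, second half on stack 1
}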
Note:
The memory coloring described above is applied at page-size granularity. An allocation containing three pages has two pages resident on stack 0 and one page resident on stack 1.
Allocations smaller than or equal to the page size are resident on stack 0 only.
Using a memory pool that is based on a single allocation breaks the memory coloring logic. It is recommended that applications create one allocation per object so that each object's data can be distributed among all stacks, as in the sketch below.
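For example (a hedged sketch; the pool, a, and b names are illustrative, not an API), sub-allocating two objects from a single pooled allocation pins each object to one half of the pool, while separate allocations let each object be colored across both stacks:
SYCL:
// Pooled: each sub-allocation occupies one contiguous half of the pool,
// so b_pool's data may be resident on stack 1 only.
int *pool = sycl::malloc_device<int>(2 * N, q);
int *a_pool = pool;      // first half of the pool
int *b_pool = pool + N;  // second half of the pool
// Per-object: each allocation is individually colored across both stacks.
int *a = sycl::malloc_device<int>(N, q);
int *b = sycl::malloc_device<int>(N, q);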
Static Partitioning
Scheduling of work-groups to stacks is deterministic and referred to as static partitioning. The partitioning follows a simple rule: the slowest moving dimension is divided into number-of-stacks chunks that are distributed round-robin between the stacks. Let's look at the following 1-dimensional kernel launch on the root device:
OpenMP:
#pragma omp target teams distribute parallel for simd
for (int i = 0; i < N; ++i)
{
//
}
SYCL:
q.parallel_for(N, [=](auto i) {
//
});
Since there is only a single dimension, it is automatically the slowest moving dimension and is partitioned between stacks by the driver. For a 2-stack root device, iterations 0 to N/2-1 are executed on stack 0. The remaining iterations, N/2 to N-1, are executed on stack 1.
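The same chunking can be written down explicitly (a hypothetical helper for illustration, not a driver interface; the real partitioning happens at work-group granularity, as noted at the end of this section):
int stack_for_iteration(int i, int N, int num_stacks)
{
    int chunk = (N + num_stacks - 1) / num_stacks;  // contiguous iterations per stack
    return i / chunk;  // with 2 stacks: 0..N/2-1 -> stack 0, N/2..N-1 -> stack 1
}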
For OpenMP, the slowest moving dimension is the outermost loop when the collapse clause is used. For SYCL, the slowest moving dimension is the first element of global range. For example, consider this 3D kernel launch:
OpenMP:
#pragma omp target teams distribute parallel for simd collapse(3)
for (int z = 0; z < nz; ++z)
{
for (int y = 0; y < ny; ++y)
{
for (int x = 0; x < nx; ++x)
{
//
}
}
}
SYCL:
range<3> global{nz, ny, nx};
range<3> local{1, 1, 16};
cgh.parallel_for(nd_range<3>(global, local), [=](nd_item<3> item) {
//
});
The slowest moving dimension is z and is partitioned between the stacks. That is, for a 2-stack root device, all iterations from z = 0 to z = nz/2 - 1 are executed on stack 0. The remaining iterations from z = nz/2 to z = nz-1 are executed on stack 1.
If the slowest moving dimension cannot be divided evenly between the stacks and the resulting load imbalance is larger than 5%, the driver partitions the next dimension instead, provided that leads to less load imbalance. This only impacts kernels with odd dimensions of size 19 or smaller. Examples for different kernel launches can be seen in the table below (assuming local range {1,1,16}):
nz  | ny  | nx  | Partitioned Dimension
----|-----|-----|----------------------
512 | 512 | 512 | z
21  | 512 | 512 | z
19  | 512 | 512 | y
18  | 512 | 512 | z
19  | 19  | 512 | x
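This heuristic can be approximated as follows (a sketch, not the actual driver implementation; it assumes a 2-stack root device and takes per-dimension work-group counts, e.g. {nz, ny, nx/16} for local range {1,1,16}):
// Relative load imbalance when n work-groups are split between 2 stacks,
// e.g. n = 21 -> 11/10.5 - 1 ~ 4.8%, n = 19 -> 10/9.5 - 1 ~ 5.3%.
double imbalance(int n)
{
    int max_chunk = (n + 1) / 2;         // larger of the two chunks
    return max_chunk / (n / 2.0) - 1.0;  // excess over a perfect split
}

// Returns the partitioned dimension (0 = z, 1 = y, 2 = x): keep the slowest
// moving dimension unless its imbalance exceeds 5% and a faster moving
// dimension would balance better.
int partitioned_dim(const int wgs[3])
{
    int dim = 0;
    for (int d = 1; d < 3; ++d)
        if (imbalance(wgs[dim]) > 0.05 && imbalance(wgs[d]) < imbalance(wgs[dim]))
            dim = d;
    return dim;
}
For the table rows above, this sketch selects z for {512,512,512} and {21,512,512} (the imbalance at 21 work-groups is only about 4.8%), y for {19,512,512}, z for {18,512,512}, and x for work-group counts {19,19,32} (both z and y sit at about 5.3% imbalance).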
In the case of a multi-dimensional local range in SYCL, the partitioned dimension can change. For example, for global range {38,512,512} with local range {2,1,8}, the driver would partition the y-dimension, while for local range {1,1,16} it would partition the z-dimension.
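Computing the per-dimension work-group counts makes this concrete (a sketch; it assumes each global size is divisible by the corresponding local size):
SYCL:
sycl::range<3> wg_counts(sycl::range<3> global, sycl::range<3> local)
{
    return {global[0] / local[0], global[1] / local[1], global[2] / local[2]};
}
// {38,512,512} with local {2,1,8}  -> {19,512,64} work-groups: 19 is odd with
//                                     >5% imbalance, so y is partitioned.
// {38,512,512} with local {1,1,16} -> {38,512,32} work-groups: 38 splits
//                                     evenly, so z is partitioned.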
OpenMP can only have a 1-dimensional local range, which is created from the innermost loop, and thus the local range does not impact the static partitioning heuristics. OpenMP kernels created with a collapse level larger than 3 correspond to a 1-dimensional kernel with all for loops linearized. The linearized loop is partitioned following the 1D kernel launch heuristics.
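For example (a sketch of the collapse level larger than 3 case; the loop bounds are illustrative):
OpenMP:
// The four loops are linearized into a single iteration space of
// nw*nz*ny*nx iterations, which is partitioned like a 1D kernel launch.
#pragma omp target teams distribute parallel for simd collapse(4)
for (int w = 0; w < nw; ++w)
{
    for (int z = 0; z < nz; ++z)
    {
        for (int y = 0; y < ny; ++y)
        {
            for (int x = 0; x < nx; ++x)
            {
                //
            }
        }
    }
}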
Notes:
Static partitioning happens at work-group granularity. This implies that all work-items in a work-group are scheduled to the same stack.
A kernel launched with a single work-group runs on stack 0 only.