Work Scheduling and Memory Distribution
The root-device driver uses deterministic heuristics to distribute work-groups and memory pages to all stacks when implicit scaling is used. These heuristics are described in the next two sections.
Memory Coloring
Any SYCL or OpenMP shared or device allocation is colored across all stacks: the allocation is divided into as many chunks as there are stacks, and the chunks are distributed round-robin between the stacks. Consider this root-device allocation:
OpenMP:
int *a = (int*)omp_target_alloc( sizeof(int)*N, device_id );
SYCL:
int *a = sycl::malloc_device<int>(N, q);
For a 2-stack root-device, the first half (elements a[0] to a[N/2-1]) is physically allocated on stack-0, and the remaining half (elements a[N/2] to a[N-1]) is located on stack-1. In the future, we will introduce memory allocation APIs that allow user-defined memory coloring.
Note:
The memory coloring described above is applied at page-size granularity.
For a 2-stack root-device, an allocation containing three pages has two pages resident on stack-0 and one page resident on stack-1.
Allocations smaller than or equal to the page size are resident on stack-0 only.
Using a memory pool that is based on a single allocation breaks the memory coloring logic. It is recommended that applications create one allocation per object so that each object's data is distributed across all stacks.
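As an illustration of this page-granular coloring, the following standalone C++ sketch models which stack a given byte offset of an allocation lands on for a 2-stack root-device. The page-size constant and the helper name stack_of_offset are assumptions made for the example only; they are not a driver API.
#include <cstddef>
#include <cstdio>

// Illustrative model only: the allocation is split at page granularity into
// kNumStacks contiguous chunks, and chunk i is resident on stack i.
constexpr std::size_t kPageSize  = 65536; // assumed page size for this example
constexpr std::size_t kNumStacks = 2;     // 2-stack root-device

std::size_t stack_of_offset(std::size_t allocBytes, std::size_t byteOffset) {
    std::size_t pages         = (allocBytes + kPageSize - 1) / kPageSize;
    std::size_t pagesPerStack = (pages + kNumStacks - 1) / kNumStacks;
    std::size_t stack         = (byteOffset / kPageSize) / pagesPerStack;
    return stack < kNumStacks ? stack : kNumStacks - 1;
}

int main() {
    const std::size_t N     = 1 << 20;      // number of ints in the allocation
    const std::size_t bytes = N * sizeof(int);
    std::printf("a[0]   -> stack-%zu\n", stack_of_offset(bytes, 0));
    std::printf("a[N-1] -> stack-%zu\n", stack_of_offset(bytes, (N - 1) * sizeof(int)));
    // A three-page allocation: two pages on stack-0, the last page on stack-1.
    std::printf("3-page allocation, last byte -> stack-%zu\n",
                stack_of_offset(3 * kPageSize, 3 * kPageSize - 1));
    return 0;
}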
Static Partitioning
Scheduling of work-groups to stacks is deterministic and is referred to as static partitioning. The partitioning follows a simple rule: the slowest-moving dimension is divided into as many chunks as there are stacks, and the chunks are distributed round-robin between the stacks. Consider a 1-dimensional kernel launch on the root-device:
OpenMP:
#pragma omp target teams distribute parallel for simd
for (int i = 0; i < N; ++i) {
  //
}
SYCL:
q.parallel_for(N, [=](auto i) {
  //
});
Since there is only a single dimension, it is automatically the slowest-moving dimension and is partitioned between stacks by the driver. For a 2-stack root-device, iterations 0 to N/2-1 are scheduled to stack-0, and the remaining iterations N/2 to N-1 are executed on stack-1.
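Putting the allocation and the 1-D launch together, a minimal SYCL sketch could look as follows. The queue setup and kernel body are illustrative assumptions; the point is that, with the coloring and partitioning rules above, the first half of the data and the first half of the iterations both land on stack-0 of a 2-stack root-device, so most accesses stay stack-local.
#include <sycl/sycl.hpp>

int main() {
    constexpr size_t N = 1 << 24;

    // Queue on the root-device; with implicit scaling the driver distributes
    // work-groups and memory pages across all stacks of this device.
    sycl::queue q{sycl::gpu_selector_v};

    // Device allocation: colored across stacks, first half on stack-0,
    // second half on stack-1 (for a 2-stack root-device).
    int *a = sycl::malloc_device<int>(N, q);

    // 1-D launch: iterations 0..N/2-1 run on stack-0, N/2..N-1 on stack-1,
    // so each stack mostly touches its locally resident half of 'a'.
    q.parallel_for(N, [=](sycl::id<1> i) {
        a[i[0]] = static_cast<int>(i[0]);
    }).wait();

    sycl::free(a, q);
    return 0;
}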
For OpenMP, the slowest-moving dimension is the outermost loop when the collapse clause is used. For SYCL, the slowest-moving dimension is the first element of the global range. For example, consider this 3D kernel launch:
OpenMP:
#pragma omp target teams distribute parallel for simd collapse(3)
for (int z = 0; z < nz; ++z) {
  for (int y = 0; y < ny; ++y) {
    for (int x = 0; x < nx; ++x) {
      //
    }
  }
}
SYCL:
range<3> global{nz, ny, nx};
range<3> local{1, 1, 16};
cgh.parallel_for(nd_range<3>(global, local), [=](nd_item<3> item) {
  //
});
The slowest-moving dimension is z, and it is partitioned between stacks, i.e., for a 2-stack root-device, all iterations from z=0 to z=nz/2-1 are executed on stack-0, and the remaining iterations from z=nz/2 to z=nz-1 are scheduled to stack-1.
If the slowest-moving dimension cannot be divided evenly between stacks and the remainder creates a load imbalance larger than 5%, the driver partitions the next dimension instead, provided that doing so leads to less load imbalance. This only affects kernels with odd dimensions of 19 or smaller. Examples for different kernel launches are shown in the table below (assuming a local range of {1,1,16}); a sketch of this selection heuristic follows the table.
nz | ny | nx | Partitioned Dimension
---|---|---|---
512 | 512 | 512 | z
21 | 512 | 512 | z
19 | 512 | 512 | y
18 | 512 | 512 | z
19 | 19 | 512 | x
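The following standalone C++ sketch models the dimension-selection heuristic described above and reproduces the table entries for a 2-stack root-device. It is an illustration of the documented behavior, not the actual driver implementation; the 5% threshold and the work-group-count model (global range divided by local range) follow the description above.
#include <array>
#include <cstdio>

// Illustrative model of the static-partitioning heuristic, not driver code.
// Assumes a 2-stack root-device and work-group granularity.
constexpr int kNumStacks = 2;
constexpr double kImbalanceThreshold = 0.05;  // 5%

// Relative load imbalance when 'workGroups' are split across kNumStacks.
double imbalance(long workGroups) {
    long perStackCeil = (workGroups + kNumStacks - 1) / kNumStacks;
    double mean = static_cast<double>(workGroups) / kNumStacks;
    return (perStackCeil - mean) / mean;
}

// Returns the partitioned dimension: 0 = z (slowest), 1 = y, 2 = x.
// Walks from the slowest- to the fastest-moving dimension and picks the first
// one whose work-group count splits within the 5% threshold; if none
// qualifies, it keeps the slowest-moving dimension.
int partitioned_dim(std::array<long, 3> global, std::array<long, 3> local) {
    for (int d = 0; d < 3; ++d) {
        long workGroups = global[d] / local[d];
        if (imbalance(workGroups) <= kImbalanceThreshold)
            return d;
    }
    return 0;
}

int main() {
    const char *name = "zyx";
    std::array<std::array<long, 3>, 5> cases = {{
        {512, 512, 512}, {21, 512, 512}, {19, 512, 512}, {18, 512, 512}, {19, 19, 512}
    }};
    for (auto &g : cases)
        std::printf("global {%ld,%ld,%ld} -> partition %c\n",
                    g[0], g[1], g[2], name[partitioned_dim(g, {1, 1, 16})]);
    return 0;
}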
In the case of a multi-dimensional local range in SYCL, the partitioned dimension can change. For example, for a global range {38,512,512} with a local range {2,1,8}, the driver partitions the y-dimension, whereas with a local range {1,1,16} it partitions the z-dimension. OpenMP can only have a 1-dimensional local range, which is created from the innermost loop, and therefore does not affect the static-partitioning heuristics. OpenMP kernels created with a collapse level larger than 3 correspond to a 1-dimensional kernel with all for loops linearized; the linearized loop is partitioned following the 1D kernel-launch heuristics.
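To make the effect of the local range concrete, here is a hedged SYCL sketch of the two launch configurations from the example above. The queue setup and empty kernel bodies are assumptions for illustration; the work-group counts in the comments follow from dividing the global range by the local range, and the partitioned dimension follows the heuristic described above.
#include <sycl/sycl.hpp>
using namespace sycl;

int main() {
    queue q{gpu_selector_v};
    range<3> global{38, 512, 512};

    // Local range {2,1,8}: work-group counts per dimension are {19, 512, 64}.
    // Splitting the 19 z work-groups across 2 stacks exceeds the 5% imbalance
    // threshold, so the y-dimension is partitioned instead.
    q.submit([&](handler &cgh) {
        cgh.parallel_for(nd_range<3>(global, range<3>{2, 1, 8}),
                         [=](nd_item<3> item) { /* ... */ });
    });

    // Local range {1,1,16}: work-group counts per dimension are {38, 512, 32}.
    // z splits evenly (19 + 19 work-groups), so the z-dimension is partitioned.
    q.submit([&](handler &cgh) {
        cgh.parallel_for(nd_range<3>(global, range<3>{1, 1, 16}),
                         [=](nd_item<3> item) { /* ... */ });
    });

    q.wait();
    return 0;
}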
Note:
Static partitioning happens at work-group granularity.
This implies that all work-items in a work-group are scheduled to the same stack.
A kernel with a single work-group is resident on stack-0 only.