Programming Principles
To achieve good performance with implicit scaling, cross-stack memory accesses must be minimized, but it is not necessary to eliminate them entirely. A certain amount of cross-stack traffic can be handled by the stack-to-stack interconnect if it is performed concurrently with local memory accesses. For memory-bandwidth-bound workloads, the amount of acceptable cross-stack access is determined by the ratio of local memory bandwidth to cross-stack bandwidth (see Cross-Stack Traffic).
The following principles should be embraced by workloads that use implicit scaling:
Kernel must have enough work-items to utilize both stacks.
The minimal number of work-items needed to utilize both stacks is: <number of VEs> * <hardware-threads per VE> * <SIMD width>.
A 2-stack Intel® Data Center GPU Max with 1024 VEs (8 hardware threads per VE) and SIMD32 therefore requires at least 1024 * 8 * 32 = 262,144 work-items.
Device time must dominate runtime to observe whole application scaling.
Minimize cross-stack memory accesses by exploiting locality in the algorithm.
The slowest moving dimension should be large to avoid stack load imbalance.
Cross-stack memory accesses and local memory accesses should be interleaved.
Avoid stride-1 memory access in the slowest moving dimension for 2D and 3D kernel launches.
If the memory access pattern changes dynamically over time, perform a sorting step every nth iteration to minimize cross-stack memory accesses.
Don’t use a memory pool based on a single allocation (see Memory Coloring); a sketch illustrating this point follows the list.
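The fragment below is a minimal sketch of that last point, assuming OpenMP device allocations via the standard omp_target_alloc/omp_target_free routines; the allocation sizes and example loops are illustrative only. The intent is to give every array its own allocation instead of carving sub-arrays out of one large pool, so that each allocation can be distributed across both stacks independently.

// Sketch only: per-array device allocations instead of one pooled allocation.
#include <omp.h>
#include <cstddef>

int main() {
    const std::size_t n = std::size_t(1) << 26;   // illustrative: 64M doubles per array
    const int dev = omp_get_default_device();

    // Preferred: one device allocation per array. With implicit scaling, each
    // allocation can then be distributed ("colored") across both stacks.
    double *a = static_cast<double*>(omp_target_alloc(n * sizeof(double), dev));
    double *b = static_cast<double*>(omp_target_alloc(n * sizeof(double), dev));
    double *c = static_cast<double*>(omp_target_alloc(n * sizeof(double), dev));

    // Avoid: one pooled allocation partitioned by pointer arithmetic, e.g.
    //   double *pool = (double*)omp_target_alloc(3 * n * sizeof(double), dev);
    //   a = pool; b = pool + n; c = pool + 2 * n;
    // Sub-arrays carved out of a single allocation tend to end up resident
    // mostly on one stack, concentrating cross-stack traffic (see Memory Coloring).

    #pragma omp target teams distribute parallel for simd is_device_ptr(a, b)
    for (std::size_t i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }

    #pragma omp target teams distribute parallel for simd is_device_ptr(a, b, c)
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];

    omp_target_free(a, dev);
    omp_target_free(b, dev);
    omp_target_free(c, dev);
    return 0;
}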
Many applications naturally have a concept of locality. These applications are expected to be a good fit for implicit scaling due to low cross-stack traffic. To illustrate this concept, let’s use a stencil kernel as an example. A stencil operates on a grid that can be divided into blocks, where the majority of stencil computations within a block use stack-local data. Only stencil operations at the border of a block require data from another block, i.e. from the other stack. The amount of these cross-stack/cross-border accesses is suppressed by the halo-to-local-volume ratio. This concept is illustrated by the sketch below:
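The sketch below is a minimal 3D 7-point stencil written in the same OpenMP offload style as the STREAM example later in this section. Assuming implicit scaling partitions the outermost (slowest moving) z loop across the two stacks and the arrays are colored accordingly, only iterations adjacent to the split plane read a z-neighbor plane from the other stack; the grid size and stencil coefficients are illustrative.

// Sketch only: 3D 7-point stencil. The z dimension (slowest moving) is split
// across the two stacks, so only iterations next to the split plane read one
// plane of data from the other stack; all other reads are stack-local.
#include <cstddef>
#include <vector>

int main() {
    const int nx = 256, ny = 256, nz = 256;            // illustrative grid size
    const std::size_t n = std::size_t(nx) * ny * nz;
    std::vector<double> in(n, 1.0), out(n, 0.0);
    double *pin = in.data(), *pout = out.data();

    #pragma omp target enter data map(to: pin[0:n]) map(alloc: pout[0:n])

    // Cross-stack reads per stack are roughly one z-plane (nx*ny points)
    // against ~nz/2 locally resident planes, i.e. the halo-to-local-volume
    // ratio shrinks as the grid grows along z.
    #pragma omp target teams distribute parallel for simd collapse(3)
    for (int z = 1; z < nz - 1; ++z)
        for (int y = 1; y < ny - 1; ++y)
            for (int x = 1; x < nx - 1; ++x) {
                const std::size_t i = (std::size_t(z) * ny + y) * nx + x;
                pout[i] = 0.4 * pin[i]
                        + 0.1 * (pin[i - 1] + pin[i + 1]             // x neighbors: local
                               + pin[i - nx] + pin[i + nx]           // y neighbors: local
                               + pin[i - std::size_t(nx) * ny]       // z neighbors: cross-stack
                               + pin[i + std::size_t(nx) * ny]);     //   only near the split plane
            }

    #pragma omp target exit data map(from: pout[0:n]) map(delete: pin[0:n])
    return 0;
}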
Cross-Stack Traffic
As mentioned in the last section, it is crucial to minimize cross-stack traffic. To gauge how much traffic can be tolerated without significantly impacting application performance, we can benchmark a STREAM kernel with a varying amount of cross-stack traffic and compare it to stack-local STREAM performance. The worst case is 100% cross-stack traffic, which is generated by reversing the loop order in the STREAM kernel (see STREAM):
#pragma omp target teams distribute parallel for simd
for (int i = N - 1; i >= 0; --i) {
    c[i] = a[i] + b[i];
}
Here, each stack generates 100% cross-stack memory traffic, as work-groups on stack 0 access array elements N-1 down to N/2, which are located in stack-1 memory. This kernel essentially benchmarks the bi-directional stack-to-stack bandwidth. The approach can be generalized to interpolate between 0% and 100% cross-stack accesses with the modified STREAM below:
// Code for cross-stack stream
#include <cstdint>   // int64_t
#include <cstdlib>   // malloc, exit
#include <iostream>
#include <omp.h>

// compile via:
// icpx -O2 -fiopenmp -fopenmp-targets=spir64 ./stream_cross_stack.cpp
// run via:
// EnableWalkerPartition=1 ZE_AFFINITY_MASK=0 ./a.out

template <int cross_stack_fraction>
void cross_stack_stream() {

    constexpr int64_t size  = 256*1e6;
    constexpr int64_t bytes = size * sizeof(int64_t);

    int64_t *a = static_cast<int64_t*>(malloc( bytes ));
    int64_t *b = static_cast<int64_t*>(malloc( bytes ));
    int64_t *c = static_cast<int64_t*>(malloc( bytes ));

    #pragma omp target enter data map( alloc:a[0:size] )
    #pragma omp target enter data map( alloc:b[0:size] )
    #pragma omp target enter data map( alloc:c[0:size] )

    for ( int i = 0; i < size; ++i ) {
        a[i] = i + 1;
        b[i] = i - 1;
        c[i] = 0;
    }

    #pragma omp target update to( a[0:size] )
    #pragma omp target update to( b[0:size] )
    #pragma omp target update to( c[0:size] )

    const int num_max_rep = 100;
    double time;

    for ( int irep = 0; irep < num_max_rep+10; ++irep ) {
        if ( irep == 10 ) time = omp_get_wtime();

        #pragma omp target teams distribute parallel for simd
        for ( int j = 0; j < size; ++j ) {
            const int cache_line_id = j / 16;

            int i;
            if ( (cache_line_id%cross_stack_fraction) == 0 ) {
                i = (j+size/2)%size;
            }
            else {
                i = j;
            }

            c[i] = a[i] + b[i];
        }
    }
    time = omp_get_wtime() - time;
    time = time/num_max_rep;

    #pragma omp target update from( c[0:size] )

    for ( int i = 0; i < size; ++i ) {
        if ( c[i] != 2*i ) {
            std::cout << "wrong results!" << std::endl;
            exit(1);
        }
    }

    const int64_t streamed_bytes = 3 * size * sizeof(int64_t);

    std::cout << "cross_stack_percent = " << (1/(double)cross_stack_fraction)*100
              << "%, bandwidth = " << (streamed_bytes/time) * 1E-9 << " GB/s" << std::endl;
}

int main() {

    cross_stack_stream< 1>();
    cross_stack_stream< 2>();
    cross_stack_stream< 4>();
    cross_stack_stream< 8>();
    cross_stack_stream<16>();
    cross_stack_stream<32>();
}
The offload kernel inside the repetition loop accesses every cross_stack_fraction'th cache line cross-stack by offsetting the array access with (j+size/2)%size. For cross_stack_fraction==1, we generate 100% cross-stack memory accesses. Doubling cross_stack_fraction decreases the cross-stack traffic by a factor of 2. Note that this kernel is written such that cross-stack and local memory accesses are interleaved within work-groups to maximize hardware utilization. Measured performance on a 2-stack Intel® Data Center GPU Max with a 2 GB array size is shown below:
| Partial cross-stack STREAM bandwidth [GB/s] | cross_stack_fraction | % of cross-stack accesses | % of max local 2-stack STREAM bandwidth |
|---|---|---|---|
| 355 | 1 | 100% | 17% |
| 696 | 2 | 50% | 33% |
| 1223 | 4 | 25% | 58% |
| 1450 | 8 | 12.5% | 69% |
| 1848 | 16 | 6.25% | 87% |
| 2108 | 32 | 3.125% | 99% |
As can be seen in the table above, applications should try to limit cross-stack traffic to less than about 10% of all memory traffic to avoid a significant drop in sustained memory bandwidth. For STREAM with 12.5% cross-stack accesses, we measure about 69% of the bandwidth of a stack-local STREAM benchmark. These numbers can be used to estimate the impact of cross-stack memory accesses on the execution time of application kernels.
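As a rough worked example of such an estimate, assuming a kernel that is purely memory-bandwidth bound: if about 12.5% of its memory traffic is cross-stack, the table suggests it sustains roughly 69% of the local 2-stack STREAM bandwidth, so its execution time can be expected to increase by a factor of about 1/0.69 ≈ 1.45 compared to the fully stack-local case. Kernels that are not bandwidth bound will see a proportionally smaller impact.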