DPCT1110
Message
The total declared local variable size in device function <function name> exceeds 128 bytes and may cause high register pressure. Consult with your hardware vendor to find the total register size available and adjust the code, or use smaller sub-group size to avoid high register pressure.
Detailed Help
On some hardware, the number of registers available to each work-item is limited. For instance, on the Intel Xe-LP GPU architecture, each hardware thread has 4 KB of registers, so with a sub-group size of 32 each work-item can use 128 bytes of registers (4 KB / 32). If the local variables declared in a device function total more than 128 bytes, some of them may be placed in local or global memory, which can reduce performance when they are accessed frequently. To address this, either decrease the sub-group size so that more registers are available to each work-item, or follow the recommendations in the Optimizing Register Spills section of the oneAPI GPU Optimization Guide. For other hardware, consult your hardware vendor for the register configuration.
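The arithmetic behind the 128-byte threshold can be written out as a minimal sketch. It assumes the register file of a hardware thread is shared evenly among the work-items of a sub-group, and uses the 4 KB Xe-LP figure quoted above. The names `bytes_per_work_item` and `kXeLpThreadRegisterBytes` are hypothetical helpers for illustration, not part of any SYCL or DPCT API.

```cpp
#include <cstddef>

// Hypothetical helper: register bytes available to one work-item when a
// hardware thread's register file is split evenly across a sub-group.
constexpr std::size_t bytes_per_work_item(std::size_t thread_register_bytes,
                                          std::size_t sub_group_size) {
    return thread_register_bytes / sub_group_size;
}

// 4 KB per hardware thread is the Xe-LP figure from the text above;
// other hardware will have different values.
constexpr std::size_t kXeLpThreadRegisterBytes = 4 * 1024;

// Sub-group size 32 yields the 128-byte threshold named in the message.
static_assert(bytes_per_work_item(kXeLpThreadRegisterBytes, 32) == 128,
              "4 KB / 32 work-items = 128 bytes each");
// Halving the sub-group size doubles the per-work-item share.
static_assert(bytes_per_work_item(kXeLpThreadRegisterBytes, 16) == 256,
              "4 KB / 16 work-items = 256 bytes each");
```

The result is only an upper bound on what declared variables can occupy, because the compiler also needs registers for temporaries and addresses.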
Suggestions to Fix
For example, this original CUDA* code:
__global__ void Kernel(){
int result[50];
...
}
int main() {
...
Kernel<<<1, 100>>>();
}
results in the following migrated SYCL* code:
/*
DPCT1110:0: The total declared local variable size in device function "Kernel" exceeds 128 bytes and may cause high register pressure. Consult with your hardware vendor to find the total register size available and adjust the code or use smaller sub-group size to avoid high register pressure.
*/
void Kernel(){
int result[50];
...
}
int main() {
...
q.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, 100), sycl::range<3>(1, 1, 100)),
    [=](sycl::nd_item<3> item) [[intel::reqd_sub_group_size(32)]] { Kernel(); });
}
which is rewritten to:
void Kernel(){
int result[50];
...
}
int main() {
...
/*
Reduce the sub-group size to make more registers available to each work-item, which may help to avoid high register pressure.
*/
q.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, 100), sycl::range<3>(1, 1, 100)),
    [=](sycl::nd_item<3> item) [[intel::reqd_sub_group_size(16)]] { Kernel(); });
}
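As a rough check of why a sub-group size of 16 can help in this example, the sketch below compares the declared size of `int result[50]` with an even per-work-item share of the register file. It assumes 4 KB of registers per hardware thread (the Xe-LP figure) and a 4-byte `int`. The helper `fits_in_registers` and the constants are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Assumption: 4-byte int, as on the common CUDA and SYCL targets.
static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");

// Declared local storage of Kernel() in the example: int result[50].
constexpr std::size_t kDeclaredLocalBytes = sizeof(int[50]);  // 200 bytes

// Assumption: 4 KB of registers per hardware thread (Xe-LP figure).
constexpr std::size_t kThreadRegisterBytes = 4 * 1024;

// Hypothetical helper: true when the declared locals fit in an even
// per-work-item share of the hardware thread's register file.
constexpr bool fits_in_registers(std::size_t declared_bytes,
                                 std::size_t sub_group_size) {
    return declared_bytes <= kThreadRegisterBytes / sub_group_size;
}

// 200 bytes exceeds the 128-byte share at sub-group size 32 ...
static_assert(!fits_in_registers(kDeclaredLocalBytes, 32),
              "200 > 128: the warning condition");
// ... but fits within the 256-byte share at sub-group size 16.
static_assert(fits_in_registers(kDeclaredLocalBytes, 16),
              "200 <= 256: within the larger share");
```

Fitting within the declared-size budget does not guarantee that no spills occur, since the compiler allocates additional registers beyond the declared variables. Treat the check as a first estimate and confirm with the compiler's spill reports.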