DPCT1110

Intel® DPC++ Compatibility Tool Developer Guide and Reference

ID 768918

Date 10/31/2024

Public

Message

The total declared local variable size in device function <function name> exceeds 128 bytes and may cause high register pressure. Consult with your hardware vendor to find the total register size available and adjust the code, or use smaller sub-group size to avoid high register pressure.

Detailed Help

On some hardware, the number of registers available to each work-item is limited. For instance, in the Intel Xe-LP GPU architecture, each hardware thread has 4 KB of registers. Consequently, if the sub-group size is 32, each work-item can use at most 128 bytes of registers (4 KB / 32). If the total size of the local variables declared in a device function exceeds 128 bytes, some variables may be stored in local or global memory, which can reduce performance when those variables are accessed frequently. To address this, either decrease the sub-group size so that more registers are available to each work-item (for example, a sub-group size of 16 doubles the budget to 256 bytes), or follow the recommendations in the Optimizing Register Spills section of the oneAPI GPU Optimization Guide. For other hardware, consult your hardware vendor for configuration information.
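The arithmetic above can be checked at compile time. The following is a minimal sketch in plain C++, not part of the tool: the helper names `register_bytes_per_work_item` and `fits_in_registers` are hypothetical, the 4 KB figure is the Xe-LP value quoted above, and the 200-byte size assumes a 50-element array of 4-byte `int`.

```cpp
#include <cstddef>

// Register file per hardware thread on Intel Xe-LP, as quoted in the text (4 KB).
constexpr std::size_t kXeLpRegisterBytesPerThread = 4 * 1024;

// Hypothetical helper: bytes of registers available to each work-item when
// one hardware thread executes a sub-group of the given size.
constexpr std::size_t register_bytes_per_work_item(
    std::size_t register_bytes_per_thread, std::size_t sub_group_size) {
  return register_bytes_per_thread / sub_group_size;
}

// Hypothetical helper: true if the declared local variables fit in that budget.
constexpr bool fits_in_registers(std::size_t declared_bytes,
                                 std::size_t register_bytes_per_thread,
                                 std::size_t sub_group_size) {
  return declared_bytes <=
         register_bytes_per_work_item(register_bytes_per_thread, sub_group_size);
}

// Assumption: a 50-element int array with 4-byte int, i.e. 200 bytes.
constexpr std::size_t kDeclaredBytes = 50 * 4;

// Sub-group size 32 leaves 128 bytes per work-item; 16 leaves 256 bytes.
static_assert(register_bytes_per_work_item(kXeLpRegisterBytesPerThread, 32) == 128,
              "4 KB / 32 work-items");
static_assert(register_bytes_per_work_item(kXeLpRegisterBytesPerThread, 16) == 256,
              "4 KB / 16 work-items");

// 200 bytes of locals exceed the budget at 32 but fit at 16.
static_assert(!fits_in_registers(kDeclaredBytes, kXeLpRegisterBytesPerThread, 32),
              "exceeds 128 bytes");
static_assert(fits_in_registers(kDeclaredBytes, kXeLpRegisterBytesPerThread, 16),
              "fits in 256 bytes");
```

The declared size is only an estimate of register demand. The compiler decides what actually lives in registers, so treat the result as a hint rather than a guarantee.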

Suggestions to Fix

For example, this original CUDA* code:


__global__ void Kernel(){
  int result[50];
  ...
}
int main() {
  ...
  Kernel<<<1, 100>>>();
}

results in the following migrated SYCL* code:


/*
DPCT1110:0: The total declared local variable size in device function "Kernel" exceeds 128 bytes and may cause high register pressure. Consult with your hardware vendor to find the total register size available and adjust the code, or use smaller sub-group size to avoid high register pressure.
*/
void Kernel(){
  int result[50];
  ...
}
int main() {
  ...
  q.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, 100), sycl::range<3>(1, 1, 100)), [=](sycl::nd_item<3> item) [[intel::reqd_sub_group_size(32)]] { Kernel(); });
}

which is rewritten to:


void Kernel(){
  int result[50];
  ...
}

int main() {
  ...
  /*
   Reduce the sub-group size to make more registers available to each work-item, which may help avoid high register pressure.
  */
  q.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, 100), sycl::range<3>(1, 1, 100)), [=](sycl::nd_item<3> item) [[intel::reqd_sub_group_size(16)]] { Kernel(); });
}
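
The choice of 16 in the rewritten code can be reasoned about with a small helper. The sketch below is plain C++ and purely illustrative: `pick_sub_group_size` is a hypothetical function, the candidate list {8, 16, 32} is an assumed set of sub-group sizes, and the 4096-byte register file is the Xe-LP figure from Detailed Help. It returns the largest candidate whose per-work-item register budget covers the declared local variable size.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: returns the largest candidate sub-group size whose
// per-work-item register budget (register file / sub-group size) is at least
// declared_bytes, or 0 if no candidate provides enough registers.
template <std::size_t N>
constexpr std::size_t pick_sub_group_size(
    const std::array<std::size_t, N>& candidates,
    std::size_t register_bytes_per_thread, std::size_t declared_bytes) {
  std::size_t best = 0;
  for (std::size_t i = 0; i < N; ++i) {
    const std::size_t size = candidates[i];
    if (size != 0 && size > best &&
        register_bytes_per_thread / size >= declared_bytes) {
      best = size;
    }
  }
  return best;
}

// Assumption: candidate sub-group sizes; real devices may support a different set.
constexpr std::array<std::size_t, 3> kCandidateSizes{{8, 16, 32}};

// int result[50] with 4-byte int is 200 bytes: 32 gives only 128 bytes per
// work-item, 16 gives 256 bytes, so 16 is the largest size that fits.
static_assert(pick_sub_group_size(kCandidateSizes, 4096, 200) == 16,
              "200 bytes fit at sub-group size 16");
```

On a real device, the candidate list could be taken from the `sycl::info::device::sub_group_sizes` device query instead of being hard-coded. A smaller sub-group size can also lower occupancy, so it is worth measuring performance after the change.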
