Visible to Intel only — GUID: GUID-E343F42B-7A15-47BA-AF9E-3E24EAFBD2D8
Visible to Intel only — GUID: GUID-E343F42B-7A15-47BA-AF9E-3E24EAFBD2D8
DPCT1085
Message
The function <function name> requires sub-group size to be <size>, while other sub-group functions in the same SYCL kernel require a different sub-group size. You may need to adjust the code.
Detailed Help
Each kernel can only be decorated with one sub-group size. This warning is emitted when a kernel requires different sub-group sizes. Check if the sub-group size can be unified into one value, and if it cannot be unified, redesign the code logic.
For example, this original CUDA* code:
__global__ void kernel(int* data1, int* data2) { typedef cub::WarpScan<int> WarpScan; typedef cub::WarpScan<int, 16> WarpScan16; typename WarpScan::TempStorage temp1; typename WarpScan16::TempStorage temp2; int input = data1[threadIdx.x]; int output1 = 0; int output2 = 0; WarpScan(temp1).InclusiveSum(input, output1); data1[threadIdx.x] = output1; WarpScan16(temp2).InclusiveSum(input, output2); data2[threadIdx.x] = output1; } void foo(int* data1, int* data2) { kernel<<<1, 32>>>(data1, data2); }
results in the following migrated SYCL* code:
void kernel(int* data1, int* data2, const sycl::nd_item<3> &item_ct1) { int input = data1[item_ct1.get_local_id(2)]; int output1 = 0; int output2 = 0; output1 = sycl::inclusive_scan_over_group(item_ct1.get_sub_group(), input, sycl::plus<>()); data1[item_ct1.get_local_id(2)] = output1; /* DPCT1085:0: The function inclusive_scan_over_group requires sub-group size to be 16, while other sub-group functions in the same SYCL kernel require a different sub-group size. You may need to adjust the code. */ output2 = sycl::inclusive_scan_over_group(item_ct1.get_sub_group(), input, sycl::plus<>()); data2[item_ct1.get_local_id(2)] = output1; } void foo(int* data1, int* data2) { dpct::get_in_order_queue().parallel_for( sycl::nd_range<3>(sycl::range<3>(1, 1, 32), sycl::range<3>(1, 1, 32)), [=](sycl::nd_item<3> item_ct1) [[intel::reqd_sub_group_size(32)]] { kernel(data1, data2, item_ct1); }); } void foo(int* data) { dpct::get_in_order_queue().parallel_for( sycl::nd_range<3>(sycl::range<3>(1, 1, 32), sycl::range<3>(1, 1, 32)), [=](sycl::nd_item<3> item_ct1) [[intel::reqd_sub_group_size(32)]] { kernel(data, item_ct1); }); }
which is manually adjusted to:
void kernel(int* data1, int* data2, const sycl::nd_item<3> &item_ct1) { int input = data1[item_ct1.get_local_id(2)]; int output1 = 0; int output2 = 0; output1 = sycl::inclusive_scan_over_group(item_ct1.get_sub_group(), input, sycl::plus<>()); data1[item_ct1.get_local_id(2)] = output1; output2 = sycl::inclusive_scan_over_group(item_ct1.get_sub_group(), input, sycl::plus<>()); data2[item_ct1.get_local_id(2)] = output1; item_ct1.barrier(); if (item_ct1.get_local_id(2) % 32 >= 16) { int warp_id = item_ct1.get_local_id(2) / 32; data2[item_ct1.get_local_id(2)] -= data2[warp_id * 32 + 15]; } } void foo(int* data1, int* data2) { dpct::get_in_order_queue().parallel_for( sycl::nd_range<3>(sycl::range<3>(1, 1, 32), sycl::range<3>(1, 1, 32)), [=](sycl::nd_item<3> item_ct1) [[intel::reqd_sub_group_size(32)]] { kernel(data1, data2, item_ct1); }); }
Suggestions to Fix
Code requires manual fix. Rewrite the code manually.