Removing Conditional Checks

Developer Guide

oneAPI GPU Optimization Guide

Download PDF

ID 771772

Date 4/18/2024

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

Visible to Intel only — GUID: GUID-802F4D9E-29E5-4AEA-AEA3-82635F20EC4E

View Details

Removing Conditional Checks

In Sub-Groups and SIMD Vectorization, we learned that SIMD divergence can negatively affect performance. If all work items in a sub-group execute the same instruction, the SIMD lanes are maximally utilized. If one or more work items take a divergent path, then both paths have to be executed before they merge.

Divergence is caused by conditional checks, though not all conditional checks cause divergence. Some conditional checks, even when they do not cause SIMD divergence, can still be performance hazards. In general, removing conditional checks can help performance.

Padding Buffers to Remove Conditional Checks

Look at the convolution example from Shared Local Memory:

    sycl::buffer<int> ibuf(input.data(), N);
    sycl::buffer<int> obuf(output.data(), N);
    sycl::buffer<int> kbuf(kernel.data(), M);

    auto e = q.submit([&](auto &h) {
      sycl::accessor iacc(ibuf, h, sycl::read_only);
      sycl::accessor oacc(obuf, h);
      sycl::accessor kacc(kbuf, h, sycl::read_only);

      h.parallel_for(sycl::nd_range<1>(sycl::range{N}, sycl::range{256}),
                     [=](sycl::nd_item<1> it) {
                       int i = it.get_global_linear_id();
                       int group = it.get_group()[0];
                       int gSize = it.get_local_range()[0];

                       int t = 0;
                       int _M = static_cast<int>(M);
                       int _N = static_cast<int>(N);

                       if ((group == 0) || (group == _N / gSize - 1)) {
                         if (i < _M / 2) {
                           for (int j = _M / 2 - i, k = 0; j < _M; ++j, ++k) {
                             t += iacc[k] * kacc[j];
                           }
                         } else {
                           if (i + _M / 2 >= _N) {
                             for (int j = 0, k = i - _M / 2;
                                  j < _M / 2 + _N - i; ++j, ++k) {
                               t += iacc[k] * kacc[j];
                             }
                           } else {
                             for (int j = 0, k = i - _M / 2; j < _M; ++j, ++k) {
                               t += iacc[k] * kacc[j];
                             }
                           }
                         }
                       } else {
                         for (int j = 0, k = i - _M / 2; j < _M; ++j, ++k) {
                           t += iacc[k] * kacc[j];
                         }
                       }

                       oacc[i] = t;
                     });
    });

The nested if-then-else conditional checks are necessary to take care of the first and last 128 elements in the input so indexing will not run out of bounds. If we pad enough 0s before and after the input array, these conditional checks can be safely removed:

  std::vector<int> input(N + M / 2 + M / 2);
  std::vector<int> output(N);
  std::vector<int> kernel(M);

  srand(2009);
  for (size_t i = M / 2; i < N + M / 2; ++i) {
    input[i] = rand();
  }

  for (size_t i = 0; i < M / 2; ++i) {
    input[i] = 0;
    input[i + N + M / 2] = 0;
  }

  for (size_t i = 0; i < M; ++i) {
    kernel[i] = rand();
  }

  {
    sycl::buffer<int> ibuf(input.data(), N + M / 2 + M / 2);
    sycl::buffer<int> obuf(output.data(), N);
    sycl::buffer<int> kbuf(kernel.data(), M);

    auto e = q.submit([&](auto &h) {
      sycl::accessor iacc(ibuf, h, sycl::read_only);
      sycl::accessor oacc(obuf, h);
      sycl::accessor kacc(kbuf, h, sycl::read_only);

      h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{256}),
                     [=](sycl::nd_item<1> it) {
                       int i = it.get_global_linear_id();
                       int t = 0;

                       for (size_t j = 0; j < M; ++j) {
                         t += iacc[i + j] * kacc[j];
                       }

                       oacc[i] = t;
                     });
    });
    q.wait();

    size_t kernel_ns = (e.template get_profiling_info<
                            sycl::info::event_profiling::command_end>() -
                        e.template get_profiling_info<
                            sycl::info::event_profiling::command_start>());
    std::cout << "Kernel Execution Time Average: total = " << kernel_ns * 1e-6
              << " msec" << std::endl;
  }

Replacing Conditional Checks with Relational Functions

Another way to remove conditional checks is to replace them with relational functions, especially built-in relational functions. It is strongly recommended to use a built-in function if one is available. SYCL provides a rich set of built-in relational functions like select(), min(), max(). In many cases you can use these functions to replace conditional checks and achieve better performance.

Consider the convolution example again. The if-then-else conditional checks can be replaced with built-in functions min() and max().

    sycl::buffer<int> ibuf(input.data(), N);
    sycl::buffer<int> obuf(output.data(), N);
    sycl::buffer<int> kbuf(kernel.data(), M);

    auto e = q.submit([&](auto &h) {
      sycl::accessor iacc(ibuf, h, sycl::read_only);
      sycl::accessor oacc(obuf, h);
      sycl::accessor kacc(kbuf, h, sycl::read_only);

      h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{256}),
                     [=](sycl::nd_item<1> it) {
                       int i = it.get_global_linear_id();
                       int t = 0;
                       int startj = sycl::max<int>(M / 2 - i, 0);
                       int endj = sycl::min<int>(M / 2 + N - i, M);
                       int startk = sycl::max<int>(i - M / 2, 0);
                       for (int j = startj, k = startk; j < endj; j++, k++) {
                         t += iacc[k] * kacc[j];
                       }
                       oacc[i] = t;
                     });
    });

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

oneAPI GPU Optimization Guide

Removing Conditional Checks

Padding Buffers to Remove Conditional Checks

Replacing Conditional Checks with Relational Functions