Migrating C++ Thrust* Applications to SYCL* and oneDPL

ID 766981
Updated 12/14/2022

Using oneAPI to Modernize and Accelerate C++ Applications
While Freeing Them from Vendor Lock-In

Programmers love their libraries. Not just because we’re lazy, but because libraries boost productivity and performance. Why waste time optimizing common patterns, only to do it again when the target hardware changes? You already did the work of identifying the patterns, so why not use predefined abstractions like the ISO C++ Standard Template Library (STL)? They’re ready to use and optimized by experts.

The STL is great if you’re only targeting CPUs, but the computing landscape is heterogeneous. You need options for more than just CPUs. Two such options are the NVIDIA* Thrust library and the oneAPI Data Parallel C++ Library (oneDPL). How do they compare? Both are open-source projects [1][2], but Thrust’s device support depends on proprietary software components (i.e., it only supports GPUs through CUB*/CUDA*). This means vendor lock-in. On the other hand, oneDPL is based on SYCL, which by design enables support for accelerators from multiple vendors [3].

It’s natural to ask how best to migrate from Thrust to the nonproprietary oneDPL. Let’s address this question by looking at two complementary migration strategies, a tool-assisted option and manual recoding, applied to two simple examples. You may have already heard about the SYCLomatic tool for migrating CUDA to SYCL [4]. Tool-assisted, source-to-source migration has its limitations, especially for templated code, not necessarily in terms of functionality but in the quality and maintainability of the resulting code. Some manual editing and code modernization is usually necessary; for example, introducing newer C++ features like lambda functions or automatic type deduction. However, when you’re done, the resulting code is based on open standards and aligned with modern C++. Thrust has been around for a decade, so it’s time to upgrade your code. Let’s have a look at some code examples.

Our first example is an in-place transformation of selected values in a sequence. A mask defines which values to transform. The algorithm takes two input sequences: one that holds the values and another that holds the stencil (the mask). Each value from the first sequence is transformed (negated in the example code) if the corresponding stencil value satisfies a predicate (non-zero in the example code):

// Thrust code snippet (stage 0)
thrust::transform_if(dev_inp,
                     dev_inp + 10,
                     dev_stencil,
                     dev_inp,
                     thrust::negate<int>(),
                     thrust::identity<int>());
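
To make the stencil semantics concrete, here is a minimal host-only reference sketch in plain C++. The names transform_if_reference and negate_where_stencil_set are hypothetical helpers introduced for illustration; they are not part of Thrust, oneDPL, or SYCLomatic, and the loop only models the behavior described above rather than any library's implementation.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side reference for the stencil form of transform_if:
// result[i] = op(input[i]) wherever pred(stencil[i]) holds; all other
// result elements are left untouched.
// Assumes input, stencil, and result have the same length.
template <typename T, typename S, typename UnaryOp, typename Pred>
void transform_if_reference(const std::vector<T>& input,
                            const std::vector<S>& stencil,
                            std::vector<T>& result,
                            UnaryOp op, Pred pred)
{
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (pred(stencil[i])) {
            result[i] = op(input[i]);
        }
    }
}

// Mirrors the call above: negate each value whose stencil entry is non-zero.
// The vector is taken by value so the caller's data is not modified.
std::vector<int> negate_where_stencil_set(std::vector<int> values,
                                          const std::vector<int>& stencil)
{
    const std::vector<int> input = values;  // keep input and result distinct
    transform_if_reference(input, stencil, values, std::negate<int>(),
                           [](int s) { return s != 0; });
    return values;
}
```

For example, calling negate_where_stencil_set with the values {1, 2, 3, 4} and the stencil {1, 0, 1, 0} returns {-1, 2, -3, 4}.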

oneDPL is closely aligned with the standard C++ algorithms. However, the STL has no algorithm overloads that take a stencil sequence, so neither does oneDPL. The pattern must be expressed differently, so SYCLomatic provides a functionally equivalent helper as a drop-in replacement:

// Migrated code (stage 1)
dpct::transform_if(oneapi::dpl::execution::make_device_policy(syclQueue),
                   dev_inp,
                   dev_inp + 10,
                   dev_stencil,
                   dev_inp,
                   std::negate<int>(),
                   dpl::identity());  // dpl::identity mirrors C++20 std::identity

We can do better by using a lambda function as a custom functor, which leaves us with standard C++ code. The stencil is no longer a special argument; it is passed as an ordinary second input sequence. The predicate is evaluated inside the custom functor instead of being hidden in the internals of a library:

// After manual editing (stage 2)
dpl::transform(dpl::execution::dpcpp_default,
               dev_inp,
               dev_inp + 10,
               dev_stencil,
               dev_inp,
               [](const auto& input, const auto& mask)
                  {
                     return mask ? std::negate<>()(input) : input;
                  });

Now we have expressed our pattern using STL behavior exclusively, except for the custom execution policy that is needed to target a SYCL device. This is an important difference between Thrust and oneDPL. By default, Thrust infers where to execute an algorithm from the type of the input iterators. In ISO C++, by contrast, the execution policy, not the iterators, determines how an algorithm executes. oneDPL follows the standard C++ algorithms and provides a device policy that you can construct from a SYCL queue to select the target device. For convenience, there is also a predefined policy, dpcpp_default, that targets the default SYCL device. The stage 1 code above builds a device policy from a queue with make_device_policy; the stage 2 code uses the predefined policy.
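
Because the stage 2 call follows the shape of the standard binary transform algorithm, the same pattern can be tried on the host without any device policy. The following is a minimal sketch under that assumption; negate_masked is a hypothetical helper name, and the code uses only std::transform on std::vector, so it says nothing about device execution.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Host-only sketch: negate each value whose mask entry is non-zero and
// write the result back in place, using only the standard binary
// std::transform overload (no execution policy).
// Assumes mask holds at least as many elements as values.
void negate_masked(std::vector<int>& values, const std::vector<int>& mask)
{
    std::transform(values.begin(), values.end(),  // first input range
                   mask.begin(),                  // second input range (the mask)
                   values.begin(),                // output aliases the first input
                   [](const auto& input, const auto& m) {
                       return m ? std::negate<>()(input) : input;
                   });
}
```

Compared with the stage 2 snippet, the only differences are the namespace and the missing first argument: the policy is what tells oneDPL to run the same functor on a SYCL device.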

Our next migration example looks at another common pattern: sorting. The Thrust version sorts by key. This algorithm takes two input sequences: one that contains values and another that holds keys:

// Thrust code snippet (stage 0)
thrust::stable_sort_by_key(dev_keys, dev_keys + 10, dev_values);

The values are now ordered according to their keys. Note that the key range comes first in the argument list, followed by the start of the values sequence. Once again, SYCLomatic provides a functional alternative because the pattern isn’t in existing ISO C++ functions:

// Migrated code (stage 1)
dpct::stable_sort(oneapi::dpl::execution::make_device_policy(syclQueue),
                  dev_keys,
                  dev_keys + 10,
                  dev_values);

We can do better with a custom iterator and functor. We use a zip iterator to tie the value and key sequences together into a single iteration space. Our custom functor operates on value-key pairs and compares only the key elements. The result is slightly more verbose, but it only uses ISO C++ functions with generic extensions, such as custom iterators:

// After manual editing (stage 2)
dpl::stable_sort(dpl::execution::dpcpp_default,
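                 dpl::make_zip_iterator(dev_values, dev_keys),
                 dpl::make_zip_iterator(dev_values, dev_keys) + 10,
                 [](const auto& a, const auto& b)
                   {
                      using std::get;  // lets get<1> resolve for the zipped tuple type
                      return get<1>(a) < get<1>(b);  // compare the keys only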
                   });
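
The idea of sorting value-key pairs with a comparator that looks only at the key can be sketched on the host with standard containers. In the sketch below, a temporary vector of std::pair stands in for the zip iterator; that substitution, and the helper name stable_sort_values_by_key, are assumptions made for illustration and are not how oneDPL implements zipping.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Host-only stand-in for the zip-iterator approach: pair each value with its
// key, stable-sort the pairs by key only, then write both sequences back.
// Assumes values and keys have the same length.
void stable_sort_values_by_key(std::vector<int>& values, std::vector<int>& keys)
{
    std::vector<std::pair<int, int>> zipped;  // (value, key), same order as the zip
    zipped.reserve(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        zipped.emplace_back(values[i], keys[i]);
    }

    std::stable_sort(zipped.begin(), zipped.end(),
                     [](const auto& a, const auto& b) {
                         return std::get<1>(a) < std::get<1>(b);  // compare keys only
                     });

    for (std::size_t i = 0; i < zipped.size(); ++i) {
        values[i] = zipped[i].first;
        keys[i] = zipped[i].second;
    }
}
```

Because the sort is stable, values that share a key keep their original relative order: with values {10, 20, 30, 40, 50} and keys {2, 1, 2, 0, 1}, the values end up as {40, 20, 50, 10, 30} and the keys as {0, 1, 1, 2, 2}.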

Libraries with accelerated implementations of C++ STL algorithms can greatly boost developer productivity. A tool-assisted migration step can be complemented with manual code editing to improve maintainability and code quality. For future development, a range-based API and the P2300R5 execution proposal [5] are viable alternatives to overcome current limitations in C++ and express the parallel patterns as described in this article. However, ISO C++ is not there yet, and those features would require a major rework of existing applications. For now, to avoid proprietary languages and vendor lock-in and to get access to multi-vendor hardware, it’s best to stick with ISO C++ whenever possible and keep the extensions generic.

References

  1. oneDPL repository: https://github.com/oneapi-src/oneDPL
  2. Thrust repository: https://github.com/NVIDIA/thrust
  3. The Case for SYCL: Why ISO C++ Is Not Enough for Heterogeneous Computing
  4. SYCLomatic repository: https://github.com/oneapi-src/SYCLomatic. Note that the SYCLomatic examples in this article are for illustration purposes only. The actual result of an implementation such as the Intel DPC++ Compatibility Tool might look different.
  5. P2300R5 std::execution