oneAPI DPC++ Library Manual Migration Guide

ID 822493
Updated 5/7/2024
Version Latest
Public


SYCL* has emerged as an open, unified programming model that enables developer productivity across various hardware architectures, including CPUs, GPUs, and FPGAs. Unlike proprietary programming models, which often constrain users to vendor-specific hardware, SYCL provides multiarchitecture support and portability across hardware types through its high-level standard and support for extensibility. To begin taking advantage of the SYCL programming model, you may first need to migrate an application from an existing programming model such as CUDA*. Migrating from CUDA to SYCL involves more than migrating raw CUDA source code: library calls that rely on CUDA must also be migrated. Two commonly used libraries implemented in CUDA are NVIDIA* Thrust* and CUB*. To leverage the SYCL programming model's benefits while maintaining the algorithmic abstraction provided by Thrust and CUB, this guide discusses manually migrating applications that use these libraries to instead use SYCL-based algorithms in the oneAPI DPC++ Library (oneDPL).

oneDPL Overview

oneDPL is a productivity library for heterogeneous computing that assists developers in parallel application development. Built on a set of core parallel patterns implemented across several backends, including oneTBB, OpenMP*, and SYCL, oneDPL consolidates host and device parallelism through an Application Programming Interface (API) motivated by the ISO/IEC 14882:2017 standard (C++17).

In addition to providing accelerator offload for the set of parallel algorithms in the C++17 standard, a set of data parallel compatibility headers is provided to aid in the migration of NVIDIA Thrust and CUB library calls when moving applications from CUDA C++ to the SYCL programming model. When using the Intel® DPC++ Compatibility Tool or its open source counterpart, the SYCLomatic tool (collectively called Compatibility Tools) for CUDA to SYCL migration, calls to CUDA libraries are migrated to SYCL libraries. Supported Thrust and CUB algorithm calls are automatically migrated to oneDPL algorithms. This document details the high-level process of manual Thrust and CUB API migration for code not currently automatically migratable with the Compatibility Tools, or for users who wish to migrate without the tool.

SYCL Programming Model

While oneDPL supports several parallel backends, the oneDPL accelerator offload is built on the SYCL programming model. To begin migrating Thrust and CUB APIs from CUDA, you must familiarize yourself with the basics of SYCL. For users with previous CUDA experience, the SYCLomatic documentation provides a mapping guide for CUDA users and lists equivalencies of common CUDA paradigms in SYCL. 

To effectively utilize oneDPL for parallel accelerator offload, familiarization with SYCL buffers, USM memory, and queues is highly encouraged.
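
As an orientation, the following minimal sketch shows these three concepts together: a sycl::queue is created, a USM shared allocation is made, and a simple kernel is submitted. The array size and kernel body are illustrative only:

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q; // queue targeting the default device

    // USM shared allocation, accessible from both host and device code
    int* data = sycl::malloc_shared<int>(1024, q);

    // Submit a kernel that initializes the data on the device
    q.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
        data[i] = static_cast<int>(i[0]);
    }).wait();

    sycl::free(data, q);
    return 0;
}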

Suggested Environment Configuration

To begin the process of manual migration, you must properly configure oneDPL. oneDPL is provided as a header-only library, except for its support of SYCL device offload of C++ standard parallel algorithms (-fsycl-pstl-offload). Installing and sourcing the setup scripts provided in the Intel® oneAPI Base Toolkit (Base Kit) automatically adds the oneDPL headers to your include path. If you want to use the latest oneDPL features, you may clone the oneDPL GitHub repository and ensure the repository's include directory is in your include path.

oneDPL headers are in the oneapi/dpl directory with header names corresponding to C++ STL counterparts. These headers contain an implementation of the C++17 PSTL along with extensions and a set of iterators. The following example details some of the common header includes with oneDPL:

#include <oneapi/dpl/execution> // Corresponding STL header: execution
#include <oneapi/dpl/algorithm> // Corresponding STL header: algorithm
#include <oneapi/dpl/numeric>   // Corresponding STL header: numeric
#include <oneapi/dpl/memory>    // Corresponding STL header: memory
#include <oneapi/dpl/iterator>  // Corresponding STL header: iterator

For more details on oneDPL installation and usage, see Get Started with the oneAPI DPC++ Library.

For details on installing and configuring your system's oneAPI environment, see the Intel® oneAPI Toolkits Installation Guides. If you wish to use oneDPL on non-Intel hardware, you may refer to Codeplay's oneAPI for NVIDIA® GPUs and oneAPI for AMD GPUs (beta) guides.

oneDPL Specific Utility Helper Functions

This section is optional.

As a migration utility, oneDPL provides a set of compatibility headers for use by the Compatibility Tools. These headers contain implementations of many algorithms that do not directly overlap with a PSTL API. Additionally, if you have used the Compatibility Tools but the migration of the Thrust or CUB API you are using is not currently automatable, an implementation of this API may be provided in these headers. If you have installed the Compatibility Tools with your oneAPI installation, these headers are already on your system. The headers may be found in the SYCLomatic GitHub repository, and their exact locations are discussed in the SYCLomatic Documentation.

These headers may reduce the manual migration work on your side. However, using these headers is optional, and this guide discusses migrating without the Compatibility Tools header dependency.

The Migration Process

The steps for manually migrating source code written using NVIDIA Thrust and CUB libraries can be listed as follows:

  1. Identify CUDA APIs migratable to oneDPL.
  2. Identify CUDA APIs with no direct mapping to oneDPL.
  3. Choose the appropriate oneDPL policy type(s) for your migration.
  4. Migrate all operators and predicates from CUDA C++ to C++17.
  5. Complete the source code migration and test for correctness.
  6. Evaluate the performance of the migration.

Identify All CUDA APIs Migratable to oneDPL

When manually migrating Thrust or CUB APIs to a PSTL algorithm or oneDPL extension, API mappings may be found in two places:

  • oneDPL headers
  • C++ STL

Mapping to PSTL APIs

The core set of algorithms implemented in oneDPL is the set of parallel algorithms defined in the ISO/IEC 14882:2017 standard (C++17). When migrating from a Thrust or CUB API, first check whether the API's function signature overlaps with the set of parallel STL algorithms defined in the C++17 standard. If it does, the API directly maps to a oneDPL algorithm in the header corresponding to its PSTL counterpart within the oneapi/dpl subdirectory. Additionally, some CUDA APIs may have migration targets within the set of oneDPL extensions, which are found in the same headers. All oneDPL algorithms of this type may be invoked from the oneapi::dpl and dpl namespaces, and non-extension oneDPL algorithms may also be invoked from the std namespace with an execution policy.
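
For example, a thrust::sort call on device data has a direct oneDPL counterpart. The following sketch assumes the data resides in a USM shared allocation:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
...
// Thrust (before migration): thrust::sort(thrust::device, data, data + n);
sycl::queue q;
int* data = sycl::malloc_shared<int>(n, q);
...
oneapi::dpl::sort(oneapi::dpl::execution::make_device_policy(q), data, data + n);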

Mapping Functors to the C++ STL

Since SYCL supports functors written in standard C++17 syntax, many of the operators and predicates exposed in Thrust and CUB namespaces migrate to functors in the C++ standard template library. Referring to the functional header documentation and other headers for additional STL features, you may find migration opportunities for many of the functors exposed in the Thrust or CUB namespace. The following example demonstrates how you may use the C++ functional header functor with oneDPL to perform a multiplication operation over a SYCL buffer:

#include <functional>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
...
oneapi::dpl::reduce(oneapi::dpl::execution::dpcpp_default,
                    oneapi::dpl::begin(my_buffer), oneapi::dpl::end(my_buffer), 1.0f, std::multiplies<>());

This example uses std::multiplies from the C++ STL functional header, in contrast to thrust::multiplies or a similar functor written in a CUDA device-executable manner. Note that oneDPL also wraps your STL implementation, so the example could equivalently be compiled by including <oneapi/dpl/functional> and using oneapi::dpl::multiplies<>(); the identical multiplication functor from the C++ STL on your system is used. If the Thrust or CUB functor you are migrating does not have a direct mapping to the C++ STL, search the Compatibility Tools headers to see if a mapping is currently implemented by oneDPL. Otherwise, the functor migration must be performed manually, as described in Migration of Operators and Predicates.

Migrating Iterators and Data Management Utilities

Thrust provides a set of fancy iterators and utilities for host/device data management and offload. The steps described in Mapping to PSTL APIs and Mapping Functors to the C++ STL also apply to the migration of these APIs. oneDPL provides a set of iterators beyond those in the C++ STL, as well as the compatibility headers for migration. Similarly, for data management, Thrust provides utilities such as thrust::device_vector and thrust::pointer for device memory management. While no such APIs are provided in the core of oneDPL, the process for implementing practical equivalencies is described in the next section.
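
As an example of an iterator migration, thrust::counting_iterator maps to oneapi::dpl::counting_iterator from <oneapi/dpl/iterator>. The following sketch assumes result_buffer is a sycl::buffer, and the squaring operation is illustrative:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>
...
// Thrust (before migration):
// thrust::counting_iterator<int> first(0);
// thrust::transform(thrust::device, first, first + n, result, square_functor());
oneapi::dpl::counting_iterator<int> first(0);
oneapi::dpl::transform(oneapi::dpl::execution::dpcpp_default, first, first + n,
                       oneapi::dpl::begin(result_buffer),
                       [](int i) { return i * i; });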

Identify CUDA APIs with No Direct Mapping to oneDPL

Thrust and CUB provide APIs that do not directly overlap with a C++ PSTL algorithm or oneDPL extension. If you are migrating an API of this type with no equivalency, you will either need to use the API provided in the oneDPL Specific Utility Helper functions, if supported, or migrate the API manually.

Using APIs in the oneDPL Specific Utility Helper Functions

This section is optional.

If you use the algorithms in the oneDPL Specific Utility Helper functions directly, you gain access to a set of APIs designed to augment core oneDPL and SYCL functionality with a feature set similar to that of NVIDIA CUB and Thrust. If your oneAPI installation includes the Compatibility Tools, the oneDPL compatibility headers may be included in the following manner:

#include <dpct/dpct.hpp>
#include <dpct/dpl_utils.hpp>

Unlike the core set of oneDPL algorithms, these algorithms are not included in the oneDPL documentation, as their primary purpose is for use by the Compatibility Tools. The easiest way to search these headers for an equivalent API is to explore the <dpct/dpl_utils.hpp> header. Searching the algorithm, memory, and numeric compatibility headers reveals the currently implemented compatibility APIs. All algorithms in the compatibility headers are currently located in the dpct namespace.

The location of oneDPL compatibility headers and the namespace where its algorithms reside are subject to potential future deprecation, renaming, and movement. If you encounter issues with locating the headers, reference the oneDPL release notes and documentation for any updates about the location of these APIs. 

If you find that a parallel API you are using is not migratable via oneDPL or its compatibility headers, you may request this feature via GitHub by filing an issue.

Manually Migrating APIs in the oneDPL Specific Utility Helper Functions

While the oneDPL Specific Utility Helper functions are provided as open-source software, not all users may have access to these headers or wish to add this dependency to their application. Most APIs in these headers fall into a few categories, mapping to higher-level oneDPL algorithm calls listed in Intel® oneAPI DPC++ Library Specific Utility Helper Functions. These categories, along with sample implementations showing how you may reproduce them in your application, are detailed in the following sections.

Mask-Predicated Algorithms

The mask-predicated algorithms are the first class of algorithms implemented in the oneDPL Specific Utility Helper functions. These algorithms are motivated by their C++ STL counterparts, except that an additional iterator is provided that defines a mask dictating whether an operation should be applied to the corresponding offset in the input buffer. As an example of this class of algorithms, consider the mask-predicated version of copy_if shown in the following snippet, which details the equivalent API produced by the Compatibility Tools when mapping from thrust::copy_if.

int n = 5;
int input[]  = { 0, 1, 2, 3, 4 };
int mask[]   = { 0, 1, 0, 1, 0 };
int output[] = { 0, 0, 0, 0, 0 };
// DPC++ compatibility tool migrates mask overload thrust::copy_if to dpct::copy_if
auto end_output = dpct::copy_if(std::execution::seq, input, input + 5, mask, output,
    [](auto e) { return e == 1; });
// Value of output: [ 1, 3, 0, 0, 0 ]
// mask is unmodified

In the above example of using dpct::copy_if, the predicate is applied to each element in mask and specifies which elements of input should be copied to output. Our predicate checks if the mask equals one and copies all corresponding instances from input to output. To implement this pattern without the dependency of dpct::copy_if, the base oneDPL algorithm oneapi::dpl::copy_if may be combined with a oneDPL zip iterator to achieve this effect. You may determine which indices to copy from the input to the output buffer by zipping over the mask and input sequences. The following oneDPL call achieves the equivalent mask-predicated copy_if operation:

int n = 5;
int input[]  = { 0, 1, 2, 3, 4 };
int mask[]   = { 0, 1, 0, 1, 0 };
int output[] = { 0, 0, 0, 0, 0 };
auto end_output = dpl::copy_if(std::execution::seq,
                               dpl::make_zip_iterator(input, mask),
                               dpl::make_zip_iterator(input + n, mask + n),
                               dpl::make_zip_iterator(output, dpl::discard_iterator()),
                               [](auto input_tuple) { return std::get<1>(input_tuple) == 1; });
// Value of output: [ 1, 3, 0, 0, 0 ]
// mask is unmodified

dpct::copy_if is implemented generically in much the same way as the previous example and is exposed through the dpct namespace, as called in the first example. You may take a similar approach if only using the core oneDPL headers. The same strategy applies to the following mask-predicated algorithms (a replace_if sketch follows the list):

  • copy_if
  • partition
  • partition_copy
  • remove_if
  • remove_copy_if
  • replace_if
  • replace_copy_if
  • stable_partition_copy
  • stable_partition
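
As an example of applying this strategy to another algorithm in the list, the following sketch implements a mask-predicated replace_if with dpl::for_each over zipped sequences. The replacement value and predicate are illustrative:

int n = 5;
int input[] = { 0, 1, 2, 3, 4 };
int mask[]  = { 0, 1, 0, 1, 0 };
// Replace input elements whose corresponding mask element equals one with -1
dpl::for_each(std::execution::seq,
              dpl::make_zip_iterator(input, mask),
              dpl::make_zip_iterator(input + n, mask + n),
              [](auto t) { if (std::get<1>(t) == 1) std::get<0>(t) = -1; });
// Value of input: [ 0, -1, 2, -1, 4 ]
// mask is unmodified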

Key-Value Pair Algorithms

The second class of algorithms implemented in the oneDPL Specific Utility Helper functions is the set of key-value pair algorithms. Like the mask-predicated algorithms, the key-value algorithms accept an additional iterator compared to the corresponding STL algorithm. The sequence to which the provided comparator or predicate is applied is called the keys, and the second input sequence is referred to as the values. Unlike the mask-predicated algorithms, which only modify a single sequence, the key-value pair algorithms modify the user-provided keys along with the provided values. An example using dpct::unique is shown below:

int n = 6;
int keys[]   = { 1, 1, 2, 2, 3, 3 };
int values[] = { 0, 1, 2, 3, 4, 5 };
// DPC++ compatibility tool migrates thrust::unique_by_key to dpct::unique
auto end_pair = dpct::unique(std::execution::seq,
                             keys, keys + n, values,
                             [](auto key1, auto key2) { return key1 == key2; });
// Keys:   [ 1, 2, 3, x, x, x ]
// Values: [ 0, 2, 4, x, x, x ]

The algorithm performs the typical std::unique on the input keys and applies the equivalent changes to the provided values. Once again, these algorithms can be implemented using oneDPL calls and zip iterators, as shown in the following example:

int n = 6;
int keys[]   = { 1, 1, 2, 2, 3, 3 };
int values[] = { 0, 1, 2, 3, 4, 5 };
auto end_output = dpl::unique(std::execution::seq,
                              dpl::make_zip_iterator(keys, values),
                              dpl::make_zip_iterator(keys + n, values + n),
                              [](auto tup1, auto tup2) {
                                  return std::get<0>(tup1) == std::get<0>(tup2);
                              });
// Keys:   [ 1, 2, 3, x, x, x ]
// Values: [ 0, 2, 4, x, x, x ]

The following key-value pair algorithms may be implemented similarly (a key-value sort sketch follows the list):

  • unique
  • unique_copy
  • merge
  • sort
  • stable_sort
  • set_difference
  • set_intersection
  • set_symmetric_difference
  • set_union
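
As an example, the following sketch implements a key-value sort, the practical equivalent of thrust::sort_by_key, by sorting the zipped (key, value) pairs while comparing only the keys. The input values are illustrative:

int n = 6;
int keys[]   = { 3, 1, 2, 3, 1, 2 };
int values[] = { 0, 1, 2, 3, 4, 5 };
// Sort the zipped (key, value) pairs, comparing only the keys
dpl::stable_sort(std::execution::seq,
                 dpl::make_zip_iterator(keys, values),
                 dpl::make_zip_iterator(keys + n, values + n),
                 [](auto tup1, auto tup2) {
                     return std::get<0>(tup1) < std::get<0>(tup2);
                 });
// Keys:   [ 1, 1, 2, 2, 3, 3 ]
// Values: [ 1, 4, 2, 5, 0, 3 ]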

Memory Algorithms

The set of memory APIs is the third class of algorithms implemented in the oneDPL Specific Utility Helper functions. The memory APIs provide practical equivalencies to the utilities of thrust::device_vector, thrust::device_iterator, and thrust::device_reference, implemented in SYCL with USM shared memory to enable access from both host and device code. Additionally, a set of memory APIs, including malloc, malloc_device, free, and free_device, are implemented as wrappers around the SYCL malloc and free functions to map easily from Thrust memory allocation APIs. If you wish to implement these APIs yourself without a dependency on the Compatibility Tools headers, the implementations of these algorithms in the dpct namespace are a good starting point.
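
For illustration, the following hypothetical sketch shows the core idea behind a device_vector-style container built on USM shared memory. A complete implementation, such as dpct::device_vector, additionally handles resizing, copying, and richer iterator support:

#include <sycl/sycl.hpp>
#include <cstddef>

// Minimal device_vector-style container backed by USM shared memory (sketch only)
template <typename T>
class shared_vector {
    sycl::queue q_;
    T* data_ = nullptr;
    std::size_t size_ = 0;
public:
    shared_vector(std::size_t n, sycl::queue q) : q_(q), size_(n) {
        data_ = sycl::malloc_shared<T>(n, q_);
    }
    ~shared_vector() { sycl::free(data_, q_); }
    shared_vector(const shared_vector&) = delete;
    shared_vector& operator=(const shared_vector&) = delete;

    // Raw pointers into USM shared memory are valid on both host and device
    T* begin() { return data_; }
    T* end() { return data_ + size_; }
    std::size_t size() const { return size_; }
    T& operator[](std::size_t i) { return data_[i]; }
};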

Additional Algorithms

The classes of algorithms previously shown cover a significant portion of the oneDPL Specific Utility Helper functions APIs but are not exhaustive. Intel® oneAPI DPC++ Library Specific Utility Helper Functions provides discussions of these APIs in greater detail. Referencing the open-source implementations of the oneDPL Specific Utility Helper functions is a good starting point for any migrations you may wish to implement independently. 

As of oneAPI 2024.1, the current list of APIs that fall within this category is:

  • partition_point
  • iota
  • inner_product
  • for_each_index
  • transform_output_iterator
  • constant_iterator
  • arg_index_input_iterator
  • histogram_even
  • histogram_even_roi
  • multi_histogram_even
  • multi_histogram_even_roi
  • multi_histogram_range
  • multi_histogram_range_roi
  • sort_keys
  • sort_pairs
  • segmented_sort_keys
  • segmented_sort_pairs
  • equal_range
  • partition_flagged
  • partition_if
  • unique_count
  • nontrivial_run_length_encode
  • segmented_reduce_argmax
  • segmented_reduce_argmin

Choose the Appropriate oneDPL Policy

Modeled after the C++ parallel standard template library, parallelism in oneDPL is driven by explicitly provided oneDPL Execution Policies. In contrast to Thrust's implicit dispatch based on tagged pointers and iterators, you must provide an execution policy dictating the type of parallel or serial execution you wish to achieve; this is comparable to Thrust's explicit dispatch mode via policies. For migrations of Thrust algorithms called on device data, and for all CUB migration cases, a oneDPL device policy should be used. The following example details how a device policy may be created to invoke a oneDPL algorithm on a GPU device:

...
sycl::queue q(sycl::gpu_selector_v); // select default GPU device
sycl::buffer<int> my_buffer(N);
...
int res = oneapi::dpl::reduce(oneapi::dpl::execution::make_device_policy(q),
                              oneapi::dpl::begin(my_buffer), oneapi::dpl::end(my_buffer));

The previous example shows how a parallel reduction may be achieved over a SYCL buffer in oneDPL with a user-provided queue on a GPU device. A queue, by default, operates in an out-of-order execution mode, meaning that kernels submitted to a sycl::queue may not be executed in their order of submission. This contrasts with a CUDA stream, which schedules kernels in order of submission by default. However, by utilizing properties in SYCL, you can create a queue that schedules kernels in an in-order fashion, achieving similar scheduling behavior to a CUDA stream. The following example shows how to create an in-order queue that is further used to construct a oneDPL policy:

...
sycl::property_list properties(sycl::property::queue::in_order{});
sycl::queue q(sycl::gpu_selector_v, properties); // select default GPU device
sycl::buffer<int> my_buffer(N);
...
int res = oneapi::dpl::reduce(oneapi::dpl::execution::make_device_policy(q),
                              oneapi::dpl::begin(my_buffer), oneapi::dpl::end(my_buffer));

When using the synchronous oneDPL APIs, the queue's execution mode may make little difference. However, if you migrate to the experimental oneDPL asynchronous APIs or use oneDPL alongside asynchronous handwritten kernels, the choice of queue execution property may significantly affect both performance and correctness.

If you wish to let the SYCL runtime select the execution device heuristically, the following policy type may be used:

...
sycl::buffer<int> my_buffer(N);
...
int res = oneapi::dpl::reduce(oneapi::dpl::execution::dpcpp_default,
                              oneapi::dpl::begin(my_buffer), oneapi::dpl::end(my_buffer));

This guide has discussed oneDPL policies for only commonly encountered migration scenarios for GPU device offload. If you would like to learn more about oneDPL policies, you can refer to the oneDPL Execution Policies.

Migration of Operators and Predicates

Migration of operators and predicates passed to oneDPL algorithms is a simple process. All oneDPL backends, including device offload with the SYCL 2020 programming model, support C++17 style lambdas and functors. In other words, operators and predicates passed to oneDPL algorithms may be structured like any other STL algorithm call. When using NVIDIA Thrust or CUB, you may be familiar with the following syntax of a CUDA C++ generic functor to execute on the device:

template <typename T>
struct my_unary_predicate {
    __device__
    bool operator()(T t) {
        return ...;
    }
};

Or with the non-generic lambda syntax in newer CUDA versions:

auto my_unary_predicate = [] __device__ (int t) -> bool { return ...; };

Migrating functors and lambdas for use in oneDPL is simple: all __host__ and __device__ tags must be removed, and all functors must be migrated so that their function call operator is const-qualified, per the SYCL specification. Additionally, generic lambdas are supported in the SYCL programming model. Equivalent migrations of the previous examples are:

template <typename T>
struct my_unary_predicate {
    bool operator()(T t) const {
        return ...;
    }
};

For the functor migration and lambda version:

auto my_unary_predicate = [] (int t) -> bool { return ...; };          // __device__ removed
auto my_generic_unary_predicate = [] (auto t) -> bool { return ...; }; // generic version if type is unknown

The previous examples are provided for user-defined operators and predicates. As discussed in Identify All CUDA APIs Migratable to oneDPL, many of the operators and predicates defined within the Thrust and CUB namespaces may have direct mappings to the STL functional header and dpct namespace within the Compatibility Tools headers.

Complete the Migration

The previous sections detailed the unique steps required to migrate Thrust and CUB APIs to oneDPL. Once the appropriate API mappings, policy types, and functor migrations have been identified, the manual migration may be performed. It is important to ensure that all Thrust and CUB header inclusions are replaced with the appropriate oneDPL headers. Furthermore, any standalone CUDA code must also be migrated to SYCL through the Compatibility Tools or manual migration.

Once the APIs have been migrated, functional testing of your application is highly encouraged to ensure the correctness of the migration and the resultant code. For assistance compiling your migrated code, please refer to the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference, particularly the Offload Compilation, OpenMP*, and Parallel Processing Options sections.

Performance Considerations

After your application has been successfully migrated, you should evaluate its performance. The following sections describe several key considerations.

Examine the Cost of Host and Device Transfers

While oneDPL accepts std::vector iterators to invoke device algorithms, additional overhead is incurred with these styles of algorithm calls. An initial transfer of host data to a device-accessible buffer is performed to enable access to data on the device, followed by a final transfer back to the host for relevant algorithms. When calling many oneDPL algorithms in succession, these transfers are incurred per algorithm. In situations where many oneDPL calls are present, it is more performant to let the SYCL runtime manage data access via a sycl::buffer or a USM shared allocation, or to manually copy your data to and from the device in USM device memory.
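
A sketch of this pattern with USM device memory is shown below; one explicit transfer is paid at each end rather than per algorithm. The names and sizes are illustrative:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/numeric>
#include <sycl/sycl.hpp>
#include <vector>
...
sycl::queue q;
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::vector<int> host_data(n, 1);

// One explicit host-to-device copy up front
int* d_data = sycl::malloc_device<int>(n, q);
q.memcpy(d_data, host_data.data(), n * sizeof(int)).wait();

// Successive oneDPL calls operate on device data with no implicit transfers
oneapi::dpl::sort(policy, d_data, d_data + n);
int sum = oneapi::dpl::reduce(policy, d_data, d_data + n, 0);

// One explicit device-to-host copy at the end, if needed
q.memcpy(host_data.data(), d_data, n * sizeof(int)).wait();
sycl::free(d_data, q);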

If you are working with small inputs, the cost associated with host-to-device offload may not be worth the benefit of massively parallel execution on the device. This is particularly true for data that your application frequently accesses on the host and that does not primarily live on the device. For data inputs of this type, parallel or vectorized execution with a oneDPL host policy may result in better performance. In such situations, you are encouraged to profile your migrated application with tools such as Profiling Tools Interfaces for GPU (PTI for GPU), Intel® Advisor, and Intel® VTune™ Profiler, which can assist in identifying transfer overheads and other performance bottlenecks. Moreover, the benefit observed from offloading oneDPL algorithms to your device is highly dependent on your CPU and GPU device architectures, and the best-performing oneDPL policies may differ depending on your hardware configuration.

Choose the Appropriate Queue Types

The default queue type in the SYCL programming model is the out-of-order queue. Managing the out-of-order property incurs small but meaningful overheads. If your application does not leverage the benefits of an out-of-order execution model within a SYCL queue, then switching to an in-order queue may improve performance, particularly when many kernel launches are present.

Consider Using oneDPL Experimental Features

Once your application has been migrated to use oneDPL APIs, you should see if your application may benefit from experimental oneDPL features. Some of these experimental features present in oneAPI 2024.1 are shown below. 

If your migrated application executes many logically independent oneDPL calls, you may observe benefits by leveraging Asynchronous API Algorithms. These nonblocking calls enable you to run several oneDPL algorithms concurrently. If your input sizes are small enough not to fully saturate your device’s compute units, then launching several asynchronous oneDPL calls may result in higher device occupancy. A device policy over the same out-of-order queue should be used in such instances. 
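
A sketch of this approach is shown below, assuming the experimental <oneapi/dpl/async> header and USM-allocated inputs a and b of length n:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/async>
...
sycl::queue q; // out-of-order queue by default
auto policy = oneapi::dpl::execution::make_device_policy(q);

// Launch two logically independent reductions without blocking the host
auto fut1 = oneapi::dpl::experimental::reduce_async(policy, a, a + n, 0);
auto fut2 = oneapi::dpl::experimental::reduce_async(policy, b, b + n, 0);

// Block only when the results are needed
int sum1 = fut1.get();
int sum2 = fut2.get();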

A oneDPL Range-Based API may be useful if your application performs complex transformations over input data. Instead of invoking several oneDPL calls in succession to perform your transformation, piping several range adaptors together in a single algorithmic call reduces the number of required kernels to perform your transformation. The reduced kernel launch and synchronization overhead, coupled with reduced passes over your input, may result in improved application performance. 

oneDPL kernel templates, a new feature introduced in oneAPI 2024.1, offer an opportunity for low-level optimization of oneDPL algorithms for specific device architectures. Tunable through C++ templates, you may configure work group sizes and work item processing volume to achieve the best performance for your use case. Furthermore, kernel templates may employ algorithmic techniques applicable to specific devices with certain preconditions (for example, forward progress guarantees), so ensuring your device meets these requirements is important. In oneAPI 2024.1, oneDPL exposes an experimental radix sort kernel template API that offers a potential performance boost over the architecture-independent oneDPL implementation if you wish to tune template parameters yourself.

Special Considerations When Migrating from CUB

The migration from Thrust to oneDPL is a straightforward process, as both libraries are motivated by the semantics of the C++ standard library. However, certain considerations must be made when migrating from NVIDIA CUB, as CUB algorithms operate only on the device. For example, in both Thrust and oneDPL, a reduction result is returned on the host, whereas in CUB the result is stored in a user-provided device allocation passed in the function call. When migrating from CUB, it is important to consider these API differences to ensure a functional and performant migration.

Querying API Migrations with the Compatibility Tools

The previous sections detailed the steps to manually migrate NVIDIA Thrust and CUB code to oneDPL. Even if you have decided not to use the Compatibility Tools to assist the migration of your application, the tools can still be used to get a sense of what an API migration to oneDPL may look like in your codebase. This is achieved through the --query-api-mapping=<val> flag, introduced in oneAPI 2024.0.

As an example of migrating a CUB API, see the migration of cub::DeviceReduce::Sum with oneAPI 2024.0 below:

> dpct --query-api-mapping=cub::DeviceReduce::Sum
  CUDA* API:
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cub::DeviceReduce::Sum(temp_storage/*void **/, temp_storage_bytes/*size_t &*/, d_in/*InputIteratorT*/, 
        d_out/*OutputIteratorT*/, num_items/*int*/, stream/*cudaStream_t*/);
  
  Is migrated to:
    dpct::queue_ptr stream;
    stream = dpct::get_current_device().create_queue();
    stream->fill(d_out, oneapi::dpl::reduce(oneapi::dpl::execution::device_policy(*stream), 
        d_in, d_in + num_items, typename std::iterator_traits<decltype(d_out)>::value_type{}), 1).wait();

As discussed in the previous performance section, differences exist between the CUB and oneDPL APIs. To achieve identical semantics to the CUB example in oneDPL, you must add an additional fill operation on top of the sycl::queue to store the result on the device, as oneapi::dpl::reduce returns a result on the host. 

See the following example of migrating thrust::all_of with the API query tool:

> dpct --query-api-mapping=thrust::all_of
    CUDA* API:
    ...
      /*5*/ thrust::all_of(thrust::device /*const thrust::detail::execution_policy_base<DerivedPolicy > &*/,
             d_A.begin() /*InputIterator*/,
             d_A.begin() + 2 /*InputIterator*/, thrust::identity<bool>());
    ...
    Is migrated to:
    ...
      /*5*/ oneapi::dpl::all_of(oneapi::dpl::execution::make_device_policy(q_ct1),
            d_A.begin(), d_A.begin() + 2, oneapi::dpl::identity());
    ...

In certain instances, the query tool displays migrations for different overloads with different policy types. For brevity, a single case was shown previously.

Lastly, consider the migration of thrust::set_symmetric_difference_by_key, which has no counterpart in the C++17 standard or the oneDPL extensions. In this case, the Compatibility Tools headers are used and referenced through the dpct namespace:

> dpct --query-api-mapping=thrust::set_symmetric_difference_by_key
CUDA* API:
  /*1*/ thrust::set_symmetric_difference_by_key(
          thrust::host /*const thrust::detail::execution_policy_base< DerivedPolicy> &*/,
          A_keys /*InputIterator1 */, A_keys + 7 /*InputIterator1 */,
          B_keys /*InputIterator2*/, B_keys + 5 /*InputIterator2*/,
          A_vals /*InputIterator3*/, B_vals /*InputIterator4*/,
          keys_result /*OutputIterator1*/, vals_result);
...
Is migrated to:
  /*1*/ dpct::set_symmetric_difference(oneapi::dpl::execution::seq, A_keys, A_keys + 7,
            B_keys, B_keys + 5, A_vals, B_vals, keys_result, vals_result);

Although this tool is not required to migrate to oneDPL, it may help you at the start of your migration process.

Next Steps

The Compatibility Tools support migration of calls from multiple CUDA libraries to SYCL. This guide only covers the manual migration of Thrust and CUB to oneDPL. For information on additional library migrations, please refer to the SYCLomatic Documentation.

Summary

This guide details the high-level steps for manually migrating applications that use the NVIDIA Thrust and CUB libraries to oneDPL, including the step-by-step migration process and various performance considerations. These steps can be combined with the Compatibility Tools to automatically migrate as many APIs as possible while manually migrating the rest, or the entire migration can be performed manually. Common oneDPL migration cases have been explored in this document; more complex scenarios and documentation may be found in the oneAPI samples and oneDPL documentation.

Notices and Disclaimers

Performance varies by use, configuration and other factors. 

No product or component can be absolutely secure.

Intel technologies may require enabled hardware, software or service activation.

All product plans and roadmaps are subject to change without notice.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. *Other names and brands may be claimed as the property of others. SYCL is a trademark of the Khronos Group Inc.