SYCL* Interoperability: A Deep Dive into Bridging CUDA* and oneAPI

Using the Intel® DPC++ Compatibility Tool or SYCLomatic to migrate CUDA* code to C++ with SYCL* frequently gets you most of the way toward freeing your existing CUDA-based accelerated compute workload from vendor lock-in so it can run on a wide range of multiarchitecture platforms.

Nevertheless, running the migration tool is only the first major step in porting your codebase. A few challenges usually remain after the initial migration: the developer often needs to identify and apply workarounds to address the warnings and errors that SYCLomatic reports during and after migration.

Architectural differences between the CUDA and SYCL APIs can also affect the performance of parts of the workload, and those effects need to be understood and resolved.

The main objective of moving from CUDA to SYCL is software portability across different platform configurations, but performance portability has to be a goal as well. Ensuring it thus becomes a key aspect of completing the migration.

However, some CUDA APIs have no immediately obvious direct match in the SYCL API or the associated oneAPI ecosystem libraries.

Imagine a mechanism that functions as a translator between SYCL and a vendor-specific backend.

In this blog, we will dive a bit deeper into how SYCL Interoperability enables such a translation layer and will discuss how the oneAPI ecosystem is ready to bridge the gaps between varying APIs, making software migration even more seamless.

Challenge: Non-Migrated CUDA APIs     

One of the most common and complex challenges is finding workarounds for non-migrated CUDA APIs. Doing so often requires redesigning or rewriting alternate logic that respects the differences between the features SYCL and CUDA support, so that compilation and validation can proceed.

However, transitioning from an established CUDA environment to the new, rapidly evolving oneAPI ecosystem with such workarounds often surfaces daunting nuances in the performance, coding semantics, and paradigms of the two ecosystems, especially when overcoming hard gaps between their functional offerings.

One such scenario is migrating CUDA applications that use cuSPARSE APIs, for which no exact equivalent SYCL APIs are yet available to run on an Nvidia platform.

This is no longer a barrier to migration because of the interoperability feature offered by SYCL.

Interoperability: Call CUDA APIs from within SYCL

In collaboration with the Khronos SYCL working group and with contributions from companies like Codeplay, Intel has been instrumental in enhancing the SYCL specification. One of SYCL's standout features is its emphasis on vendor independence.

This has paved the way for SYCL's interoperability feature, which allows direct invocation of CUDA APIs from within SYCL code. This is not just a theoretical concept; it is a practical, mature solution already used in oneAPI libraries like oneMKL and oneDNN to support Nvidia and AMD platforms.

It is also revolutionary from the performance standpoint, as developers can now harness CUDA's established functionalities from within the oneAPI framework without degradation in performance. The heavy lifting, intricate translations, and adjustments are all adeptly managed by the LLVM-based Intel® oneAPI DPC++/C++ Compiler in the background.

SYCL runtime classes can encapsulate an object unique to a hardware vendor’s SYCL backend.  The hardware-specific backend underpins the functionality of that class and provides an interface between the SYCL runtime object and the native backend object (CUDA/HIP). It enables interoperability between SYCL and the associated SYCL backend API from within an application.

Thus, SYCL interoperability provides a mechanism that functions as a translator between SYCL and a backend (such as CUDA) by allowing the users to access and manipulate the underlying backends’ objects through SYCL objects (queue, device, context, memory, etc.) and to execute native backend APIs (runtime, driver, and library) on the associated platform (such as Nvidia, AMD, etc.).

SYCL Interoperability in Action

SYCL interoperability provides a set of get_native free functions to derive native backend objects for all the SYCL objects (context, device, memory, queue, etc.).
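
For example, a minimal sketch of deriving the native CUDA stream behind a SYCL queue might look as follows. This assumes a DPC++ compiler with the CUDA backend plugin available; the queue name q is illustrative:

#include <sycl/sycl.hpp>
#include <cuda_runtime.h>

int main() {
    // Minimal sketch: assumes a DPC++ build with the CUDA backend available.
    sycl::queue q{sycl::gpu_selector_v};

    // get_native maps a SYCL object to the requested backend's native object;
    // for a queue on the CUDA backend, this is the underlying CUDA stream.
    cudaStream_t nativeStream =
        sycl::get_native<sycl::backend::ext_oneapi_cuda>(q);

    // The native stream can now be passed to CUDA runtime or library APIs.
    cudaStreamSynchronize(nativeStream);
    return 0;
}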

In addition, backend-specific operations can be performed from inside a SYCL task graph using the SYCL host_task and interop_handle classes, which provide access to the underlying native objects for the platform/device. The SYCL runtime schedules host_task execution like any other command.
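
As an illustrative sketch (again assuming a queue q bound to the CUDA backend), a host_task can query the native stream through its interop_handle:

q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) {
        // Query the native CUDA stream for the device this queue targets;
        // the SYCL runtime schedules this task within the SYCL task graph.
        cudaStream_t stream =
            ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
        // ... call native CUDA runtime/driver/library APIs on `stream` ...
    });
});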

Using SYCL interoperability in an existing, partially migrated SYCL application is straightforward:

  1. List all the parameters for the non-migrated CUDA APIs.
  2. Use CUDA equivalents directly instead of any partially migrated pass-by-value parameters (like enums) that SYCLomatic might have created.
  3. For the other parameters holding device memory, use get_native functions or interop_handle to get the native CUDA objects for memory space managed by SYCL.

Below is a small code snippet illustrating the use of SYCL interoperability in the migration of a simple CUDA application (full source code available here) that performs matrix-vector multiplication using cuSPARSE APIs, for which no equivalent oneAPI library APIs support the CUDA backend yet.

spMV.cu

#include <iostream>
#include <cuda_runtime.h>
#include <cusparse.h>

int main() {

    // Initialize cuSPARSE library
    ...

    // Memory management
    ...

    // Create matrix and vector descriptors
    cusparseCreateCsr(&matA, 3, 3, 4, d_csrRowPtr, d_csrColInd, d_csrVal, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, 3, d_x, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, 3, d_y, CUDA_R_32F);

    float alpha = 1.0f;
    float beta = 0.0f;

    // Buffer creation and allocation
    ...

    // Perform matrix-vector multiplication: y = A*x
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, vecX, &beta, vecY, CUDA_R_32F, CUSPARSE_MV_ALG_DEFAULT, dBuffer);

    // Copy result back to host
    ...

    // Cleanup
    ...

    return 0;
}

Migrating this application with SYCLomatic produces the following code, with DPCT1007 warnings flagging the cuSPARSE APIs whose migration is not supported:

spMV.dp.cpp

#include <sycl/sycl.hpp>
#include <dpct/dpct.hpp>
#include <iostream>
#include <dpct/sparse_utils.hpp>
#include <dpct/blas_utils.hpp>
#include <dpct/lib_common_utils.hpp>

int main() {

    // Initialize cuSPARSE library
    ...
    
    // Memory Management
    ...

    // Create matrix and vector descriptors
    /*
    DPCT1007:0: Migration of cusparseCreateCsr is not supported.
    */
    cusparseCreateCsr(&matA, 3, 3, 4, d_csrRowPtr, d_csrColInd, d_csrVal,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      oneapi::mkl::index_base::zero,
                      dpct::library_data_t::real_float);
    /*
    DPCT1007:1: Migration of cusparseCreateDnVec is not supported.
    */
    cusparseCreateDnVec(&vecX, 3, d_x, dpct::library_data_t::real_float);
    /*
    DPCT1007:2: Migration of cusparseCreateDnVec is not supported.
    */
    cusparseCreateDnVec(&vecY, 3, d_y, dpct::library_data_t::real_float);

    float alpha = 1.0f;
    float beta = 0.0f;

    // Buffer creation and allocation
    ...

    // Perform matrix-vector multiplication: y = A*x
    /*
    DPCT1007:4: Migration of cusparseSpMV is not supported.
    */
    cusparseSpMV(handle, oneapi::mkl::transpose::nontrans, &alpha, matA, vecX,
                 &beta, vecY, dpct::library_data_t::real_float,
                 CUSPARSE_MV_ALG_DEFAULT, dBuffer);
    
    // Copy result back to host
    ...

    // Cleanup
    ...

    return 0;
}

After migration, SYCLomatic reports DPCT1007 warnings about the non-migrated cuSPARSE APIs. The resulting code does not compile and cannot run on Nvidia GPUs as-is. However, with SYCL interoperability, this partially migrated application can be made to run fully on Nvidia hardware with only a few changes, as shown below:

spMV_with_SYCL_Interop.dp.cpp

#include <sycl/sycl.hpp>
#include <dpct/dpct.hpp>
#include <iostream>
#include <dpct/sparse_utils.hpp>
#include <dpct/blas_utils.hpp>
#include <dpct/lib_common_utils.hpp>
#include <cusparse.h>

int main() {
    
    // Initialize cuSPARSE library
    ...

    // Memory Management
    ...

    // Create matrix and vector descriptors
    /*
    DPCT1007:0: Migration of cusparseCreateCsr is not supported.
    */
    // SYCL Interop
    cusparseCreateCsr(&matA, 3, 3, 4, d_csrRowPtr, d_csrColInd, d_csrVal, 
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I, 
                      CUSPARSE_INDEX_BASE_ZERO, 
                      CUDA_R_32F);
    /*
    DPCT1007:1: Migration of cusparseCreateDnVec is not supported.
    */
    // SYCL Interop
    cusparseCreateDnVec(&vecX, 3, d_x, CUDA_R_32F);

    /*
    DPCT1007:2: Migration of cusparseCreateDnVec is not supported.
    */
    // SYCL Interop
    cusparseCreateDnVec(&vecY, 3, d_y, CUDA_R_32F);

    float alpha = 1.0f;
    float beta = 0.0f;

    // Determine buffer size required for the cusparseSpMV operation
    size_t bufferSize = 0;
    /*
    DPCT1007:3: Migration of cusparseSpMV_bufferSize is not supported.
    */
    // SYCL Interop
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
        matA, vecX, &beta, vecY,
        CUDA_R_32F,
        CUSPARSE_MV_ALG_DEFAULT, &bufferSize);

    // Buffer creation and allocation
    ...

    // Perform matrix-vector multiplication: y = A*x
    /*
    DPCT1007:4: Migration of cusparseSpMV is not supported.
    */
    // SYCL Interop
    // Derive the native CUDA stream backing the SYCL queue and attach it to
    // the cuSPARSE handle so the SpMV runs on the queue's stream.
    cudaStream_t nativeStream = sycl::get_native<sycl::backend::ext_oneapi_cuda>(q_ct1);
    cusparseSetStream(handle, nativeStream);
    
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, vecX,
        &beta, vecY, CUDA_R_32F,
        CUSPARSE_MV_ALG_DEFAULT, dBuffer);

    // Copy result back to host
    ...

    // Cleanup
    ...

    return 0;
}

Here, SYCL interoperability is best illustrated by deriving the native stream for the cusparseSpMV API from a SYCL queue with the sycl::get_native function, passing the requested backend as a template argument.

Note: Compiling SYCL code that uses interoperability requires adding the native CUDA headers to the compiler's include path and linking against the corresponding CUDA libraries.
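
For illustration only, a build command might look like the one below; the compiler driver, target triple, and CUDA installation paths are assumptions that depend on your toolchain setup:

icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda spMV_with_SYCL_Interop.dp.cpp \
    -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcusparse -lcudart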

Full CUDA Functionality via SYCL

As Intel and its ecosystem partners continue to enhance oneAPI’s offerings, SYCL's interoperability provides a solution for multi-vendor hardware support.

It offers a clear, efficient, and performance-oriented path between CUDA and oneAPI, allowing developers to combine the strengths of both ecosystems and drive SYCL-based projects to production readiness.

Developers can harness CUDA's functionality from within the oneAPI framework without degradation in performance.

Additional Resources

Get the Software

We encourage you to Get Started with the Intel DPC++ Compatibility Tool for efficient CUDA-to-SYCL code migration. 

Check out this extensive list of migrated code samples with detailed guidance and before/after comparisons. They can all be found on the Migrate from CUDA to C++ with SYCL Portal.

Explore more useful resources to start your journey from CUDA to SYCL:

You can get the Intel DPC++ Compatibility Tool as part of the Intel oneAPI Base Toolkit.