Verify your CUDA* to SYCL* Migration Using CodePin

with Intel® DPC++ Compatibility Tool

June 29, 2024

Get the Latest on All Things CODE

author-image

By

Ensure SYCL Code Correctness after Migration

The growing adoption of GPU offloading in applications that utilize parallel programming frameworks such as CUDA or SYCL is driven by the need for enhanced performance and efficiency.  As developers migrate code from CUDA to SYCL to leverage the benefits of SYCL’s platform independence and enhanced features, they encounter highly complex applications that require meticulous examination to ensure correct functionality post-migration. These applications span diverse domains in science, industry, security, and transportation. Frequently they combine advanced data analytics with AI based predictions for scenarios like emergency or medical response. In short, they are mission-critical, and results must be consistently reproducible across platforms using multivendor ingredients.

Migrated SYCL code might experience issues such as differences in arithmetic precision between hardware, Semantic differences between the CUDA and SYCL APIs, and errors introduced during the migration process. SYCLomatic and its binary distribution counterpart, the Intel® DPC++ Compatibility Tool, use syntax matching and advanced heuristics, but not all possibilities for divergent results may be captured. This necessitates strict code integrity maintenance and comprehensive unit testing for functional correctness. Ensuring the migrated codebase remains robust and reliable is crucial for achieving the desired outcomes in advanced technological fields.

CodePin Instrumentation to the Rescue

When migrating your complex GPU-accelerated applications to SYCL, verifying the functional correctness of the migrated codebase as early as possible in the software development cycle is crucial and will eliminate many developer headaches. Ensuring the code has been correctly and successfully migrated allows it to fully benefit from SYCL’s promise of scalability across various GPUs and compute accelerators.  The Intel DPC++ Compatibility Tool and SYCLomatic CodePin Instrumentation feature facilitate this process by enabling you to pinpoint inconsistencies.

CodePin reduces the need for extensive debugging by providing capabilities for on-the-fly functional testing during development and migration.

This integration ensures that you can identify and address issues promptly, maintaining the integrity and performance of your application throughout the migration process.

How CodePin Works

To utilize CodePin during migration, simply pass the option ‘—enable-codepin’.

For example:

$ dpct inlinePTX.cu –enable-codepin

You can follow the standard SYCLomatic migration process detailed in this link.

Instrumentation APIs:

The instrumentation implicitly calls and leverages the following two function APIs to create the instrumentation call wrappers for each migrated code segment:

  • gen_prolog_API_CP(): Instrumentation function that generates a prologue for a specific code segment. It prepares the environment for the instrumented code and logs the start of an operation.
  • gen_epilog_API_CP(): Instrumentation function that generates an epilogue for a specific code segment. It finalizes the environment for the instrumented code and logs the completion of an operation.

CodePin instrumentation and analysis follows 3 easy steps:

  1. Instrumentation and Compilation
  2. Comparing Runtime Behavior
  3. Analyzing CodePin Results

1. Instrumentation and Compilation Process

After enabling CodePin, the instrumented program files will be placed in two folders: ‘dpct_output_codepin_cuda’ and ‘dpct_output_codepin_sycl’

Compile the SYCL Code:

  • Navigate to the ‘dpct_output_codepin_sycl’ folder.
  • Compile the SYCL code using the command: ‘$ dpcpp inlinePTX.dp.cpp’
  • Execute the generated binary. This will produce a ‘CodePin_SYCL_<date>.json’ log file, which contains runtime behavior data for the SYCL code

Compile the CUDA Code 

  • Navigate to the ‘dpct_output_codepin_cuda’ folder
  • Compile the instrumented CUDA source code using the NVIDIA* CUDA Compiler driver ‘nvcc’
  • After execution, a ‘CodePin_CUDA_<date>.json’ log file will be generated, which contains runtime behavior data for the CUDA code.

2. Comparing Runtime Behavior

By comparing the reports

‘CodePin_SYCL_<date>.json’

and

‘CodePin_CUDA_<date>.json',

developers can gain insights into the runtime behavior of both the SYCL and CUDA versions of their applications. This comparison helps identify discrepancies and ensure that the migrated SYCL code performs as expected, thereby streamlining the debugging and optimization process.

3. Analyzing the CodePin Results

The ‘codepin-report.py’ script, part of the compatibility tool, processes execution log files from both CUDA and SYCL code to automate analysis. It detects inconsistent data values and provides statistical data on the execution.

$ codepin-report.py [-h] --instrumented-cuda-log <file path> --instrumented-sycl-log <file path>

Example of Instrumented CUDA and SYCL code

Let us take the guided_inlinePTX_SYCLMigration example as a reference.
Here, you can see what the migrated SYCL code looks like after running SYCLomatic without CodePin instrumentation.

This sample demonstrates inline PTX (assembly language) usage in SYCL kernels. The original CUDA* source code is migrated to SYCL for portability across GPUs from multiple vendors. The usage of CodePin instrumentation and the insights we can derive from it are the same for any similar type of sample code.

1. CodePin Instrumented SYCL Program

dpct_output_codepin_sycl/inlinePTX.dp.cpp is the migrated and instrumented SYCL program:


/* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

/*
 * Demonstration of inline PTX (assembly language) usage in CUDA kernels
 */

#include <sycl/sycl.hpp>
#include <dpct/dpct.hpp>

#include <dpct/codepin/codepin.hpp>
#include "../../../codepin_autogen_util.hpp"
#include <stdio.h>
#include <assert.h>

#include <helper_functions.h>
#include <helper_cuda.h>

void sequence_gpu(int *d_ptr, int length, const sycl::nd_item<3> &item_ct1)
{
    int elemID = item_ct1.get_group(2) * item_ct1.get_local_range(2) +
                 item_ct1.get_local_id(2);

    if (elemID < length)
    {
        unsigned int laneid;
        //This command gets the lane ID within the current warp
        laneid = item_ct1.get_sub_group().get_local_linear_id();
        d_ptr[elemID] = laneid;
    }
}

void sequence_cpu(int *h_ptr, int length)
{
    for (int elemID=0; elemID<length; elemID++)
    {
        h_ptr[elemID] = elemID % 32;
    }
}

int main(int argc, char **argv)
{
    printf("CUDA inline PTX assembler sample\n");

    const int N = 1000;

    int dev = findCudaDevice(argc, (const char **) argv);

    if (dev == -1)
    {
        return EXIT_FAILURE;
    }

    int *d_ptr;
    checkCudaErrors(DPCT_CHECK_ERROR(
        d_ptr = sycl::malloc_device<int>(N, dpct::get_in_order_queue())));
dpctexp::codepin::get_ptr_size_map()[d_ptr] = N * sizeof(int);

    int *h_ptr;
    checkCudaErrors(DPCT_CHECK_ERROR(
        h_ptr = sycl::malloc_host<int>(N, dpct::get_in_order_queue())));

    sycl::range<3> cudaBlockSize(1, 1, 256);
    sycl::range<3> cudaGridSize(1, 1,
                                (N + cudaBlockSize[2] - 1) / cudaBlockSize[2]);

    dpctexp::codepin::gen_prolog_API_CP(
        "/inlinePTX/inlinePTX.cu:86:5",
        &dpct::get_in_order_queue(), "d_ptr", d_ptr, "N", N);

    dpct::get_in_order_queue().parallel_for(
        sycl::nd_range<3>(cudaGridSize * cudaBlockSize, cudaBlockSize),
        [=](sycl::nd_item<3> item_ct1) {
            sequence_gpu(d_ptr, N, item_ct1);
        });
    
    dpctexp::codepin::gen_epilog_API_CP(
        "/inlinePTX/inlinePTX.cu:86:5",
        &dpct::get_in_order_queue(), "d_ptr", d_ptr, "N", N);

    checkCudaErrors(
        DPCT_CHECK_ERROR(dpct::get_current_device().queues_wait_and_throw()));

    sequence_cpu(h_ptr, N);

    int *h_d_ptr;
    checkCudaErrors(DPCT_CHECK_ERROR(
        h_d_ptr = sycl::malloc_host<int>(N, dpct::get_in_order_queue())));
    checkCudaErrors(
        DPCT_CHECK_ERROR(dpct::get_in_order_queue()
                             .memcpy(h_d_ptr, d_ptr, N * sizeof(int))
                             .wait()));

    bool bValid = true;
    for (int i=0; i<N && bValid; i++)
    {
        if (h_ptr[i] != h_d_ptr[i])
        {
            bValid = false;
        }
    }

   
    return bValid ? EXIT_SUCCESS: EXIT_FAILURE;
}


/*
Execution Result:
CUDA inline PTX assembler sample
GPU Device 0: "Kepler" with compute capability 3.0

get_memory_info: ext_intel_free_memory is not supported.
get_memory_info: ext_intel_free_memory is not supported.
Test Failed. <--- incorrect result

*/   

 

 

2. CodePin Instrumented CUDA Program

dpct_output_codepin_cuda/inlinePTX.cu is the instrumented CUDA program: 


/* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

/*
 * Demonstration of inline PTX (assembly language) usage in CUDA kernels
 */ 

#include <dpct/codepin/codepin.hpp>
#include "../../../codepin_autogen_util.hpp"
#include <stdio.h>
#include <assert.h>

#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>

__global__ void sequence_gpu(int *d_ptr, int length)
{
    int elemID = blockIdx.x * blockDim.x + threadIdx.x;

    if (elemID < length)
    {
        unsigned int laneid;
        //This command gets the lane ID within the current warp
        asm("mov.u32 %0, %%laneid;" : "=r"(laneid));
        d_ptr[elemID] = laneid;
    }
}

void sequence_cpu(int *h_ptr, int length)
{
    for (int elemID=0; elemID<length; elemID++)
    {
        h_ptr[elemID] = elemID % 32;
    }
}

int main(int argc, char **argv)
{
    const int N = 1000;
    int dev = findCudaDevice(argc, (const char **) argv);

    if (dev == -1)
    {
        return EXIT_FAILURE;
    }

    int *d_ptr;
    checkCudaErrors(cudaMalloc(&d_ptr, N * sizeof(int)));
dpctexp::codepin::get_ptr_size_map()[d_ptr] = N * sizeof(int);

    int *h_ptr;
    checkCudaErrors(cudaMallocHost(&h_ptr, N * sizeof(int)));

    dim3 cudaBlockSize(256,1,1);
    dim3 cudaGridSize((N + cudaBlockSize.x - 1) / cudaBlockSize.x, 1, 1);
    dpctexp::codepin::gen_prolog_API_CP(
        "/inlinePTX/inlinePTX.cu:86:5",
        0, "d_ptr", d_ptr, "N", N);
sequence_gpu<<<cudaGridSize, cudaBlockSize>>>(d_ptr, N);
    dpctexp::codepin::gen_epilog_API_CP(
        "/inlinePTX/inlinePTX.cu:86:5",
        0, "d_ptr", d_ptr, "N", N);

    sequence_cpu(h_ptr, N);

    int *h_d_ptr;
    checkCudaErrors(cudaMallocHost(&h_d_ptr, N *sizeof(int)));
    checkCudaErrors(cudaMemcpy(h_d_ptr, d_ptr, N *sizeof(int), cudaMemcpyDeviceToHost));

    bool bValid = true;

    for (int i=0; i<N && bValid; i++)
    {
        if (h_ptr[i] != h_d_ptr[i])
        {
            bValid = false;
        }
    }


    return bValid ? EXIT_SUCCESS: EXIT_FAILURE;
}

/*
Execution Result:
CUDA inline PTX assembler sample
GPU Device 0: "Ampere" with compute capability 8.0
Test Passed
*/

 

3. Identifying and Fixing the Issue

After compiling ‘dpct_output_codepin_sycl/inlinePTX.dp.cpp & dpct_output_codepin_cuda/inlinePTX.cu’, and executing the resulting binaries, the following log files will be generated:

Report for the instrumented CUDA program Report for the instrumented migrated SYCL program
"ID": " inlinePTX.cu:86:5:epilog",
     "Free Device Memory": "84528594944",
    "Total Device Memory": "84974239744",
    "Elapse Time(ms)": "71.8582",
    "CheckPoint": {
      "d_ptr": {
        "Type": "Pointer",
        "Data": [
          {
            "Type": "int",
            "Data": [
              0
            ]
          },
          {
            "Type": "int",
            "Data": [
              1
            ]
          },        
          ……
          {
            "Type": "int",
            "Data": [
              15
            ]
          },
          {
            "Type": "int",
            "Data": [
              16
            ]
          },     
         {
            "Type": "int",
            "Data": [
              17
            ]
          },    …..  32     
"ID": " inlinePTX.cu:86:5:epilog",
    "Free Device Memory": "0",
    "Total Device Memory": "540750516224",
    "Elapse Time(ms)": "0",
    "CheckPoint": {
      "d_ptr": {
        "Type": "Pointer",
        "Data": [
          {
            "Type": "int",
            "Data": [
              0
            ]
          },
          {
            "Type": "int",
            "Data": [
              1
            ]
          },        
          ……
          {
            "Type": "int",
            "Data": [
              15
            ]
          },
          {
            "Type": "int",
            "Data": [
              0
            ]
          },     
         {
            "Type": "int",
            "Data": [
              1
            ]
          },   ….

 

Using CodePin instrumented code and comparing CodePin log files, epilog in this case,  developers can verify SYCL and CUDA results and determine that the ‘d_ptr[]’ result is incorrect. The issue arises because the warp size in CUDA is a fixed constant of 32, whereas, in SYCL, the sub-group size is typically 16 or 32. To address this issue, the sub-group size can be restricted to 32, as recommended here, based on the indications from the CodePin log.

Moving to SYCL is Efficient and Reliable  

In conclusion, CodePin emerges as an invaluable tool for developers navigating the intricate landscape of CUDA to SYCL migration. By enabling comprehensive analysis and detection of inconsistencies in execution log files between CUDA and SYCL codebases, CodePin not only streamlines the migration process but also enhances code integrity and reliability. With its ability to pinpoint and report data discrepancies while providing insightful statistical analysis, CodePin empowers developers to confidently adopt SYCL for diverse GPU-accelerated applications, ensuring optimized performance across various hardware platforms.

Development of CodePin instrumentation to ensure an efficient and reliable SYCL migration process continues at full speed. Among the targeted features to enhance the tool’s capabilities further is the addition of assisted unit test case creation for functional verification in the continuous integration/continuous development (CI/CD).

Stay tuned as we continue to make multiarchitecture and multivendor GPU programming easy and more efficient.  

Migrate Your High-Accuracy Application to SYCL Today

Get the stand-alone version of the Intel DPC++ Compatibility Tool or as part of the Intel® oneAPI Base Toolkit—a core set of tools and libraries for developing high-performance, data-centric applications across diverse architectures.

You may also want to check out and provide feedback on the SYCLomatic open source distribution on GitHub*.

Additional Resources