Migrating the CFD Poisson Solver from CUDA* to SYCL* Achieved Up to a 1.9x Performance Improvement

This article was also published on oneAPI.io.

The unified code base of the CFD solver can run seamlessly across different GPU architectures.

Summary

Indian Institute of Technology Goa (IIT Goa), a premier Indian academic institute, used tools in the Intel® oneAPI Base Toolkit to free itself from vendor hardware lock-in by migrating its 2D Poisson Equation Solver from CUDA* to SYCL*. As a result, the solver ran approximately 1.9x faster on Intel® Data Center GPU Max Series 1550 than the original CUDA code on an NVIDIA* A100 GPU.

Introduction

IIT Goa offers state-of-the-art education, research, and training in science and technology to impact society, environment, and global challenges.

The Navier-Stokes equations govern a wide range of fluid-flow processes. They remain a famous million-dollar unsolved problem in the mathematics community because no general exact solution is known. Scientists in the computational fluid dynamics (CFD) community have nevertheless been developing Navier-Stokes solvers that use numerical methods to study the flow fields of various physical processes. Because the complexity of the space and time variables in the Navier-Stokes equations makes such solvers difficult to test, a Poisson equation solver is widely used as a standard test case across the scientific domain. The Poisson equation is also the governing equation for physical phenomena such as heat transfer and diffusion. Hence, we used a 2D Poisson equation solver developed by IIT Goa for our performance portability study using oneAPI, and we may later extend the study to state-of-the-art CFD solvers.

Figure 1. Sample computational stencil and mapping to 1D Array
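
As a rough illustration of the stencil-to-array mapping that Figure 1 refers to, the following C++ sketch stores a 2D grid in a row-major 1D array and applies one Jacobi update of the standard 5-point stencil for the Poisson equation. The row-major layout, the uniform grid spacing `h`, the choice of a Jacobi scheme, and the names `idx` and `jacobi_sweep` are all assumptions made for this example; the article does not show the layout or iteration scheme that the IIT Goa solver actually uses.

```cpp
#include <cstddef>
#include <vector>

// Map 2D grid coordinates (i = column, j = row) to a row-major 1D index.
// Assumed layout for illustration only.
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t nx) {
  return j * nx + i;
}

// One Jacobi sweep for the discrete Poisson equation laplacian(u) = f on a
// uniform nx-by-ny grid with spacing h. Only interior points are updated, so
// boundary values in u_new are left as they are (Dirichlet conditions).
void jacobi_sweep(const std::vector<double>& u_old, std::vector<double>& u_new,
                  const std::vector<double>& f, std::size_t nx, std::size_t ny,
                  double h) {
  for (std::size_t j = 1; j + 1 < ny; ++j) {
    for (std::size_t i = 1; i + 1 < nx; ++i) {
      // 5-point stencil: west, east, south, and north neighbors.
      u_new[idx(i, j, nx)] =
          0.25 * (u_old[idx(i - 1, j, nx)] + u_old[idx(i + 1, j, nx)] +
                  u_old[idx(i, j - 1, nx)] + u_old[idx(i, j + 1, nx)] -
                  h * h * f[idx(i, j, nx)]);
    }
  }
}
```

Each interior point depends only on values from the previous iterate, which is why this kind of update maps naturally onto one GPU work-item per grid point.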

 

Challenge: Vendor Hardware Lock-In

Fueled by high computational throughput and energy efficiency, GPUs have been quickly adopted as computing engines for high-performance computing (HPC) applications in recent years. The growing representation of heterogeneous architectures combined with general-purpose multicore platforms and accelerators has led to the redesign of codes for parallel applications.

The Poisson equation solver was developed in C to efficiently use the multicore CPU architecture and in CUDA* C to use the NVIDIA GPU architecture. The existing CUDA C program cannot run on any GPU or accelerator other than an NVIDIA GPU. This limits the choice of architectures, creates vendor lock-in, and forces developers to maintain separate code bases for CPU and GPU architectures.

Solution: Build a Unified Code Base Using Intel® Tools

 

Intel® tools enable applications to be written in a single language and then ported to, and optimized for, multiple heterogeneous architectures. Using Intel tools and libraries, the CUDA-native application was migrated to SYCL*, enabling it to run seamlessly on multiple architectures such as Intel CPUs, Intel GPUs, and NVIDIA GPUs.

The result: The migrated SYCL version of the Poisson equation solver is now a unified code base that runs on multiple architectures without losing performance or accuracy. Optimizations made this solution more attractive because it delivers a unified code base with a boost in performance and without vendor lock-in.

Code Migration to SYCL*

The Poisson Equation Solver has CPU (C++) and GPU (CUDA) sources. As a first step, the Intel® DPC++ Compatibility Tool (available as part of the Intel oneAPI Base Toolkit) was used to migrate the CUDA source to SYCL. In this case, the Intel DPC++ Compatibility Tool achieved 100% migration in a short time, which completed the functional port. Figures 2 and 3 show a snippet of the CUDA source code and the corresponding migrated SYCL source code.

Figure 2. Snippet of CUDA source code before migration

 

Figure 3. Snippet of SYCL source code after migration

 

Our research group at IIT Goa has developed an in-house CFD solver to simulate incompressible turbulent flows. To port our solver to GPUs, we are working with the unsteady Poisson equation, a model template for our CFD solver. The Poisson equation code is developed in CUDA, limiting its execution to vendor-specific GPUs. Using the SYCLomatic tool of Intel tools, this code is migrated to SYCL, thus opening up other vendor and architectural alternatives. After further optimization, the migrated code runs on Intel Data Center GPU Max Series 1550 and achieves approximately 1.9x speedup compared to the existing GPU solution. We look forward to using the migrated SYCL code on different platform architectures and migrating our CFD solver to SYCL.

– IIT Goa, India

Run and Validate Optimized Results

As a next step, the migrated SYCL code was compiled with the Intel® oneAPI DPC++/C++ Compiler to generate the executable. To compile the code for NVIDIA GPUs, the Codeplay* oneAPI for NVIDIA GPUs plug-in was used, which adds NVIDIA GPU support to the Intel oneAPI Base Toolkit. The executable runs until a convergence criterion is met. The result was validated by comparing it with the results generated by the CUDA executable.
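
The run-until-converged and validation steps described above can be sketched in plain C++ as follows. This is a minimal sketch under stated assumptions: the article does not give the convergence criterion, the tolerance, or the comparison method, so the max-norm change between iterations, the pointwise tolerance check, and the names `max_abs_diff`, `iterate_until_converged`, and `results_match` are all hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Largest absolute pointwise difference between two fields of equal size.
double max_abs_diff(const std::vector<double>& a, const std::vector<double>& b) {
  double m = 0.0;
  for (std::size_t k = 0; k < a.size(); ++k) {
    m = std::max(m, std::fabs(a[k] - b[k]));
  }
  return m;
}

using Step = std::function<void(const std::vector<double>&, std::vector<double>&)>;

// Repeatedly applies `step` (current field in, next field out) until the
// change between two iterates drops below `tol` or `max_iters` is reached.
// Returns the number of iterations performed; `u` holds the last iterate.
int iterate_until_converged(std::vector<double>& u, const Step& step,
                            double tol, int max_iters) {
  std::vector<double> next(u);  // copy so untouched boundary values carry over
  for (int it = 1; it <= max_iters; ++it) {
    step(u, next);
    const double change = max_abs_diff(u, next);
    u.swap(next);
    if (change < tol) return it;
  }
  return max_iters;
}

// True if a test output agrees with a reference output within `tol` at every
// point, for example SYCL results checked against CUDA results.
bool results_match(const std::vector<double>& ref,
                   const std::vector<double>& test, double tol) {
  return ref.size() == test.size() && max_abs_diff(ref, test) <= tol;
}
```

Comparing with a tolerance rather than for bit-exact equality is the usual choice here, because different compilers and GPUs may order floating-point operations differently.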

Figure 4. Plots of the output results from the original CUDA code and migrated SYCL code

 

Performance Results and Optimization on an NVIDIA* A100 GPU

The migrated SYCL code was run on an NVIDIA A100 GPU, and some performance degradation was seen when compared to the CUDA code.

Figure 5. Performance comparison of migrated SYCL code against CUDA code on an NVIDIA A100 GPU

To identify the cause of the performance regression in the SYCL code, we used the NVIDIA Nsight* Systems profiler to generate application profiles for both the CUDA and SYCL code bases and analyzed them in more depth.

Figure 6. CUDA API summary of CUDA code on an NVIDIA A100 GPU

 

Figure 7. CUDA API summary of migrated SYCL code on an NVIDIA A100 GPU

In the CUDA API summary part of the generated profile for SYCL code, it was observed that there were unnecessary calls to three functions, namely cuEventCreate, cuEventRecord, and cuEventDestroy_v2. Intel oneAPI DPC++/C++ Compiler unnecessarily created these events for the NVIDIA back end.

To overcome this, we used an extension that introduces a discard_events property (ext::oneapi::property::queue::discard_events) for SYCL queues. SYCL queues are used to submit work to the device. Each submission returns an event that the application can use for synchronization and for other purposes such as event-based dependencies. By using the discard_events property, the application informs the SYCL implementation that it will not use the events returned by any of the queue member functions.
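
A queue with this property might be created as in the following sketch. Several things here are assumptions rather than the solver's actual code: the helper `double_all` and its trivial kernel are hypothetical, pairing the property with an in-order queue and USM is a design choice for this example, and the `__has_include` guard with a host fallback is only there so the file still builds with a compiler that has no SYCL headers. The property is a DPC++ extension, so its availability should be checked against the extension documentation for the compiler version in use.

```cpp
#include <cstddef>
#include <vector>

#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>
#define SKETCH_HAVE_SYCL 1
#endif

// Hypothetical helper: doubles every element, on a device when SYCL is present.
std::vector<float> double_all(std::vector<float> data) {
  if (data.empty()) return data;
#ifdef SKETCH_HAVE_SYCL
  // In-order queue that tells the runtime the returned events are never used.
  sycl::queue q{sycl::default_selector_v,
                sycl::property_list{
                    sycl::property::queue::in_order{},
                    sycl::ext::oneapi::property::queue::discard_events{}}};

  const std::size_t n = data.size();
  float* d = sycl::malloc_device<float>(n, q);

  // Return values (events) are deliberately ignored. Ordering comes from the
  // in-order queue, not from event dependencies.
  q.memcpy(d, data.data(), n * sizeof(float));
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { d[i] *= 2.0f; });
  q.memcpy(data.data(), d, n * sizeof(float));

  q.wait();  // queue-level synchronization does not need an event object
  sycl::free(d, q);
#else
  // Host fallback so the sketch also runs without a SYCL toolchain.
  for (float& v : data) v *= 2.0f;
#endif
  return data;
}
```

Because no code holds on to the events, synchronization has to go through the queue itself, as the `q.wait()` call does here.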

After incorporating this change, the extra calls to the three functions (cuEventCreate, cuEventRecord, and cuEventDestroy_v2) were eliminated, and the SYCL code achieved a 6% performance improvement.

 

Figure 8. CUDA API summary of migrated SYCL code after optimization on an NVIDIA A100 GPU

Figure 9. Performance comparison of optimized SYCL code against CUDA code on an NVIDIA A100 GPU

Performance Results on Intel® Data Center GPU Max Series 1550: Significant Improvement

The migrated SYCL code was compiled with the Intel oneAPI DPC++/C++ Compiler to generate a binary that runs on Intel Data Center GPU Max Series 1550.

The migrated SYCL code of the 2D Poisson Equation Solver performed approximately 1.9x faster on Intel Data Center GPU Max Series 1550 than the CUDA code on an NVIDIA A100 GPU.

Figure 10. Performance comparison of SYCL code on Intel Data Center GPU Max Series against CUDA code on an NVIDIA A100 GPU

 

Workload: 2D Poisson Equation Solver with two problem sizes

  • Point per block = 1024 x 1024, number of blocks = 32 x 32
  • Point per block = 2048 x 2048, number of blocks = 32 x 32

Hardware configuration:

  • Intel Data Center GPU Max Series with two stacks, suitable for HPC and AI workloads. This GPU contains a total of 8 slices, 128 Xe-cores, 1024 Xe Vector Engines, 128 ray tracing units, 8 hardware contexts, 8 HBM2e controllers, and 16 Xe Links.
  • Intel® Xeon® Platinum 8360Y CPU at 2.40 GHz with 72 physical cores, 256 GB of DDR4 memory at 3200 MT/s.
  • NVIDIA A100 GPU with 80 GB HBM2e, base clock: 1065 MHz connected to Intel Xeon Platinum 8360Y CPU.

Software configuration:

  • Operating system: Red Hat* Enterprise Linux* 8
  • Compilers: Intel oneAPI DPC++/C++ Compiler 2023.0.0, NVIDIA CUDA Compiler (NVCC) 12.0
  • Language and API: C, SYCL, CUDA C

Conclusion

Intel tools made migrating CUDA source code to SYCL easier, which helped IIT Goa overcome vendor lock-in for its 2D Poisson Equation Solver and maintain a single code base for different architectures. The NVIDIA Nsight Systems tool helped identify the cause of the performance regression in the SYCL binary running on an NVIDIA GPU. After this regression was fixed, the SYCL binary roughly matched the performance of the CUDA binary on the same NVIDIA A100 GPU. The SYCL binary on Intel Data Center GPU Max Series 1550 ran about 1.9x faster than the CUDA binary on an NVIDIA A100 GPU. The SYCL code was also functional on the AMD HIP back end, but its performance evaluation is still pending. Migration of the same code to the ARM back end is also planned.

Additional Resources

Download the Tools

Get the Code Samples (GitHub*)