The seismic modeling application’s single code base can now run seamlessly across CPUs and GPUs. Performance improved by 1.75x in the move from CUDA* on an NVIDIA* A100 GPU to SYCL* on an Intel® Data Center GPU Max Series.
Summary
An India-based premier research and development (R&D) organization used tools in the Intel® oneAPI Base Toolkit to free itself from vendor hardware lock-in by migrating its open source seismic modeling application from CUDA to SYCL*. As a result, application performance improved by 1.75x on Intel® Data Center GPU Max Series 1550 when compared to NVIDIA A100* platform performance.
Introduction
C-DAC (Center for Development of Advanced Computing) is the premier R&D organization of India’s Ministry of Electronics and Information Technology, working in IT, electronics, and associated areas. Created in 1987, its research spans multiple industries and domains such as HPC, cloud computing, embedded systems, cybersecurity, bioinformatics, geomatics, and quantum computing. In the realm of geophysical exploration, C-DAC has developed an open source seismic modeling application: SeisAcoMod2D. It models acoustic wave propagation from multiple source locations through a 2D subsurface earth model using finite difference time-domain modeling.
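For readers unfamiliar with finite difference time-domain modeling, the following is a textbook sketch of the idea, not a description of SeisAcoMod2D's actual scheme. The source does not state the equation form, order of accuracy, or boundary treatment the application uses, so the constant-density wave equation, the second-order stencil, and the symbols $p$, $v$, $s$, $h$, and $\Delta t$ are all assumptions made for illustration.

$$
\frac{\partial^2 p}{\partial t^2} = v^2(x,z)\left(\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial z^2}\right) + s(x,z,t)
$$

Replacing each second derivative with a central difference on a grid of spacing $h$ and time step $\Delta t$ gives an explicit update for the pressure at grid point $(i,j)$ and time level $n+1$:

$$
p^{n+1}_{i,j} = 2p^{n}_{i,j} - p^{n-1}_{i,j} + \left(\frac{v_{i,j}\,\Delta t}{h}\right)^{2}\left(p^{n}_{i+1,j} + p^{n}_{i-1,j} + p^{n}_{i,j+1} + p^{n}_{i,j-1} - 4p^{n}_{i,j}\right) + \Delta t^{2}\, s^{n}_{i,j}
$$

For this particular scheme, stability requires $v_{\max}\,\Delta t / h \le 1/\sqrt{2}$. Every grid point is updated independently from values at the previous two time levels, which is why this class of method maps well onto GPUs.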
Challenge: Vendor Hardware Lock-In
Driven by their high computational throughput and energy efficiency, GPUs have been rapidly adopted as computing engines for high-performance computing (HPC) applications in recent years. The growing presence of heterogeneous architectures that combine general-purpose multicore platforms with accelerators has led to the redesign of codes for parallel applications.
SeisAcoMod2D is developed in C to efficiently use the multicore CPU architecture and in CUDA C to make use of NVIDIA GPU architecture. The existing CUDA C program cannot run on Intel GPUs (or on any other vendor’s GPUs, for that matter). This limits architecture choice, creates vendor lock-in, and forces developers to maintain separate code bases for CPU and GPU architectures.
Solution: Build a Single Code Base Using Intel® Tools
Intel® tools enable applications written in a single language to be ported to, and optimized for, multiple single-architecture and heterogeneous platforms. Using a combination of optimized tools and libraries, the application’s native CUDA code was migrated to SYCL, enabling it to run seamlessly on Intel CPUs and GPUs.
The result: SeisAcoMod2D now consists of a single code base that runs on multiple architectures without losing performance. This was a perfect package for C-DAC: a single language with a boost in performance and no vendor lock-in.
Let’s walk through the steps, tools, and results.
Code Migration to SYCL
SeisAcoMod2D has both CPU (C++) source and GPU (CUDA) source. As a first step, the Intel® DPC++ Compatibility Tool (available as part of the Intel oneAPI Base Toolkit) was used to migrate the CUDA source to SYCL source. In this case, the Intel DPC++ Compatibility Tool achieved 100% migration in a very short time, which completed the functional port of the seismic modeling application. Figure 1 and Figure 2 show a snippet of the original CUDA source code and the corresponding migrated SYCL source code.
Because the original code used multiple CUDA streams with asynchronous calls, the migrated code needed either appropriately placed barrier/wait calls or a single SYCL queue to maintain data consistency. Incorporating these changes resolved the correctness issue. The changes are shown in Figure 3 and Figure 4.
Intel DPC++ Compatibility Tool migration from CUDA streams to SYCL queues:
User modification of multiple SYCL queues to a single SYCL queue:
The CUDA code of our open source seismic modeling application, SeisAcoMod2D, was easily migrated to SYCL using SYCLomatic. The migrated code runs efficiently on Intel Data Center GPU Max Series and achieves competitive performance compared to currently available GPU solutions. As we look to the future, the combination of Intel® Xeon® CPU Max Series with high-bandwidth memory plus Intel Data Center GPU Max Series presents us with a seamless upgrade path, accelerating our applications without the need for code changes, thanks to using Intel toolkits.
Code Optimization
As a next step, Intel® VTune™ Profiler was used to profile the kernels running on the GPU and to identify bottlenecks and tuning opportunities. Intel VTune Profiler supports the GPU Offload and GPU Compute/Media Hotspots analyses, which help identify the most time-consuming GPU kernels and determine whether the application is CPU- or GPU-bound.
Figure 5 shows the GPU Offload analysis result from the SYCL binary of SeisAcoMod2D with kernels running on an Intel GPU. As highlighted, the memset call that handles the memory transfer from host to device takes a considerable amount of time; memset uses the copy engine in the GPU. It was replaced with a fill-function call, which serves the same purpose but runs on a compute engine and is therefore faster. Figure 6 shows the result: the time taken for the memory transfer was drastically reduced.
Performance Results
oneAPI with an Intel GPU (Intel Data Center GPU Max Series) sped up C-DAC’s seismic modeling application by 7x compared to the CPU baseline and by approximately 1.75x compared to the NVIDIA A100 platform.
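As a rough consistency check, the two reported ratios can be combined. This is an inference, not a measured result from the source: it assumes both speedups refer to the same workload and the same CPU baseline, and the runtimes $T_{\text{CPU}}$, $T_{\text{A100}}$, and $T_{\text{Max}}$ are symbols introduced here for illustration.

$$
\frac{T_{\text{CPU}}}{T_{\text{Max}}} \approx 7, \qquad \frac{T_{\text{A100}}}{T_{\text{Max}}} \approx 1.75 \quad\Longrightarrow\quad \frac{T_{\text{CPU}}}{T_{\text{A100}}} = \frac{T_{\text{CPU}}/T_{\text{Max}}}{T_{\text{A100}}/T_{\text{Max}}} \approx \frac{7}{1.75} = 4
$$

Under those assumptions, the CUDA version on the NVIDIA A100 would be roughly 4x faster than the CPU baseline.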
The CPU thread reads and moves data related to seismic source locations to GPU memory and calls the GPU kernels for computation. The GPU then computes the wavefield propagation forward in time. After finalization of the time-iteration loop, the CPU thread copies the computed synthetic seismogram from GPU memory to CPU memory and writes it to the file system.
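The workflow above can be sketched in plain host-side C++. This is a minimal illustration under stated assumptions, not SeisAcoMod2D's code: the names `Grid`, `step`, and `model_shot` are hypothetical, the stencil is a textbook second-order finite difference update, the grid edges are simply held at zero (no absorbing boundary), and receivers are assumed to sit on one grid row. Comments mark where the real application would move data between CPU and GPU memory.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types and names for illustration only.
struct Grid {
    std::size_t nx, nz;  // number of grid points in x and z
    float h;             // grid spacing in meters
    float dt;            // time step in seconds
};

// One explicit time step over the interior points (the "GPU kernel" role).
// On entry prev holds the wavefield at step n-1; on exit it holds step n+1.
void step(const Grid& g, const std::vector<float>& vel,
          const std::vector<float>& cur, std::vector<float>& prev) {
    for (std::size_t j = 1; j + 1 < g.nz; ++j) {
        for (std::size_t i = 1; i + 1 < g.nx; ++i) {
            const std::size_t k = j * g.nx + i;
            const float c = vel[k] * g.dt / g.h;  // local Courant number
            const float lap = cur[k - 1] + cur[k + 1] +
                              cur[k - g.nx] + cur[k + g.nx] - 4.0f * cur[k];
            prev[k] = 2.0f * cur[k] - prev[k] + c * c * lap;
        }
    }
}

// Model one shot: inject the source wavelet at (sx, sz), propagate for
// src.size() time steps, and record the row at depth index rz each step.
// Returns a seismogram laid out as nt rows of nx samples.
std::vector<float> model_shot(const Grid& g, const std::vector<float>& vel,
                              const std::vector<float>& src,
                              std::size_t sx, std::size_t sz, std::size_t rz) {
    const std::size_t nt = src.size();
    // In the real application these wavefields would live in GPU memory,
    // and the source data would be copied host-to-device before the loop.
    std::vector<float> prev(g.nx * g.nz, 0.0f);
    std::vector<float> cur(g.nx * g.nz, 0.0f);
    std::vector<float> seis(nt * g.nx, 0.0f);

    for (std::size_t n = 0; n < nt; ++n) {
        step(g, vel, cur, prev);                       // prev now holds step n+1
        prev[sz * g.nx + sx] += g.dt * g.dt * src[n];  // source injection
        std::swap(prev, cur);                          // rotate time levels
        for (std::size_t i = 0; i < g.nx; ++i) {
            seis[n * g.nx + i] = cur[rz * g.nx + i];   // record receivers
        }
    }
    // In the real application the seismogram would be copied device-to-host
    // here and then written to the file system.
    return seis;
}
```

Calling `model_shot` once per source location reproduces the multi-shot pattern described above. Swapping the two wavefield buffers each step avoids allocating a third time level.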
Workload: seismic workload from C-DAC
Hardware configuration:
- Intel Data Center GPU Max Series with two stacks suitable for HPC and AI workloads. This GPU contains a total of 8 slices, 128 Xe-cores, 1024 vector engines with Xe architecture, 128 ray tracing units, 8 hardware contexts, 8 HBM2e controllers, and 16 Intel® Xe Links.
- Intel Xeon Platinum 8360Y CPU at 2.40 GHz with 72 physical cores and 256 GB of DDR4 memory at 3200 MT/s.
- Intel Xeon CPU Max Series 9480 at 1.90 GHz with 112 physical cores, 64 GB of HBM memory, and 256 GB of DDR5 memory at 4800 MT/s.
- NVIDIA A100 GPU with 80 GB of HBM2e and a base clock of 1065 MHz, connected to an Intel Xeon Platinum 8360Y CPU.
Software configuration:
- Operating system: Red Hat* Enterprise Linux* 8
- Compilers: Intel® C++ Compiler 2023.0, nvcc 11.7
- Language and API: C, SYCL, CUDA C, OpenMP*
- Testing date: June 20, 2023
Conclusion
Using Intel tools made it easier to migrate CUDA source code to SYCL, which helped C-DAC overcome vendor lock-in for its optimized SeisAcoMod2D seismic modeling application and maintain a single code base for different architectures. Intel VTune Profiler helped to identify the bottlenecks in the SYCL binary running on an Intel GPU. With these bottlenecks fixed, the SYCL code on an Intel GPU ran 7x faster than the CPU baseline and 1.75x faster than the native CUDA code on an NVIDIA A100 GPU.
Additional Resources
Download the Tools
- Get the full complement of Intel tools in the Intel oneAPI Base Toolkit.
Get the Code Samples (GitHub*)