C-DAC Achieves 1.75x Performance Improvement on Seismic Code Migration from CUDA* to SYCL*

The seismic modeling application’s single code base can now run seamlessly across CPUs and GPUs. The SYCL* version on an Intel® Data Center GPU Max Series ran 1.75x faster than the CUDA* version on an NVIDIA* A100 GPU.

Summary

An India-based premier research and development (R&D) organization used tools in the Intel® oneAPI Base Toolkit to free itself from vendor hardware lock-in by migrating its open source seismic modeling application from CUDA to SYCL*. As a result, application performance improved by 1.75x on Intel® Data Center GPU Max Series 1550 when compared to NVIDIA A100* platform performance.

Introduction

C-DAC (Centre for Development of Advanced Computing) is the premier R&D organization of India’s Ministry of Electronics and Information Technology for IT, electronics, and associated areas. Created in 1987, C-DAC conducts research spanning multiple industries and domains such as HPC, cloud computing, embedded systems, cybersecurity, bioinformatics, geomatics, and quantum computing. In the realm of geophysical exploration, C-DAC has developed an open source seismic modeling application: SeisAcoMod2D. It simulates acoustic wave propagation from multiple source locations through a 2D subsurface earth model using finite-difference time-domain modeling.
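As a rough illustration of finite-difference time-domain modeling, the sketch below advances a 2D pressure field by one time step using a second-order stencil in time and space. It is not taken from SeisAcoMod2D: the `Grid` type, the `step` function, the constant velocity, and the fixed zero-pressure boundaries are hypothetical simplifications chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical container for three time levels of a 2D pressure field.
struct Grid {
    std::size_t nx, nz;
    std::vector<float> prev, curr, next;

    Grid(std::size_t nx_, std::size_t nz_)
        : nx(nx_), nz(nz_),
          prev(nx_ * nz_, 0.0f), curr(nx_ * nz_, 0.0f), next(nx_ * nz_, 0.0f) {}

    // Row-major index of grid point (ix, iz).
    std::size_t idx(std::size_t ix, std::size_t iz) const { return iz * nx + ix; }
};

// Advance the field by one time step.
// courant = velocity * dt / dx (assumed constant here); the scheme is stable
// in 2D when courant <= 1/sqrt(2).
void step(Grid& g, float courant) {
    assert(g.nx >= 3 && g.nz >= 3);
    const float c2 = courant * courant;
    // Update interior points only; edge values are never written and stay 0.
    for (std::size_t iz = 1; iz + 1 < g.nz; ++iz) {
        for (std::size_t ix = 1; ix + 1 < g.nx; ++ix) {
            const std::size_t i = g.idx(ix, iz);
            const float laplacian = g.curr[i - 1] + g.curr[i + 1] +
                                    g.curr[i - g.nx] + g.curr[i + g.nx] -
                                    4.0f * g.curr[i];
            g.next[i] = 2.0f * g.curr[i] - g.prev[i] + c2 * laplacian;
        }
    }
    // Rotate time levels: curr becomes prev, next becomes curr.
    std::swap(g.prev, g.curr);
    std::swap(g.curr, g.next);
}
```

In a full modeling run, a source term would be added to `curr` at the source location before each call to `step`, and the values at receiver locations would be recorded after each step to build the synthetic seismogram.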

Challenge: Vendor Hardware Lock-In

Because of their high computational throughput and energy efficiency, GPUs have been adopted quickly as computing engines for high-performance computing (HPC) applications in recent years. The growing prevalence of heterogeneous architectures that combine general-purpose multicore platforms with accelerators has led to the redesign of codes for parallel applications.

SeisAcoMod2D is developed in C to efficiently use the multicore CPU architecture and in CUDA C to make use of NVIDIA GPU architecture. The existing CUDA C program cannot run on Intel GPUs (or on any other vendor’s GPUs, for that matter). This limits architecture choice, creates vendor lock-in, and forces developers to maintain separate code bases for CPU and GPU architectures.

Solution: Build a Single Code Base Using Intel® Tools

Intel® tools enable single-language, cross-architecture applications to be ported to (and optimized for) both single-architecture and heterogeneous platforms. Using a combination of optimized tools and libraries, the application’s native CUDA code was migrated to SYCL, enabling it to run seamlessly on Intel CPUs and GPUs.

The result: SeisAcoMod2D now consists of a single code base that runs on multiple architectures without losing performance. This was a perfect package for C-DAC: a single language with a boost in performance and no vendor lock-in.

Let’s walk through the steps, tools, and results.

Code Migration to SYCL

SeisAcoMod2D has both CPU (C++) source and GPU (CUDA) source. As a first step, the Intel® DPC++ Compatibility Tool (available as part of the Intel oneAPI Base Toolkit) was used to migrate CUDA source to SYCL source. In this case, the Intel DPC++ Compatibility Tool was able to achieve 100% migration in a very short time. This completed the functional port of the seismic modeling application. Figure 1 and Figure 2 show a snippet of the CUDA source code and the corresponding migrated SYCL source code.

Figure 1. Snippet of CUDA source before migration

Figure 2. Snippet of SYCL source after migration

Because the original code used multiple CUDA streams with asynchronous calls, the migrated code needed either appropriately placed barrier/wait calls or a single SYCL queue to maintain data consistency. Making these changes resolved the correctness issue. The changes are shown in Figure 3 and Figure 4.

Intel DPC++ Compatibility Tool migration from CUDA streams to SYCL queues:

User modification of multiple SYCL queues to a single SYCL queue:

Figure 4. Single SYCL queue creation

Our open source seismic modeling application, SeisAcoMod2D, CUDA code was migrated to SYCL using SYCLomatic easily. The migrated code efficiently runs on Intel Data Center GPU Max Series and achieves competitive performance compared to currently available GPU solutions. As we look to the future, the combination of Intel® Xeon® CPU Max Series with high-bandwidth memory plus Intel Data Center GPU Max Series presents us with a seamless upgrade path, accelerating our applications without the need for code changes, thanks to using Intel toolkits.

– C-DAC, India

Code Optimization

As a next step, Intel® VTune™ Profiler was used to profile the kernels running on a GPU to identify bottlenecks and tuning opportunities. Intel VTune Profiler supports the GPU Offload and GPU Compute/Media Hotspots analyses, which help identify the most time-consuming GPU kernels and determine whether the application is CPU- or GPU-bound.

Figure 5 shows the GPU offload analysis result from the SYCL binary of SeisAcoMod2D with kernels running on an Intel GPU. As highlighted, the memset call, which shows up as a host-to-device memory transfer, stands out as time-consuming because memset uses the GPU’s copy engine. It was replaced with a fill function call, which serves the same purpose but runs on a compute engine and is faster. Figure 6 shows the result: the time taken for memory transfer dropped drastically.

Figure 5. GPU offload analysis of the SYCL binary with a memset function call.

Figure 6. GPU offload analysis of the SYCL binary with a fill function call.

Performance Results

With oneAPI on an Intel GPU (Intel Data Center GPU Max Series), C-DAC’s seismic modeling application ran 7x faster than the CPU baseline and approximately 1.75x faster than on the NVIDIA A100 platform.
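The two speedups can be related with simple arithmetic. The derivation below is a sketch that assumes both figures were measured against the same Intel GPU run time on the same workload; the symbols T and S are introduced here for illustration only, and the implied A100-to-CPU ratio is an inference, not a measured result.

```latex
\[
S_{\mathrm{CPU}} = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{Max}}} \approx 7,
\qquad
S_{\mathrm{A100}} = \frac{T_{\mathrm{A100}}}{T_{\mathrm{Max}}} \approx 1.75
\quad\Longrightarrow\quad
\frac{T_{\mathrm{CPU}}}{T_{\mathrm{A100}}}
  = \frac{S_{\mathrm{CPU}}}{S_{\mathrm{A100}}}
  \approx \frac{7}{1.75} = 4
\]
```

Here T denotes the run time on each platform and S the speedup of the Intel Data Center GPU Max Series relative to that platform.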

The CPU thread reads and moves data related to seismic source locations to GPU memory and calls the GPU kernels for computation. The GPU then computes the wavefield propagation forward in time. After finalization of the time-iteration loop, the CPU thread copies the computed synthetic seismogram from GPU memory to CPU memory and writes it to the file system.

Figure 7. Seismic workload run time on Intel® Xeon® Platinum 8360Y CPU and Intel Data Center GPU Max Series.

Workload: seismic workload from C-DAC
Hardware configuration:

Intel Data Center GPU Max Series with two stacks suitable for HPC and AI workloads. This GPU contains a total of 8 slices, 128 Xᵉ-cores, 1024 vector engines with Xᵉ architecture, 128 ray tracing units, 8 hardware contexts, 8 HBM2e controllers, and 16 Intel® Xᵉ Links.
Intel Xeon Platinum 8360Y CPU 2.40 GHz having 72 physical cores, 256 GB of DDR4 memory at 3200 MT/s.
Intel Xeon CPU Max Series 9480, 1.90 GHz having 112 physical cores, 64 GB HBM memory, 256 GB of DDR5 memory at 4800 MT/s.
NVIDIA A100 GPU having 80 GB HBM2e, base clock: 1065 MHz connected to an Intel Xeon Platinum 8360Y CPU.

Software configuration:

Operating system: Red Hat* Enterprise Linux* 8
Compilers: Intel® C++ Compiler 2023.0, nvcc 11.7
Language and API: C, SYCL, CUDA C, OpenMP*
Testing date: June 20, 2023

Figure 8. Seismic workload run time on NVIDIA A100 GPU and Intel Data Center GPU Max Series.

Testing date: June 20, 2023

Hardware configuration:

Intel Data Center GPU Max Series with two stacks suitable for HPC and AI workloads. This GPU contains a total of 8 slices, 128 Xᵉ-cores, 1024 vector engines with Xᵉ architecture, 128 ray tracing units, 8 hardware contexts, 8 HBM2e controllers, and 16 Intel Xᵉ Links.
Intel Xeon Platinum 8360Y CPU at 2.40 GHz having 72 physical cores, 256 GB of DDR4 memory at 3200 MT/s.
Intel Xeon CPU Max Series 9480 at 1.90 GHz having 112 physical cores, 64 GB HBM memory, 256 GB of DDR5 memory at 4800 MT/s.
NVIDIA A100 GPUs having 80 GB HBM2e, base clock: 1065 MHz connected to an Intel Xeon Platinum 8360Y CPU.

Software configuration:

Operating system: Red Hat Enterprise Linux 8
Compilers: Intel C++ Compiler 2023.0, nvcc 11.7
Language and API: C, SYCL, CUDA C, OpenMP

Conclusion

Using Intel tools made it easier to migrate CUDA source code to SYCL, which helped C-DAC overcome vendor lock-in for its optimized SeisAcoMod2D seismic modeling application and maintain a single code base for different architectures. Intel VTune Profiler helped to identify the bottlenecks in the SYCL binary running on an Intel GPU. After these bottlenecks were fixed, the SYCL code on an Intel GPU ran 7x faster than the baseline CPU and 1.75x faster than native CUDA on an NVIDIA A100 GPU.

Additional Resources

Download the Tools

Get the full complement of Intel tools in the Intel oneAPI Base Toolkit.
Or download stand-alone versions:
- Intel DPC++ Compatibility Tool
- Intel VTune Profiler

Get the Code Samples (GitHub*)
