Accelerating Molecular Modeling

An HPC workload achieved the best performance on an FPGA, even when using a high-level development workflow.

At a glance:

  • The Institute for Advanced Chemistry of Catalonia (CSIC) focuses on developing new techniques to advance the study of molecular structures.

  • The challenge with the iterative methodology for verifying the structure of molecules is the computational effort it requires in the verification phase. The scientists at CSIC turned to the Barcelona Supercomputing Center and its High Performance Computing team for help. The HPC team at BSC determined that an FPGA would be the best fit for the CSIC algorithm.

High performance computing (HPC) workloads typically default to targeting CPUs or GPUs. However, with powerful Intel® FPGA accelerators now available, this is changing. As demonstrated in this case study, an HPC workload can achieve the best performance on an FPGA, even when using a high-level development workflow. This case study explains the application's details, tool flow, and migration to the latest Intel Agilex® 7 FPGA technology. Other researchers can consider this information for their own HPC applications.

Introduction

Molecular modeling has become an increasingly important tool in modern scientific research. With the help of computational methods, scientists can create complex models of molecules and study their properties and behaviors. However, creating these models involves billions of calculations, and researchers must overcome computational challenges to realize their goals.

Molecular Modeling Challenges

The Institute for Advanced Chemistry of Catalonia (CSIC) focuses on developing new techniques to advance the study of molecular structures. Knowledge of molecular structures can be used in biomedicine to create new medicines and treatments. The techniques for developing computational molecular models and verifying their structure are well known and well understood.

The iterative methodology for verifying the structure of molecules involves the following steps:

 

  • Create a computational molecular model.
  • Generate its computational spectra.
  • Compare the computational spectra to the known analytic spectra of the molecule.
  • If the spectra match, the modeled structure is taken to be the structure of the molecule.
  • If they do not match, refine the computational model and generate new spectra.
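The loop described in the steps above can be sketched in a few lines. This is a toy illustration only, not the CSIC algorithm: the single-Gaussian "spectrum", the RMS mismatch measure, the step-halving refinement, and every function name here are assumptions made for the example.

```python
import math


def simulate_spectrum(peak_center, grid):
    """Toy 'computational spectrum': one Gaussian peak (hypothetical model)."""
    return [math.exp(-((x - peak_center) ** 2) / 0.5) for x in grid]


def mismatch(computed, measured):
    """Root-mean-square difference between two spectra on the same grid."""
    total = sum((c - m) ** 2 for c, m in zip(computed, measured))
    return math.sqrt(total / len(measured))


def verify_structure(measured, grid, initial_guess,
                     tolerance=1e-3, max_iterations=100):
    """Refine a one-parameter model until its spectrum matches the measured one."""
    guess, step = initial_guess, 0.5
    for iteration in range(1, max_iterations + 1):
        # Generate the computational spectrum and compare it to the reference.
        error = mismatch(simulate_spectrum(guess, grid), measured)
        if error < tolerance:
            return guess, iteration  # spectra match: accept the model
        # Refine: try one step either side and keep the better candidate,
        # otherwise shrink the step and try again.
        candidates = [guess - step, guess + step]
        best = min(candidates,
                   key=lambda g: mismatch(simulate_spectrum(g, grid), measured))
        if mismatch(simulate_spectrum(best, grid), measured) < error:
            guess = best
        else:
            step /= 2
    return guess, max_iterations


grid = [i * 0.1 for i in range(101)]
measured = simulate_spectrum(4.2, grid)  # stands in for the known spectrum
model, iterations = verify_structure(measured, grid, initial_guess=3.0)
print(f"accepted model parameter {model:.4f} after {iterations} iterations")
```

Each pass through the loop calls the spectrum generator several times, which is why the cost of generating and comparing spectra dominates the overall workflow.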

 

The challenge with this methodology is the computational effort required in the verification phase.

To analyze a simple molecule with 2 million data points using an Intel® Xeon® Gold processor (3.2 GHz, one core, one thread), the computation takes 9644.544 seconds, or about 2 hours and 41 minutes. This allows only about 3 or 4 iterations per working day, which significantly lengthens development time.
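The arithmetic behind that estimate can be checked quickly. The run time is the figure quoted above; the 8-hour and 10-hour working days are assumptions chosen for illustration, since the case study does not define a working day.

```python
KERNEL_SECONDS = 9644.544  # single-core CPU run time quoted in the case study


def format_hms(seconds):
    """Format a duration in seconds as hours, minutes, and seconds."""
    hours, remainder = divmod(round(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours} h {minutes:02d} min {secs:02d} s"


def iterations_per_day(run_seconds, workday_hours):
    """Number of back-to-back runs that fit in a working day."""
    return workday_hours * 3600 / run_seconds


print(format_hms(KERNEL_SECONDS))
for hours in (8, 10):  # assumed working-day lengths
    runs = iterations_per_day(KERNEL_SECONDS, hours)
    print(f"{hours}-hour day: {runs:.2f} runs")
```

An 8-hour day fits just under 3 runs and a 10-hour day fewer than 4, which agrees with the "3 or 4 iterations" estimate.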

Since this was clearly a high performance computing problem, the scientists at CSIC turned to the Barcelona Supercomputing Center (BSC) for help.

HPC: Perfect for FPGAs

BSC is well-known for its world-class computing resources, particularly its heterogeneous data center architecture equipped with CPUs, GPUs, and FPGAs. Since they knew a CPU was too slow, they sought to determine whether a GPU or FPGA would better suit their needs.

The team at BSC examined the algorithm, which was written as an OpenCL kernel.

Key Algorithm Features:

  • The algorithm uses both single-precision and double-precision floats, which are efficiently supported by GPUs and FPGAs.
  • There are nested for-loops with an upper limit or potential upper limit of N, where N is 2 million.
  • Using a GPU would require completely unrolling all the for-loops, creating huge amounts of hardware, requiring a large GPU, and resulting in high power consumption.

Would using an FPGA be a more efficient implementation option? Would using FPGAs provide a challenge for developing the workload?

FPGA Benefits in HPC

FPGAs can address workloads that GPUs are not optimized for. Although FPGAs use custom hardware to implement algorithms, today’s options include high-level tool flows like oneAPI that bring FPGA performance to software developers.

The unique capabilities of FPGAs are:

 

  • Performance: FPGAs have a fully flexible architecture that can best fit the algorithm. This means that the algorithm does not need to be adapted to the fixed architecture of CPUs and GPUs. Incoming data can be processed directly from memory without using the CPU.
  • Programmability: FPGA workloads can be changed on-the-fly and updated with the latest algorithmic developments. They can run workloads in parallel, and large workloads can be spread out over multiple FPGAs using their rich I/O resources.
  • Productivity: FPGAs can increase productivity through the Open FPGA Stack (OFS), which makes it easy to install OFS cards in existing servers and configure the system.
  • Power: Power consumption is extremely critical. FPGAs can execute in fewer clock cycles than CPUs and GPUs and at lower clock frequencies, resulting in lower power consumption and lower costs.
  • Price: There are many options for FPGA acceleration cards. Installing FPGA cards to boost performance can be more cost-effective than renewing existing solutions.

 

The HPC team at BSC determined that an FPGA would be the best fit for this algorithm.

Legacy Accelerators

The HPC acceleration team at BSC already had access to two Intel® Programmable Acceleration Card (Intel® PAC) solutions:

  • Intel PAC with Intel Arria 10 GX FPGA
  • Intel® FPGA PAC D5005, based on Intel® Stratix® devices

 

 

They used the existing OpenCL algorithm and calculated the computational spectra on the Intel PACs.

The results are captured in the table below and can be summarized as follows:

 

  • The initial results generated on the Intel PAC with Intel Arria 10 GX FPGA demonstrated an acceleration of 17.8x over the CPU results, reducing the kernel execution time from 9644.544 seconds to 540.457 seconds.
  • By then taking advantage of the FPGA's adaptable architecture, BSC was able to convert the 64-bit double-precision floating-point accumulator into a 40-bit integer data type, with an acceptable accuracy loss. Replacing floating-point calculations with arbitrary-precision data types is unique to an FPGA and further reduced the processing time to 274.02 seconds.
  • Repeating this process with the Intel® FPGA PAC D5005, based on Intel® Stratix® devices, further reduced the processing time to 81 seconds.

While the FPGA results are impressive, they were created using two older generation FPGA technologies and used OpenCL, which was first introduced 16 years ago. The team at BSC wondered if they could achieve even better performance by using the latest tools and silicon.

A Flexible Composable Solution for the Modern Data Center

Modern Silicon: Intel Agilex® FPGAs

Intel’s latest FPGA technology is the Intel Agilex 7 FPGA. Intel Agilex 7 F-Series is designed to deliver high-performance computing in various applications. Its advanced architecture combines the benefits of FPGA and CPU, which enables it to deliver high throughput and low-latency performance.

Another benefit of Intel Agilex 7 F-Series is its power efficiency. It is designed with advanced power management features that reduce power consumption while maintaining high performance. The FPGA fabric in Intel Agilex 7 F-Series is built using 10 nm process technology, which enables it to deliver high-performance computing at low power. This power efficiency is crucial in applications that require high-performance computing while minimizing power consumption, in situations such as data centers, edge computing, and autonomous vehicles.

Security is also a critical concern in any modern data center, and Intel Agilex 7 F-Series addresses this concern by providing advanced security features. It is designed with an embedded security subsystem that provides secure boot and runtime security. It also has a built-in hardware root of trust that ensures the authenticity of the system. These security features are essential in applications that store sensitive data, such as financial institutions, government agencies, and healthcare providers.

Modern Toolchain: Intel® oneAPI Base Toolkit

Intel® oneAPI Base Toolkit (Base Kit) is a software development toolkit designed to simplify the process of creating high-performance, cross-architecture applications. It is built on SYCL and can work across a range of processors, including CPUs, GPUs, FPGAs, and AI accelerators.

SYCL (pronounced “sickle”) is an open standard, maintained by The Khronos Group. It is a royalty-free, cross-platform abstraction layer that allows developers to write code for heterogeneous processors using ISO C++. Both host and kernel code can be contained in the same source file.

One of the main benefits of oneAPI is that it simplifies the development process. With Intel oneAPI Base Toolkit (Base Kit) developers can create applications that run on different architectures without learning different programming languages. This means that developers can write code once and run it on different processors, saving a lot of time and effort.

The cross-architecture compatibility of oneAPI means that developers can future-proof their applications, knowing that they will run on a range of processors in the future. This can be particularly important for developers who want to create applications that can run on different types of hardware. This makes it ideal for a facility like BSC.

BSC Had Questions

Using Intel Agilex 7 devices and oneAPI could further improve the efficiency of molecular modeling, but these changes raised two questions:

The OpenCL source would need to migrate to SYCL. What challenges would that create?

BSC could acquire Intel Agilex 7 FPGA devices, but could they easily implement an FPGA acceleration card?

Code Migration Proved Easy

The investigation by BSC highlighted two aspects of SYCL that make the code migration from OpenCL easy:

 

  • Migrating the Kernel Code: SYCL is designed to work seamlessly with other C++ libraries and frameworks, making it easy to integrate into existing C++ codebases.
  • Updating the Host code: SYCL provides a high-level API that abstracts away many of the details of parallel programming. This makes it much easier for developers to get started with heterogeneous programming.

 

Migrating the Kernel Code

The OpenCL kernel code was migrated with no functional changes. The only changes were to address SYCL syntax:

 

  • The OpenCL kernel declaration was rewritten as a SYCL lambda function.
  • Buffer accessors were added for each kernel argument.

 

The reasons for such minor changes are easily explained. The primary improvement SYCL brings to heterogeneous programming over OpenCL is the simpler but more robust interface specification between the host and the kernel. Because SYCL is built on C++17, other C++ based code can be easily integrated. Other than syntax changes to match the new SYCL interface model, most OpenCL kernel code needs little modification when moving to SYCL.

Updating the Host Code

The original OpenCL host code was greatly simplified because many details that must be explicitly managed in OpenCL are handled automatically by SYCL. The original code consisted of 566 lines, while the updated SYCL code only had 285 lines.

As expected, the host code, where the interface between the host and kernel is specified, was the most impacted by the migration. However, since SYCL has a much more robust interface specification and manager, many of the tasks required in OpenCL are no longer necessary. As a result, the largest update is the removal of declarations and variables that are no longer required.

Acceleration Cards from BittWare

Developing any acceleration card takes time and effort. It's not just about adding components to a board. The FPGA device requires all the supporting interface and communication logic around the workload to communicate with the CPU.

Today’s FPGAs significantly raise the performance requirements for building a suitable PCIe card, from signal integrity (including DDR5 on the latest devices) to thermal management. The need to build a suite of robust tools for tasks like environmental monitoring or card-level security has made investing in a card design a significantly higher risk than it was just a few years ago.

To solve this development challenge, BSC turned to Intel Titanium partner BittWare, part of Molex.

BittWare designs and manufactures enterprise-class acceleration products using Intel FPGAs. They were the manufacturer of the original Intel PAC cards at BSC, but today offer PCIe cards with the latest Intel Agilex 7 devices. Customers look to the strength of BittWare (and its parent company Molex) as a significant risk-reducing factor versus attempting to build their own card.

These high-performance programmable accelerators enable customers to quickly develop and deploy Intel FPGA solutions with low risk.

BittWare sees several broad market areas: computing, networking, and storage. Many workloads are well-suited to FPGA, such as natural language recognition, recommendation engines, network monitors, inference, secure communication, analytics, compression, and search, among others. BittWare offers many products based on Intel FPGAs, which support oneAPI in addition to traditional RTL.

BSC could solve their development problem by buying an off-the-shelf BittWare IA-840f card, which uses an Intel Agilex 7 A027 device.

Results: Above and Beyond

The results of using Intel oneAPI Base Toolkit (Base Kit) and the BittWare IA-840f acceleration card surpassed expectations.

 

  • The source code migrated easily and seamlessly.
  • The BittWare acceleration card plugged in and worked with oneAPI.

Improved Performance

Using the latest tools and silicon, BSC was able to regenerate the earlier results and add them to the table below. The algorithm executed in 61 seconds on the new platform and in 41 seconds with the customized accumulator, which is a 233x improvement over the CPU and about a 13x improvement over the initial OpenCL solution on the Intel Arria 10 FPGA.
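The speedups quoted in this case study can be cross-checked from the reported run times. The sketch below uses only figures from the text; the row labels are paraphrased, and the text does not say which accumulator variant produced the 81-second D5005 result. With the rounded 41-second figure the ratio comes out near 235x, slightly above the quoted 233x, which would be consistent with an unrounded time of roughly 41.4 seconds. That last point is an inference, not a number from the source.

```python
CPU_SECONDS = 9644.544  # single-core Intel Xeon Gold baseline from the text

# Kernel run times in seconds, as quoted in the case study.
reported = {
    "Arria 10 GX, OpenCL, double accumulator": 540.457,
    "Arria 10 GX, OpenCL, 40-bit accumulator": 274.02,
    "PAC D5005, OpenCL": 81.0,
    "Agilex 7, oneAPI, double accumulator": 61.0,
    "Agilex 7, oneAPI, 40-bit accumulator": 41.0,
}


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline run time to accelerated run time."""
    return baseline_seconds / accelerated_seconds


for name, seconds in reported.items():
    ratio = speedup(CPU_SECONDS, seconds)
    print(f"{name:42s} {seconds:9.3f} s  {ratio:6.1f}x vs CPU")

arria_to_agilex = speedup(540.457, 41.0)
print(f"Agilex 7 (40-bit) vs initial Arria 10 result: {arria_to_agilex:.1f}x")
```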

Beyond Raw Performance

Unlike CPUs or GPUs, FPGAs do not have a fixed architecture that constrains developers. This adaptability allows developers to think outside the box. BSC took advantage of this feature and improved the algorithm's performance by reducing the accuracy of the accumulation.

Double-precision accuracy is achieved through iterative calculations that cost clock cycles, but an FPGA is not limited to standard C++/SYCL/OpenCL data types. It can create a data type of any arbitrary width. After analyzing the data ranges, BSC determined that the double-precision floating-point accumulator could be replaced with a 40-bit integer accumulator without any loss in accuracy of the results.
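To make the idea concrete, here is a minimal software emulation of accumulating into a 40-bit integer instead of a double. It is a sketch under stated assumptions, not BSC's kernel: the case study gives only the 40-bit width, so the fixed-point scaling (`FRAC_BITS = 24`), the input range, and all function names are hypothetical.

```python
import math
import random

ACC_BITS = 40   # accumulator width given in the case study
FRAC_BITS = 24  # assumed fixed-point scaling; not stated in the source


def to_fixed(x):
    """Quantize a float to an integer count of 2**-FRAC_BITS steps."""
    return round(x * (1 << FRAC_BITS))


def wrap_signed(value, bits=ACC_BITS):
    """Wrap an integer to signed two's complement of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def accumulate_fixed(samples):
    """Sum samples in a 40-bit integer accumulator, then convert to float."""
    acc = 0
    for x in samples:
        acc = wrap_signed(acc + to_fixed(x))
    return acc / (1 << FRAC_BITS)


random.seed(0)
samples = [random.uniform(0.0, 1.0) for _ in range(10_000)]  # assumed data range
reference = math.fsum(samples)      # accurate double-precision sum
fixed = accumulate_fixed(samples)   # emulated 40-bit accumulator
max_error = len(samples) * 2.0 ** -(FRAC_BITS + 1)  # half a step per sample
print(f"double: {reference:.9f}")
print(f"40-bit: {fixed:.9f}")
print(f"difference {abs(fixed - reference):.2e}, bound {max_error:.2e}")
```

The choice of scaling is where the data-range analysis matters: the integer part must be wide enough that the running sum never wraps, and the fractional part determines how much rounding error each sample can contribute.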

This allows developers to improve their algorithm, rather than simply rely on whatever raw performance the silicon can provide. After updating the code and running oneAPI again to create a new implementation, the verification time was reduced to only 41 seconds, providing a speedup of 233x over the original CPU.

Conclusion

A growing number of case studies in HPC are showing that the latest Intel Agilex FPGA solutions can compete with other architectures, not only in workload performance but also in ease of development. Furthermore, off-the-shelf PCIe cards are available for evaluating performance or building compute clusters, at a time when demand for heterogeneous solutions in HPC is expected to grow.
