BIH Sets New Benchmark for Advanced Bioinformatics

A case study in genomic analysis demonstrates potential to make bioinformatics more insightful, efficient, and easy to scale.

At a glance:

  • The Berlin Institute of Health (BIH) at Charité is dedicated to improving healthcare through medical translation.

  • To find a new way to make genomic insights more accessible, Intel teamed up with a premier research institute, the BIH, and precision medicine-focused bioinformatics software developer Sentieon. This collaboration developed an optimized analytic pipeline that analyzed long-read next-generation sequencing data using standard data center CPUs from Intel.

Executive Summary

Intel forged a collaboration between innovators in bioinformatics to demonstrate that Intel® Xeon® Scalable processors can help make genomic analysis more scalable and accessible. The resulting proof-of-concept analytic pipeline, running on Dell PowerEdge Servers with Intel data center CPUs, showed stunning performance for processing genomic data.1 Leveraging optimizations for Intel® x86 architecture, the pipeline delivered a gene expression matrix from single-cell long-read sequencing data 14x faster than a standard workflow.1

The proof-of-concept genomic analysis pipeline running on 4th Generation Intel Xeon Scalable processors delivered a result 14x faster than an industry standard pipeline.1

Solving the problems of scalability and cost

Scaling genomic analysis applications has been limited by demanding computational workloads that generally require specialized HPC solutions. This has led to a high cost-per-sample as well as a significant carbon footprint. However, running these workloads on standard data center CPUs offers bioinformatics applications a new path to scalability and efficiency.

Proof of concept pipeline delivers exceptional speed

To find a new way to make genomic insights more accessible, Intel teamed up with a premier research institute, the Berlin Institute of Health at Charité (Charité BIH), and precision medicine-focused bioinformatics software developer Sentieon. This collaboration developed an optimized analytic pipeline that performed analysis on long-read next-generation sequencing (NGS) data using standard data center CPUs from Intel. The proof-of-concept pipeline built by Sentieon cut the standard time-to-result by more than 14x.1

In further testing, 4th Generation Intel Xeon Scalable processors showed a 21 percent advantage in speed over 3rd Generation Intel Xeon Scalable processors. This offers tremendous potential to bring economies of scale to bioinformatics as each successive generation of Intel® technology helps make genomic analysis software faster and more efficient.1

Achieving far more than speed

This test revealed more than raw power and performance. The highest degree of accuracy was maintained, and thanks to built-in features on Intel Xeon processors, this new approach to genomic analysis offers advancements in trust, scalability, and cost effectiveness.

New 4th Gen Intel Xeon Scalable processors increased the speed of genomic analysis by 21 percent compared to 3rd Gen Intel Xeon Scalable processors on Sentieon’s optimized workflow.1

Challenge: Deep Genomic Insights Have Been Hard to Scale

Our genome holds the key to many of the secrets about how our bodies work and why we get sick. Employing genomic analysis in diagnostics, treatment, and research offers opportunities to tailor treatments and improve results for patients. Studying genomic data can also improve population health and enable early detection and prevention measures that reduce healthcare costs.

Genomic analysis, however, is costly and resource-intensive to perform. Closed and specialized hardware systems have hampered scalability. Together with its global ecosystem of collaborators, Intel is advancing the performance, efficiency, and openness of bioinformatics applications to help make them more powerful, cost effective, and accessible.

The global bioinformatics market is estimated to reach $45.2 billion by 2030, growing at a CAGR of 11.9 percent from 2022-2030.2

NGS is changing genomics, but presents tradeoffs

NGS has introduced cutting-edge processes that dramatically speed up DNA or RNA sequencing. Typically, NGS scales throughput by running short-read sequencing (SRS) on bulk genomic data. An average gene transcript has thousands of base pairs (bp), but SRS reads only around 150 bp. This is enough data to identify specific genes and their expression that could be related to disease or an underlying health condition. Nonetheless, SRS leaves out much of what’s occurring inside the cell or organ, which can make it hard to detect splicing, breaks, fusions, or structural variants in the cell that can be disease-causing.

Pursuing a more precise picture of health

Long-read single-cell RNA sequencing (scRNA-seq), however, reads the majority of full transcripts in individual cells. This enables comparative analysis between cells within the same organ, as well as the identification of distinct cell populations, cell states, and rare cell types. The detailed information from long-read scRNA-seq gives deeper insights into gene expression dynamics at a single-cell level and can reveal more information about protein function and variances that might otherwise be missed.

The challenges of long-read scRNA-seq hold back scalability

Oxford Nanopore Technologies (ONT) has advanced the availability of long-read scRNA-seq with high sequencing throughput. ONT sequencers generate very large volumes of data, and analytics pipelines running ONT libraries often require dedicated hardware acceleration via a GPU or FPGA. This computational complexity has made long-read scRNA-seq prohibitively expensive and time-consuming. It also limits scalability in bioinformatics and precision medicine due to reliance on specialized, inflexible infrastructure and closed software.

Analyzing long-read scRNA-seq data also generates a significant carbon footprint as it requires computing resources that consume large amounts of energy. It can be hard for healthcare organizations to pursue carbon emission objectives, such as net zero compute, while also developing and using advanced bioinformatics.

Solution: Genomic Analysis Pipeline Optimized for Intel® Data Center CPUs

Bringing about the future of bioinformatics means overcoming the performance, efficiency, and scalability issues presented by long-read sequencing data. Running genomic analysis workloads on data center CPUs—as opposed to specialized high-performance computing (HPC) workstations—can reduce cost-per-sample through economies of scale and efficiency advantages.

Bringing innovators together to improve healthcare

Intel has collaborated with Charité BIH for years to help advance its bioinformatics initiatives. Intel saw that another of its collaborators, Sentieon, could offer Charité BIH cutting-edge software that could advance the capabilities of the Charité Clinical Cloud.

Intel introduced the teams, and Sentieon developed a new genomic analysis pipeline and workflow for long-read sequencing data. This workflow is an improvement over the recently released open-source solution (Sockeye) implemented by ONT, and optimizations available for Intel architecture helped amplify performance and efficiency.

Evaluating the power of Intel® Xeon® Scalable processors for advanced bioinformatics

The test sequencing was performed on Intel Xeon Scalable processors. No specialized hardware acceleration was employed. The Intel® CPUs employed for the proof of concept are those available in standard cloud instances across the globe.

Streamlined workflow, optimized pipeline

Sentieon designed the pipeline using the readily available Sockeye ONT pipeline as a starting point. The pipeline also combined the adapter scanning and barcode extraction stages into a single tool inside the Sentieon software. This removed a data processing bottleneck and eliminated the reliance on vsearch for adapter identification. ONT long reads are aligned using Sentieon’s minimap2, which delivered greatly increased performance after being optimized for Intel architecture.

Figure 1. The custom sequencing software pipeline developed by Sentieon can dramatically accelerate long-read scRNA-seq on standard cloud infrastructure.
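
To make the idea of a combined adapter-scanning and barcode-extraction stage concrete, here is a minimal Python sketch. It is not Sentieon's implementation, which the source does not describe in detail. The adapter sequence, the barcode and UMI lengths, the read layout, and every function name below are assumptions chosen for illustration, and the exact string match stands in for the error-tolerant matching a production tool would need.

```python
from typing import Iterable, Iterator, NamedTuple, Optional

# Hypothetical read layout, for illustration only:
# [leading sequence][adapter][cell barcode][UMI][cDNA insert]
ADAPTER = "CTACACGACGCTCTTCCGATCT"  # placeholder adapter sequence (assumption)
BARCODE_LEN = 16                    # assumed cell barcode length
UMI_LEN = 12                        # assumed UMI length


class TaggedRead(NamedTuple):
    barcode: str
    umi: str
    insert: str  # sequence that would be handed on to alignment


def scan_and_extract(seq: str) -> Optional[TaggedRead]:
    """Find the adapter and slice out barcode and UMI in one pass over the read."""
    start = seq.find(ADAPTER)  # exact match only; real reads contain errors
    if start < 0:
        return None  # no adapter found
    pos = start + len(ADAPTER)
    end = pos + BARCODE_LEN + UMI_LEN
    if end > len(seq):
        return None  # read ends before the full barcode and UMI
    barcode = seq[pos:pos + BARCODE_LEN]
    umi = seq[pos + BARCODE_LEN:end]
    return TaggedRead(barcode, umi, seq[end:])


def tag_reads(reads: Iterable[str]) -> Iterator[TaggedRead]:
    """Yield tagged reads one at a time so the next stage needs no intermediate file."""
    for seq in reads:
        tagged = scan_and_extract(seq)
        if tagged is not None:
            yield tagged
```

Because `tag_reads` is a generator, a downstream stage can consume each tagged read as soon as it is produced, which is one simple way to avoid writing the output of one stage to disk before the next one starts.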

Unique Molecular Identifiers (UMI) and barcode tasks such as extraction, deduplication, and correction present a major part of the workload challenge. The Sentieon workflow takes advantage of the power of Intel Xeon Scalable processors to run scanning, extraction, and alignment concurrently. This reduced read/write time and eliminated the need for intermediary files. Sentieon’s experience in UMI handling also enabled the company to write code that improves deduplication performance by removing dependency on external tools for UMI handling.
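
As a rough illustration of what UMI deduplication involves, the sketch below groups reads by cell barcode and gene, then collapses UMIs that differ by at most one base so that each original molecule is counted once. This is a generic, simplified approach and not Sentieon's code. The record layout, the function names, and the one-mismatch threshold are assumptions, and the greedy collapsing ignores read counts, which more careful methods take into account.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (cell barcode, gene, UMI); hypothetical layout


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_molecules(umis: Iterable[str], max_dist: int = 1) -> int:
    """Greedily collapse UMIs that lie within max_dist of an already kept UMI."""
    kept: List[str] = []
    for umi in sorted(set(umis)):  # sorted for a deterministic result
        is_duplicate = any(
            len(k) == len(umi) and hamming(k, umi) <= max_dist for k in kept
        )
        if not is_duplicate:
            kept.append(umi)
    return len(kept)


def expression_counts(records: Iterable[Record]) -> Dict[Tuple[str, str], int]:
    """Return deduplicated molecule counts keyed by (cell barcode, gene)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for barcode, gene, umi in records:
        groups[(barcode, gene)].append(umi)
    return {key: count_molecules(umis) for key, umis in groups.items()}
```

The resulting dictionary is a sparse form of a gene expression matrix: each key is one (cell, gene) entry and each value is the number of distinct molecules observed for it.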

The success of the initiative was enhanced by systems integrator SVA, which led the process of building the Dell PowerEdge Server to run Sentieon’s software pipeline and integrating it into Charité BIH’s private cloud.

Results: Dramatically Faster Times and the Promise of Efficiency and Scalability

Benchmarking the proof-of-concept analytic pipeline from Sentieon against a standard ONT workflow demonstrated its capability to deliver results in much less time. In successive tests with different ONT long-read scRNA-seq data sets, the Sentieon workflow running on 4th Gen Intel Xeon processors was 14x faster than the standard ONT Sockeye workflow.1 It was the fastest time to produce a gene expression matrix from long-read sequencing data that the researchers performing the testing had ever seen.

Figure 2. 4th Generation Intel® Xeon® Scalable processors delivered incredible speeds in genomic analysis on ONT data.1

Generational gains show potential for long-term scaling

The case study also tested how 3rd Gen Intel Xeon Scalable processors, available throughout the world in the cloud, performed on the streamlined workflow. Thanks to Sentieon’s optimizations for Intel server architecture, 3rd Gen Intel Xeon Scalable processors completed the genomic analysis task 11x faster than the benchmark HPC workstation.1

Furthermore, 4th Gen Intel Xeon processors demonstrated a 21 percent advantage over the previous generation.1 This shows the potential for bioinformatics applications to gain performance and efficiency every time workloads move to cloud instances with the latest generation of Intel data center CPUs. With Intel, such migrations are typically seamless and require little effort or expense.
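
The source does not say how the 21 percent figure was calculated. One reading that fits the reported numbers is sketched below: if the 11x and 14x speedups were measured against the same baseline (an assumption), the newer generation finishes the task in about 21 percent less time. The function names are illustrative only, and the inputs are the rounded figures quoted in the text.

```python
def runtime_reduction(speedup_old: float, speedup_new: float) -> float:
    """Fraction of runtime saved by the newer system, given speedups over one baseline."""
    # With baseline runtime T: runtime_old = T / speedup_old, runtime_new = T / speedup_new,
    # so runtime_new / runtime_old = speedup_old / speedup_new.
    return 1.0 - speedup_old / speedup_new


def throughput_gain(speedup_old: float, speedup_new: float) -> float:
    """Relative increase in work completed per unit time by the newer system."""
    return speedup_new / speedup_old - 1.0


reduction = runtime_reduction(11.0, 14.0)
gain = throughput_gain(11.0, 14.0)
print(f"runtime reduction: {reduction:.1%}")  # prints "runtime reduction: 21.4%"
print(f"throughput gain:   {gain:.1%}")       # prints "throughput gain:   27.3%"
```

The two numbers describe the same comparison from different angles, so it matters whether a quoted percentage refers to time saved or to throughput gained.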

“SVA was available throughout the project as the operator of the infrastructure, supporting communication between the various parties. In addition, SVA designed, built and operated the clinical cloud and supports the Charité team. All new workloads were integrated to the cluster with the help of SVA.”—Daniel Vois, head of sales, SVA Healthcare/Germany

The speed to launch the next generation of genomic analysis

Achieving this level of performance for long-read sequencing data analysis can enable hospitals and research institutions to get results at speeds that can change the way they work. What’s more, delivering genomic analysis over the cloud can allow practitioners to make greater use of NGS data.

Figure 3. Comparing the genomic analysis performance of Intel® Xeon® Scalable processors gen-over-gen reveals a significant advantage for the latest generation.1

Performance and efficiency for the future of healthcare: 4th Generation Intel Xeon Scalable processors

The latest generation of Intel Xeon Scalable processors offers a cost-effective server architecture that allows you to work with larger datasets with less latency.3 4 Highlights include:

 

  • DDR5 memory and PCIe 5.0 for increased memory and I/O bandwidth
  • Higher clock speeds and up to 60 cores to handle bigger tasks with fewer racks
  • Built-in accelerators that can speed up key workloads like AI and analytics
  • Up to 53% higher general-purpose compute performance than 3rd Generation Intel Xeon Scalable processors3 4

Figure 4. Sentieon’s optimized workflow for Intel® Xeon® Scalable processors showed potential to perform genomic analysis with scalable infrastructure and fewer resources.

Conclusion: Maximize Speed, Efficiency, Scalability, and Trust

“Omics” technologies are the future of healthcare. Enabling long-read sequencing in the Charité Clinical Cloud on standard hardware generates potential use cases that could have a huge impact. Genomic analysis can power precision medicine that offers personalized treatments. Screening populations of children can lead to early disease detection. Genomic data can allow virologists to see the impact a virus has on a molecular level.

Enabling fast and efficient genomics analysis on standard Intel data center CPUs offers a pathway to bring bioinformatics applications into the cloud. In a cloud environment powered by Intel technology, bioinformatics workloads can be easily migrated and accessed from virtually anywhere. What’s more, data centers present an opportunity to centralize security, sustainability, and regulatory compliance, which can eliminate major hurdles to launching an application that uses health data. Giving practitioners and researchers tools capable of inspiring their trust can empower everyone in the healthcare value chain to deliver better care.

Spotlight on The Berlin Institute of Health at Charité

The Berlin Institute of Health at Charité (Charité BIH) has a special mission to push the boundaries of medical translation to improve health for everyone. Rather than developing specialized, purpose-built solutions for analytics and bioinformatics, Charité BIH sees the future of healthcare in the cloud. At the heart of this effort is the Charité Clinical Cloud. It promises to offer health data platforms that bring hospitals, doctors, researchers and other specialists together while serving the entire healthcare value chain.

This focus on making advanced healthcare technology more available is what has driven the institute’s long-standing collaboration with Intel. Charité BIH and Intel have worked together to optimize workflows, pipelines, and data center infrastructure for healthcare research. Together, they have achieved breakthroughs such as identifying a method that SARS-CoV-2, the virus that causes COVID-19, uses to target the body’s cells for infection. The Charité Clinical Cloud enables bioinformaticians to scale dynamically, move workloads around, and customize applications. It also centralizes complex security and compliance measures. With offerings like this, Intel and Charité BIH want to maximize speed, performance, and patient experience quality, while bringing down cost and complexity.

About Sentieon

Sentieon develops highly optimized and accurate software and algorithms for bioinformatics applications, winning many precisionFDA awards and helping customers all over the world process their genomic data.

About Dell

Enhance both business agility and time to market with flexible and scalable Dell PowerEdge Servers. Offered in a variety of form factors—each designed to enhance security—Dell PowerEdge Servers can optimize your IT operations and simplify deployment and management.

About SVA

SVA System Vertrieb Alexander GmbH is one of the leading German system integrators. Founded in 1997 and based in Wiesbaden, Germany, the company has more than 2,700 employees at 27 branch offices all over Germany. The corporate objective of SVA is the combination of high-quality IT products from different vendors with the project know-how and flexibility of SVA to achieve optimal solutions for customers.

 
