AWS HPC Blog
Benchmarking the NVIDIA Clara Parabricks germline pipeline on AWS
This blog post was contributed by Ankit Sethia, PhD, and Timothy Harkins, PhD, at NVIDIA Parabricks, and Olivia Choudhury, PhD, Sujaya Srinivasan, and Aniket Deshpande at AWS.
This blog provides an overview of NVIDIA’s Clara Parabricks along with a guide on how to use Parabricks from AWS Marketplace. It focuses on germline analysis for whole genome and whole exome applications using GPU-accelerated bwa-mem and GATK’s HaplotypeCaller.
Introduction
Next generation sequencing (NGS) platforms have outperformed Moore’s Law when it comes to cost per genome, and short-read sequencing platforms can now readily sequence 96 to 384 whole human genomes in 1-2 days, generating terabytes of data per instrument run [1]. With this increased throughput, whole genome sequencing (WGS) of tens of thousands to hundreds of thousands of human genomes is now common as seen in many national programs to sequence their local populations [2].
Furthermore, WGS and whole exome sequencing (WES) are becoming very important to healthcare systems in addressing undiagnosed issues in neonatal intensive care units, monitoring cancer treatments, and diagnosing and identifying risk factors for complex disorders like autism or cardiovascular disease. As NGS throughput increases and the cost per sample falls, data volumes are growing exponentially. As a result, data storage, management, and analysis have become a major bottleneck in the overall workflow and are increasing the underlying cost. This bottleneck is getting worse as state-of-the-art methods let users extract more information from their NGS data, which makes the analytical pipelines more computationally intensive.
To address these computational challenges, the NVIDIA Clara healthcare team has started accelerating and optimizing these genomic analysis pipelines on Graphics Processing Units (GPUs). Traditionally, these chips were used for video-based applications, and as GPUs became more computationally powerful, compute-oriented general-purpose workloads started taking advantage of them. GPUs are now the main workhorse of the majority of supercomputers and data centers for accelerating key applications.
Clara Parabricks overview
NVIDIA introduced the Clara Parabricks software suite of accelerated genomic analysis to support the three major NGS applications – germline variant analysis, somatic variant detection, and RNA-Seq analysis. The overall goal for the Clara Parabricks software is to provide at least an order of magnitude acceleration in compute time while generating identical outputs and reducing analysis costs. This powerful suite of genomic analysis tools called Clara Parabricks is now available on AWS as an AWS Marketplace AMI. It provides optimal performance for multiple instance types and can be used out of the box for essential bioinformatics needs. Currently, the Clara Parabricks accelerated analysis tools start with a FASTQ file to perform alignment through variant calling and expression analysis, including QC tools for both types of outputs. The suite of 33 tools can be used to support end-to-end workflows for germline, somatic and RNA-Seq pipelines, providing the flexibility to meet the individual needs of most projects.
The figure below shows most of the accelerated tools within the Clara Parabricks software package. Because the pipelines are accelerated, users can run multiple variant callers to extract the most information from their data, and still generate the results in less time and at lower cost than with standard baseline software. For example, GATK’s HaplotypeCaller and Google’s DeepVariant can be used to generate two VCFs for the same dataset. This enables researchers to either take the union of both call sets to minimize false negatives or take the intersection to reduce false positives. A standard 30x WGS sample can be processed with both variant callers in less than an hour on a p4d.24xlarge instance on AWS. Currently, the Clara Parabricks Pipelines software suite supports Amazon EC2 G4dn, P3, and P4d instances on AWS.
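To make the union and intersection idea concrete, here is a minimal sketch that compares two call sets by site key (CHROM:POS:REF:ALT). The function names are ours, not part of Parabricks, and the sketch assumes plain-text (uncompressed) VCFs with no normalization or splitting of multi-allelic records, so treat it as an illustration rather than a production comparison tool.

```shell
# Sketch: compare two VCFs by site key (CHROM:POS:REF:ALT).
# Function names are hypothetical; records are not normalized or split.

# Print one sorted, unique key per variant record, skipping header lines.
vcf_keys() {
    awk -F '\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | LC_ALL=C sort -u
}

# Keys called by both tools (fewer false positives).
vcf_intersection() (
    a=$(mktemp)
    b=$(mktemp)
    vcf_keys "$1" > "$a"
    vcf_keys "$2" > "$b"
    LC_ALL=C comm -12 "$a" "$b"
    rm -f "$a" "$b"
)

# Keys called by either tool (fewer false negatives).
vcf_union() {
    { vcf_keys "$1"; vcf_keys "$2"; } | LC_ALL=C sort -u
}

# Usage (file names are placeholders):
#   vcf_intersection haplotypecaller.vcf deepvariant.vcf | wc -l
#   vcf_union        haplotypecaller.vcf deepvariant.vcf | wc -l
```

For real analyses, a dedicated VCF comparison tool that handles normalization and genotypes is the safer choice.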
Germline analysis
In this blog, we focus on germline variant detection and associated applications. Germline variants are the genetic variants an organism inherits from its parents. This is one of the most popular analyses for DNA sequencing data and is a prerequisite to population-scale Genome Wide Association Studies (GWAS). With Clara Parabricks software, a user can go from a 30x human WGS FASTQ to a VCF, using an analysis comparable to the GATK best practices germline pipeline (shown below), in as little as 25 minutes (actual time depends on the instance type chosen). The same analysis on one CPU instance with out-of-the-box software can take close to 30 hours. While Clara Parabricks germline analysis also supports Google’s DeepVariant, in this blog we focus on the GATK4 best practices pipeline. Similar runtimes can be expected for Google’s DeepVariant as well.
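As a quick sanity check on those two figures, the implied speedup can be worked out with shell integer arithmetic. Both inputs are the approximate values quoted above (about 30 hours on CPU, about 25 minutes on GPU), so the result is a rough estimate only.

```shell
# Rough speedup implied by the figures quoted above.
cpu_minutes=$((30 * 60))   # ~30 hours on a single CPU instance
gpu_minutes=25             # best-case GPU runtime quoted above
speedup=$((cpu_minutes / gpu_minutes))
echo "Approximate speedup: ${speedup}x"   # prints "Approximate speedup: 72x"
```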
Running Parabricks Germline Pipeline on AWS
Prerequisites
The prerequisites for running Parabricks on AWS are:
- An AWS account with permission to provision Amazon EC2 and Amazon S3 resources and to access AWS Marketplace.
- A VPC with at least a public subnet and a private subnet routed to a NAT Gateway.
Getting Started with the AMI
Step 1: Subscribe to NVIDIA Clara Parabricks Pipelines AMI in AWS Marketplace
NVIDIA Clara Parabricks Pipelines are available as an Amazon Machine Image (AMI) in AWS Marketplace. An AMI provides the necessary information to launch an Amazon EC2 instance. It can also be used to launch multiple instances of the same configuration. To subscribe to this AMI:
- Log in to your AWS account.
- Go to AWS Marketplace and search for “NVIDIA Clara Parabricks Pipelines”.
- Click on Continue to Subscribe.
- Read and accept the terms and conditions.
Step 2: Launch an EC2 instance
Once the subscription is complete, the AMI will appear on your list of AWS Marketplace Subscriptions, as shown in Figure 3. To launch an EC2 instance using this AMI:
- Click on Launch new instance.
- Use the pre-set values for Delivery method and Software version and select the Region in which you want to launch the instance.
- Choose an EC2 instance type. The recommended instance type is g4dn.12xlarge, which is selected by default.
- Configure instance detail. For the purpose of this demonstration, you can select the default setting and click on Review and Launch. This will launch a g4dn.12xlarge instance in a default VPC and subnet.
Once the instance is launched and ready to be used, use SSH to log in to it.
ssh -i <access-key.pem> user-id@<public-DNS>
Step 3: Download data
cd /mnt/disks/local
aws s3 cp s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R1.fastq.gz .
aws s3 cp s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/HG001.novaseq.pcr-free.30x.R2.fastq.gz .
aws s3 cp s3://parabricks.sample/parabricks_sample.tar.gz .
tar -xvzf parabricks_sample.tar.gz
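Large downloads occasionally end up truncated, so it can be worth checking the FASTQ files before starting a long run. The helper below is a minimal sketch (the function name `verify_gz` is ours): it only confirms that each file exists, is non-empty, and passes gzip's built-in integrity test. It does not validate FASTQ content, and testing a 30x WGS file this way can take several minutes.

```shell
# Return 0 only if every file given exists, is non-empty,
# and passes gzip's integrity test. Hypothetical helper, not part of Parabricks.
verify_gz() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
        # Read from stdin so the check does not depend on the file suffix.
        if ! gzip -t < "$f" 2>/dev/null; then
            echo "corrupt gzip stream: $f" >&2
            return 1
        fi
    done
    return 0
}

# Usage, once the downloads above have finished:
#   verify_gz HG001.novaseq.pcr-free.30x.R1.fastq.gz \
#             HG001.novaseq.pcr-free.30x.R2.fastq.gz && echo "inputs look intact"
```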
Step 4: Run Parabricks
To run the Parabricks germline pipeline on the dataset downloaded above, you can use the following command:
cd /mnt/disks/local
pbrun germline --ref parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
--in-fq HG001.novaseq.pcr-free.30x.R1.fastq.gz HG001.novaseq.pcr-free.30x.R2.fastq.gz \
--out-bam 30x.bam \
--out-variants 30x.vcf \
--knownSites parabricks_sample/Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
--out-recal-file 30x_recal.txt
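If you plan to repeat this for several samples, one option is to generate the command from a sample prefix and a FASTQ pair. The sketch below only prints the command so it can be reviewed first; nothing is executed. The function name `germline_cmd` and the optional reference-directory argument are our own conventions, and the only pbrun flags used are the ones shown in the command above.

```shell
# Print the Step 4 command for a given output prefix and FASTQ pair.
# Hypothetical helper: it builds the command text only and runs nothing.
# Arguments: prefix, R1 FASTQ, R2 FASTQ, [reference directory]
germline_cmd() {
    prefix=$1
    r1=$2
    r2=$3
    ref_dir=${4:-parabricks_sample/Ref}
    printf 'pbrun germline --ref %s --in-fq %s %s --out-bam %s.bam --out-variants %s.vcf --knownSites %s --out-recal-file %s_recal.txt\n' \
        "$ref_dir/Homo_sapiens_assembly38.fasta" \
        "$r1" "$r2" \
        "$prefix" "$prefix" \
        "$ref_dir/Homo_sapiens_assembly38.known_indels.vcf.gz" \
        "$prefix"
}

# Usage:
#   germline_cmd 30x HG001.novaseq.pcr-free.30x.R1.fastq.gz \
#                    HG001.novaseq.pcr-free.30x.R2.fastq.gz
```

The printed line assumes paths without spaces; quote or adjust them if your layout differs.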
Benchmark analysis
We use the 30x HG001 WGS data hosted on Amazon S3 for a thorough analysis across instance types: s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/30x/
We use 50x, 75x, and 100x WES data for HG001 found here: s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wes_agilent/
We also use 50x WGS data for HG001 found here: s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/50x/
We run the Parabricks software across the following instance types: g4dn.12xlarge, g4dn.metal, p3dn.24xlarge, and p4d.24xlarge.
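To reproduce this kind of comparison yourself, you need a consistent way to record wall-clock time on each instance. The wrapper below is one possible sketch, not the harness used for the results in this post: `time_run`, the `LOG` variable, and the default log file name are all hypothetical, and timing resolution is whole seconds.

```shell
# Run a command and append "label,elapsed_seconds,exit_status" to a CSV log.
# Hypothetical helper; LOG defaults to benchmark_times.csv in the current directory.
time_run() {
    label=$1
    shift
    start=$(date +%s)
    # Capture the exit status without aborting under `set -e`.
    "$@" && status=0 || status=$?
    end=$(date +%s)
    echo "$label,$((end - start)),$status" >> "${LOG:-benchmark_times.csv}"
    return "$status"
}

# Usage on each instance under test (the label is free text):
#   time_run g4dn.12xlarge pbrun germline --ref ... --in-fq ... --out-bam ...
```

Collecting the CSV files from each instance then gives one row per run that can be compared directly.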
Results
Comparison of cost and runtime on different instance types
The Clara Parabricks software can run on the Amazon EC2 G4dn, P3, and P4d instance families at different price and performance points. Figure 4 presents the runtime of the germline pipeline on the above dataset across different GPU instances on AWS. As reported, the analysis of a 30x genome can be completed in 80 minutes on a g4dn.12xlarge instance at an approximate cost of $1.62 with Spot Instances or $5.20 with On-Demand Instances. For faster runtimes, the same analysis can be done in 28 minutes using the eight NVIDIA A100 GPUs of a p4d.24xlarge instance, at $4.60 with Spot Instances or $15 with On-Demand Instances. The choice of instance type depends on the end user’s needs, and the Clara Parabricks software is ready to run from the AWS Marketplace AMI.
The fq2bam step includes bwa-mem and parts of coordinate sorting. Post-processing includes parts of coordinate sorting and marking duplicates, followed by BQSR. The haplotypecaller step includes the applybqsr step applied to the input BAM, which is then fed to the variant calling step.
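The per-run costs above follow from a simple relation: cost equals the hourly instance rate multiplied by the runtime in hours. The sketch below shows that arithmetic in both directions. The function names are ours, and the hourly rates it produces are only those implied by the figures in this post, not current AWS prices, which vary by Region and over time.

```shell
# Cost of one run = hourly rate (USD) x runtime (minutes) / 60.
run_cost() {
    awk -v rate="$1" -v minutes="$2" 'BEGIN { printf "%.2f\n", rate * minutes / 60 }'
}

# Hourly rate implied by a reported per-run cost and runtime.
implied_hourly_rate() {
    awk -v cost="$1" -v minutes="$2" 'BEGIN { printf "%.2f\n", cost * 60 / minutes }'
}

# Examples using the On-Demand figures quoted above:
#   implied_hourly_rate 5.2 80   # g4dn.12xlarge: prints 3.90
#   implied_hourly_rate 15 28    # p4d.24xlarge:  prints 32.14
#   run_cost 3.90 80             # prints 5.20
```

Substituting the current rate for your Region into `run_cost` gives an up-to-date estimate for the same runtimes.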
Concordance analysis
The data used above is the Precision FDA HG001 dataset downsampled to 30x coverage [3]. Running the hap.py tool on the output VCFs of the Clara Parabricks analysis and the baseline GATK4, we found the concordance (F1-score) to be 0.9999. The remaining differences are not systematic; they stem from a few differences between the Clara Parabricks and GATK4 code related to random number generation and non-determinism when running GATK4. Clara Parabricks was designed to be deterministic on every hardware platform and hence does not replicate the parts of GATK4 that result in non-deterministic output.
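For readers less familiar with the metric, the F1-score combines precision and recall and can be written in terms of true positives (TP), false positives (FP), and false negatives (FN) as 2·TP / (2·TP + FP + FN). The sketch below computes it from counts. The function name and the example counts are hypothetical and chosen only to show what a score of 0.9999 looks like; they are not the counts from our comparison.

```shell
# F1 = 2*TP / (2*TP + FP + FN), printed to four decimal places.
# Arguments: TP FP FN (hypothetical helper for illustration).
f1_score() {
    awk -v tp="$1" -v fp="$2" -v fn="$3" \
        'BEGIN { printf "%.4f\n", 2 * tp / (2 * tp + fp + fn) }'
}

# Illustrative counts only:
#   f1_score 9999 1 1    # prints 0.9999
#   f1_score 50 25 25    # prints 0.6667
```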
Performance and scaling
The Clara Parabricks germline pipeline has been used in several scientific projects and has shown over 0.9999 F1 concordance with baseline GATK [4]. The software scales to meet the demands of any NGS project and has been run on thousands of whole human genomes and more than a million exomes. Figure 6 shows the runtimes and associated costs when tested on the given WGS and WES samples with varying coverage. It is based on On-Demand Instance pricing for the less expensive g4dn.12xlarge instance. A 50x whole exome can be analyzed in 10 minutes for under 25 cents with Spot Instance pricing and under 70 cents with On-Demand Instance pricing. The results show how the runtimes and costs scale proportionally with the amount of data. While this blog focuses on HaplotypeCaller for variant calling, similar levels of speedup and of accuracy relative to the baseline version can be achieved for the germline pipeline by using Clara Parabricks DeepVariant.
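One way to translate these runtimes into project planning is to estimate how many samples a single instance can process per day when runs are executed back to back. The sketch below uses integer arithmetic and ignores data transfer, instance startup, and queueing, so it gives an upper bound. The function name is ours, and the runtimes in the usage lines are the ones quoted in this post.

```shell
# Upper bound on samples per instance per day for sequential runs.
# Argument: runtime per sample in minutes (hypothetical helper).
samples_per_day() {
    echo $(( 24 * 60 / $1 ))
}

# Using runtimes quoted in this post:
#   samples_per_day 80   # 30x WGS on g4dn.12xlarge: prints 18
#   samples_per_day 28   # 30x WGS on p4d.24xlarge:  prints 51
#   samples_per_day 10   # 50x WES on g4dn.12xlarge: prints 144
```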
Customer success stories
Clara Parabricks Pipelines is now the trusted software of choice for several key large-scale genomics projects such as:
- Regeneron Genetics Center: Regeneron Genetics Center (RGC) is using the Clara Parabricks accelerated compute framework as the foundation for generating scalable, high-quality germline data that can be reproduced across the genomics community. RGC has implemented DeepVariant with Parabricks to maximize the value of this combined, enterprise-level resource. This supports applications from target discovery pipelines through public-industry collaborations, including the UK Biobank and Geisinger Health. To date, more than 1.2 million exomes have been processed with this analytical pipeline.
- Japan project: The Human Genome Center (HGC) at the University of Tokyo uses Clara Parabricks to accelerate genomic analysis by 40x compared to a CPU-based environment.
- Thailand project: Clara Parabricks is used as the sequencing analysis software at the National Biobank of Thailand (NBT) to accelerate genomic sequencing as part of the government’s plan to promote genomic medicine in Thailand.
- TGen: TGen is using NVIDIA GPUs and Clara Parabricks Pipelines to power a highly integrated high performance computing (HPC) infrastructure that delivers sequence analysis results to collaborative teams of researchers and physicians. The system accelerates whole genome sequencing analysis of both healthy and diseased cells, which helps determine the most effective therapy for each patient.
Conclusion
The Clara Parabricks software suite provides multifold acceleration for major genomic analysis pipelines. This not only leads to faster analysis times, but also provides significant cost reduction on AWS. Traditional analysis that takes hours can be done in a few minutes and can be scaled to process large numbers of samples using AWS. With the AWS Marketplace AMI, you can accelerate your workflows with a few clicks. As we continue to update the Parabricks Pipelines, the number of tools will continue to expand. Furthermore, the software suite supports the latest releases of the individual tools, letting users take advantage of the latest advancements in the genomics community. To start your evaluation of Clara Parabricks, visit AWS Marketplace and subscribe to our AMI.
References:
[1] McCombie WR, McPherson JD, Mardis ER. Next-Generation Sequencing Technologies. Cold Spring Harb Perspect Med. 2019 Nov 1;9(11):a036798. doi: 10.1101/cshperspect.a036798. PMID: 30478097; PMCID: PMC6824406.
[2] GA4GH publishes review of national genomic data initiatives: https://www.ga4gh.org/news/ga4gh-publishes-review-of-national-genomic-data-initiatives/
[3] Precision FDA Truth Challenge: https://precision.fda.gov/challenges/truth
[4] https://docs.nvidia.com/clara/parabricks/v3.6/text/publications_list.html