AWS Partner Network (APN) Blog
Empowering Researchers to Run HPC Workloads on AWS with Research Gateway
By Shun Utsui, Noorbakht Khan, and Robert M. Bata – AWS
By Puneet Chaddah, CTO – Relevance Lab
By Sundeep Mallya, Head of Research Gateway Product – Relevance Lab
Modern scientific research depends heavily on processing large-scale datasets, which requires elastic, scalable, easy-to-use, and cost-effective computing resources.
Amazon Web Services (AWS) offers the broadest and deepest functionality for compute and storage resources, including high-performance computing (HPC). However, researchers without foundational knowledge of AWS can find these services challenging to consume. They want simple and efficient ways of running their research workloads, where it's easy to provision scalable research environments, run their workloads, and have a complete view of the costs.
To address these customer requirements, Relevance Lab developed Research Gateway, a solution that simplifies access to high-performance computing on AWS and makes it easy for researchers to provision and use HPC resources.
Research Gateway simplifies access to HPC clusters using a self-service portal, making provisioning and configuration of an elastic cluster easy for researchers. This helps them focus on the research itself, while leveraging AWS ParallelCluster for their scientific computing.
Relevance Lab is an AWS Select Tier Services Partner and AWS Marketplace Seller that’s a platform-led services company specializing in cloud, DevOps, automation, analytics, and digital transformation.
In this post, we provide an overview of the Research Gateway solution architecture and standard AWS HPC research workflow. Focusing on a single use case, the post includes a walkthrough of how to access Research Gateway, provision products required to build an HPC stack on AWS, and run GROMACS analysis. We also demonstrate how to view the outputs and dive deep on the cost components of this workload.
Solution Overview
The Research Gateway architecture allows products to be provisioned in minutes in a simple interface that’s easy to navigate and use by researchers. The Research Gateway instance is deployed using Amazon Elastic Kubernetes Service (Amazon EKS) and acts as the orchestration layer, providing capabilities to manage research projects.
Each project can be managed in a separate AWS account to provide granular governance and control, based on the administrative unit. Figure 1 shows the architecture diagram for an HPC environment (AWS ParallelCluster) deployed through Research Gateway.
Figure 1 – Research Gateway architecture with AWS ParallelCluster.
The key capabilities of the solution include:
- Access research environments in minutes, with user sign-up, authentication, and federation provided by Amazon Cognito and third-party identity providers. With a simple interface, administrators and lead researchers manage researcher permissions and accounts, as well as workspace permissions.
- Ease of access through a self-service mechanism to deploy repeatable, pre-built templates comprising infrastructure, application, and workflow components. A self-service catalog built on AWS Service Catalog is presented to end users to find and deploy approved IT templates and AWS Marketplace products.
- Secure access for users and administrators to project resources using browser-based interfaces, protected with SSL/TLS. To make this process scalable and reliable, the solution uses AWS WAF, AWS Certificate Manager, and Application Load Balancer. Finally, Amazon CloudFront is used to improve performance by lowering the latency between the user and the web server.
- Cost transparency and spend controls using AWS Budgets and AWS Cost Explorer to track projects, researchers, products, and pipelines. Consumption guardrails are implemented at the project level to flag any breach of project budgets and trigger pause or stop actions on a project's resources.
- Tertiary analysis with interpretation and deep learning using Genomics Data Lake integration, RStudio, and Amazon SageMaker.
- Reduction in time to research by selecting commonly used pipelines that are pre-configured to execute on AWS Batch.
- Reuse of existing pipeline code, letting researchers bring their own code through AWS CodeCommit or containers from Amazon Elastic Container Registry (Amazon ECR).
- Environment monitoring, governance, and observability implemented through Amazon CloudWatch and AWS CloudTrail with the ability to integrate with customer enterprise systems.
Project Storage on Amazon S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers high-throughput data ingestion, cost-effective storage options, secure access, and efficient searching.
Research Gateway creates a shared project storage backed by Amazon S3 on a per-project basis for its users, who can also assign datasets to a project in this storage system. This means customers can use public datasets hosted in the AWS Open Data Sponsorship Program by default.
Researchers can also bring large on-premises datasets on board using AWS DataSync or upload files using the user interface (UI).
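For smaller transfers, the AWS Command Line Interface (AWS CLI) is another option, assuming you have credentials for the project's AWS account; the local path and bucket name below are placeholders:

$ aws s3 sync ./my-dataset s3://my-project-bucket/datasets/   # recursively upload a local folder to the project bucket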
High-Speed File System on Amazon FSx for Lustre
Amazon FSx for Lustre provides fully managed shared storage with the scalability and performance of the popular Lustre file system.
Research Gateway makes it easy for users to create an FSx for Lustre file system and populate it with data from an Amazon S3 bucket. This storage can be accessed with low (sub-millisecond) latencies by the compute instances in the HPC cluster, and can provide up to 1,000 MB/s of throughput per TiB of storage.
In addition, Amazon FSx for Lustre takes care of administrative tasks such as patching, backups, and maintenance. With FSx for Lustre, users can quickly deploy and configure Lustre file systems without having to manage complex underlying operations.
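Research Gateway drives this process through its UI. For reference, a roughly equivalent file system linked to an S3 bucket could also be created directly with the AWS CLI; the sketch below uses a minimal scratch configuration and placeholder identifiers:

$ aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids subnet-0123456789abcdef0 \
    --lustre-configuration "DeploymentType=SCRATCH_2,ImportPath=s3://my-project-bucket"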
HPC Deployment with AWS ParallelCluster
AWS ParallelCluster is an open-source cluster management tool written in Python and available via the standard Python Package Index (PyPI). Version 3.x also provides an API, which Research Gateway leverages to integrate with the AWS Cloud and to set up and use HPC clusters for complex computational tasks.
AWS ParallelCluster supports two different schedulers, AWS Batch and Slurm, which cover a vast majority of the requirements in the field. ParallelCluster brings many benefits including scalability, manageability of clusters, and seamless migration to the cloud from on-premises HPC workloads.
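Outside of Research Gateway, you can install the same tool directly from PyPI if you want to experiment with it; a minimal sketch (in practice, consider pinning a specific version):

$ pip install aws-parallelcluster
$ pcluster version   # confirms the CLI is installed and reports its version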
Visualization with NICE DCV
NICE DCV is a high-performance remote display protocol used to deliver remote desktops and application streaming from resources in the cloud to any device. Users can leverage this for their visualization requirements.
NICE DCV on AWS provides a highly secure infrastructure that includes features such as data encryption, secure connectivity, and access control. It also has the ability to transfer pixels efficiently, resulting in a fast and responsive user experience.
In addition, NICE DCV on AWS helps organizations reduce costs because they pay only for the underlying infrastructure they consume, with no additional license fees.
Running and Visualizing an HPC Application on Research Gateway
Now, let’s go through the process of deploying a scalable HPC system on AWS with flexible sharing mechanisms across team members and cost guardrails.
In this example, we are showcasing the workflow of a real-world HPC application called GROMACS (or GROningen Machine for Chemical Simulations), which is a free and open-source software suite for high-performance molecular dynamics. It’s popular amongst researchers in domains such as materials science.
Step 1: Upload Your Data
First, we want to make sure we have a project bucket where we can upload data and share any work with other project members. Under My Products, click on the Amazon S3 product.
Figure 2 – My Products screen where you’ll find your project storage Amazon S3.
This is the bucket for the project we are assigned to. In this bucket, create a new folder named “benchmarks” and upload the input file for the GROMACS job. As an example, we are using the benchRIB (2M atoms, ribosome in water, 4 fs time step) case published by the Max Planck Institute.
Upload the benchRIB.tpr file to the folder we created and confirm the upload completed.
Figure 3 – Amazon S3 bucket for storing files.
In the next step, we're going to create an Amazon FSx for Lustre file system and link the S3 bucket to it. To do so, we'll need the name of the S3 bucket, so copy it to the clipboard.
The Amazon S3 integration makes it possible for files in S3 to be accessed on an Amazon Elastic Compute Cloud (Amazon EC2) instance in a POSIX-compliant way through FSx for Lustre. The data stored in S3 is lazy-loaded into FSx for Lustre when a user first accesses it. You can also export your data back to S3 once you're done processing files on FSx for Lustre to cost-optimize your storage utilization.
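For example, once the file system exists and an export path is configured, individual files can be pushed back to the linked bucket with standard Lustre HSM commands from any instance that mounts the file system; the file path below is a placeholder for your own output:

$ sudo lfs hsm_archive /fsx/gromacs/results.gro   # queue an export of this file to the linked S3 bucket
$ sudo lfs hsm_action /fsx/gromacs/results.gro    # check progress; NOOP indicates the export has finished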
Step 2: Deploy High-Speed Parallel File System
Next, we’ll deploy a high-speed parallel file system. This step is required for users who need a parallel file system for intensive I/O processing. From Available Resources, choose FSx for Lustre, give the file system a unique name, and specify the capacity needed.
In the previous step, we copied our S3 bucket name to the clipboard. Under ImportPath, paste the bucket name to link the bucket to the Lustre file system, then click LAUNCH NOW. The FSx for Lustre file system launches in around 10-15 minutes.
Figure 4 – Amazon FSx for Lustre details on Research Gateway.
Once the file system is ready, you’ll see Provisioned with a green light on the top right of your screen. In the Outputs tab, you’ll find the unique ID of your FSx for Lustre file system. Copy the file system ID to the clipboard so you can mount it on your HPC cluster, which we’ll create in the next step.
Step 3: Deploy HPC Cluster
We are now ready to deploy our compute resources through ParallelCluster. On the PCluster configuration screen, select an SSH key pair. If you don’t have a key pair, create one by clicking on the “+” sign next to the KeyPair drop-down.
Figure 5 – Setup of PCluster.
In this demo, we select c6i.2xlarge for the head node, and choose the default virtual private cloud (VPC). You can also choose other VPCs and subnets by preconfiguring them in your AWS account.
For the FileSystemType, choose FSxForLustre because in this scenario we want to work on a parallel filesystem that supports high I/O requirements. For workloads that don’t require high I/O, you can also choose “EFS” or “EBS” for cost optimization.
Amazon Elastic File System (Amazon EFS) is an elastic file storage based on NFSv4, and you only pay for the storage capacity used. Amazon Elastic Block Store (Amazon EBS) is a high-performance block storage you can attach to EC2 instances. Amazon EBS doesn’t attach to multiple instances by itself, but AWS ParallelCluster exports EBS volumes from the head node over NFS to its compute instances.
In the Scheduler Configuration section, select Slurm as our Scheduler. For ComputeNodeInstanceType, select c6i.32xlarge. This instance type provides 64 physical cores per instance.
Meanwhile, QueueCapacityType can be either SPOT or ONDEMAND. In this demo, we select ONDEMAND, but if you have short, interruption-tolerant jobs you could choose SPOT to cost-optimize your scientific computation.
Next, enter “0” for MinimumInstances so the cluster scales down to zero compute instances when there are no jobs submitted. In this example, we enter “4” for MaximumInstances, but you can set this number higher or lower depending on your compute requirements.
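Under the hood, selections like these correspond to an AWS ParallelCluster 3.x cluster configuration. The sketch below is a hypothetical hand-written equivalent of the choices above, not the exact template Research Gateway generates; the subnet, key pair, and file system IDs are placeholders:

cat > cluster-config.yaml <<'EOF'
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c6i.2xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0    # placeholder subnet
  Ssh:
    KeyName: my-keypair                   # placeholder SSH key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      CapacityType: ONDEMAND              # or SPOT for interruption-tolerant jobs
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0
      ComputeResources:
        - Name: c6i32xlarge
          InstanceType: c6i.32xlarge
          MinCount: 0                     # scale to zero when idle
          MaxCount: 4
SharedStorage:
  - MountDir: /fsx
    Name: projectfsx
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-0123456789abcdef0  # the file system ID from Step 2
EOF
pcluster create-cluster --cluster-name rg-demo --cluster-configuration cluster-config.yaml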
Once the cluster is deployed, we can log onto it by clicking on Remote Desktop to access our HPC cluster over NICE DCV. Once we specify the key pair created for the cluster and click Login, we are logged on to the graphical user interface (GUI) of the head node. If you don’t need a GUI, you can click on SSH Terminal, which opens a terminal screen.
On the head node, you can open a terminal and check the configuration of your cluster. We can see here the Lustre file system is mounted on /fsx.
Figure 6 – GUI view of Lustre file system.
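From the same terminal, a couple of standard commands confirm the mount and the Slurm setup (output will vary with your configuration):

$ df -h -t lustre   # the FSx for Lustre file system should appear, mounted at /fsx
$ sinfo             # lists the Slurm partition and its (initially powered-down) compute nodes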
Step 4: Install Scientific Application
Before we use the cluster, we’ll need to install an application and upload the input data. Research Gateway allows you to deploy ParallelCluster with Spack, which is a package manager designed to install scientific software. In this example, we are installing GROMACS with Spack.
On the terminal, execute the following command:
$ spack install gromacs
Now that the application is installed, we can submit our first job onto the scalable HPC cluster. With ParallelCluster, compute instances will only be deployed when there’s a job submitted through the Slurm job scheduler.
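Before moving on, you can run a quick sanity check on the head node to confirm the Spack build succeeded:

$ spack load gromacs
$ gmx_mpi --version   # prints the GROMACS version information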
Step 5: Execute Job and Visualize Results
We are submitting the following script, named gromacs.sh, to the cluster. This script requests two compute instances, each with 64 cores. We also specify the input data we uploaded to our S3 bucket, which is visible on FSx for Lustre at /fsx/gromacs/benchRIB.tpr.
#!/bin/bash
#SBATCH --job-name=bench-RIB
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --error=%x_%j.err
#SBATCH --output=%x_%j.out
#SBATCH --exclusive

# Load the Spack-built GROMACS and run the benchmark across all allocated MPI tasks
spack load gromacs
cd /fsx/gromacs/
mpirun -np ${SLURM_NTASKS} gmx_mpi mdrun -s /fsx/gromacs/benchRIB.tpr
We are now ready to submit our first job onto the cluster:
$ sbatch gromacs.sh
AWS ParallelCluster will take a job submission through Slurm as the trigger to start its compute resources. A few minutes after we submit our GROMACS job, it turns into “R” or the Running state. This process can be monitored with the squeue command. The job takes about five minutes to complete.
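For example, from the head node you can watch the queue and inspect a specific job; the job ID below is a placeholder:

$ watch -n 10 squeue    # refresh the queue view every 10 seconds (CF while nodes start, R while running)
$ scontrol show job 42  # detailed state of a single job; replace 42 with your job ID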
Once completed, we see the output files under the same directory as our input files. Using a visualization tool such as Visual Molecular Dynamics (VMD), we can visualize the results of the job and interact with it. This can be done by accessing the ParallelCluster head node through NICE DCV.
Figure 7 – Using VMD on NICE DCV to visualize the results.
Cost Consumption Monitoring
Personas selected by the customer, such as the principal investigator (PI), have the ability to see cost information on the Research Gateway interface. PIs can set up budgets for the projects created, and these can be adjusted flexibly based on the situation of the project.
If a project that started under a certain budget is allocated additional funds later, Research Gateway gives PIs the flexibility to change the project's budget. The PI of a project can also view a cost breakdown and analysis of who is spending on which AWS services.
Figure 8 – Monitor project costs.
PIs can drill down into details such as when a certain resource was provisioned or terminated, and how much cost is associated with it. This gives PIs the opportunity to catch cost anomalies or overspending on AWS resources early.
By default, Research Gateway alerts the PI when spending exceeds 80% of the budget and stops resources when it reaches 90%. Therefore, PIs can safely let their team members focus on research and innovation without worrying about overspending on AWS resources.
Figure 9 – Monitor researcher and resource-level aggregated costs.
Summary
To help researchers provision a secure, performant, and scalable high-performance computing (HPC) environment on AWS, Relevance Lab developed the Research Gateway. In this post, we walked you through how a researcher can access Research Gateway, provision products required for their GROMACS workload, perform analysis, and review outcomes.
To get started with HPC on AWS and run your first research workload, enroll in a free trial with Research Gateway.
Contact Relevance Lab to learn how Research Gateway can help you accelerate HPC adoption and map it to your use cases. You can also learn more about Relevance Lab in AWS Marketplace.
Relevance Lab – AWS Partner Spotlight
Relevance Lab is an AWS Select Tier Services Partner that’s a platform-led services company specializing in cloud, DevOps, automation, analytics, and digital transformation.