AWS HPC Blog

Run protein folding on AWS with Quantori

This post was contributed by Marissa Powers, Konstantinos Tzouvanas, and Pavlos Kaimakis from AWS, and Mikhail Serkov, Senior Director, HPC Services at Quantori.

Proteins are large biomolecules and the target for the majority of pharmaceutical drugs on the market today. Researchers are increasingly using machine-learning models to predict the three-dimensional structure and properties of proteins. With the proliferation of these models, scientists are looking for easy-to-use interfaces that let them test models at the best possible price-performance.

Quantori is a scientific informatics and data science company that has been working closely with biopharmaceutical scientists to build a solution for running generative AI protein models on AWS. The solution lets users run protein engineering analyses and visualizations in their own AWS accounts.

In this post, we’ll show the user interface for this solution, the architecture, and some example price and performance benchmarks so you can get a feel for how this might work for you.

User interface

One of the key features of the Quantori solution is the user interface, which makes it easier for users to predict protein structures. Let’s walk through some key components of the solution and the user experience – including the login interface, running an analysis, and viewing results.

Login

The application uses username and password authentication by default and stores the credentials in the solution’s database. To sign up, you enter an email and manually create a password. Integration with existing identity managers is possible for production use.

Figure 1 – The login screen for the Quantori protein engineering solution. You can create a new username and password on the first login.

Running an analysis

Once logged in, you will first see a list of proteins registered in the system (Figure 2). The list includes the number of prior runs and user-selected tags for each protein. You can easily filter the list by tag to show only protein runs associated with specific projects.

To add a new protein, you can explicitly type the amino acid sequence or upload a FASTA file (a text-based format for storing DNA and protein sequences).
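To make the FASTA format concrete, here is a minimal Python sketch of how such a file could be read and sanity-checked before submission. It is not the solution's own parser; the function names, the accepted residue alphabet, and the example record are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set

# Assumption: accept the 20 standard amino acids plus X (unknown residue).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            # Sequence lines may be wrapped; concatenate them.
            records[current_id] += line.upper()
    return records


def invalid_residues(sequence: str) -> Set[str]:
    """Return any characters that are not in the accepted alphabet."""
    return set(sequence) - VALID_RESIDUES


# Hypothetical record, not one of the proteins discussed in this post.
example = """>demo_protein hypothetical example
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(example.splitlines())
sequence = records["demo_protein"]
print(len(sequence), invalid_residues(sequence))
```

The header line starts with `>` and the sequence can wrap over several lines, which is why the parser concatenates lines until it reaches the next header.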

Figure 2: The main landing page for the solution shows a list of proteins that have been run in the platform, including tags and the number of runs for each protein.

Selecting a specific protein shows a summary of the execution results, including whether the runs were successful and the method used for prediction (e.g., AlphaFold2, as shown in Figure 3).

Figure 3: The summary page for protein 7FCC shows three runs completed, all with AlphaFold2. The UI shows the three-dimensional structure for the protein, too, along with the execution timestamp and runtime.

To run an analysis, select Add run from the protein landing page (shown in Figure 3). This opens a page where you can select the machine-learning (ML) algorithm to run, the Amazon Elastic Compute Cloud (Amazon EC2) instance type for each stage, and the database configuration (Figure 4).

Once you select your preferred settings, you just hit Run Folding! to begin the analysis.

Figure 4: To run a new analysis, you select the ML algorithm, Amazon EC2 instance type(s), and database configuration to use. This is the user interface for selecting these configurations and starting a run.

View results

Once the analysis is complete, you can visualize your results (which we’ve shown in Figure 5). The interface includes confidence score metrics for the models run, along with the three-dimensional structure of the protein.

Figure 5: The results interface shows confidence metrics for the models run, along with the three-dimensional structure of the protein.

Architecture

The Quantori solution uses Amazon EC2 instances to host the user interface and AWS Batch to run protein folding jobs. We’ve shown an overview of the architecture in Figure 6.

The solution executes the user interface and AWS Batch jobs in separate Amazon Virtual Private Clouds (Amazon VPCs). The solution reserves one VPC for the user frontend, and a second for the backend.

Based on the selected configuration during a prediction run, AWS Batch downloads containers from Amazon Elastic Container Registry (ECR) and deploys them to appropriate instances in the backend VPC.
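For readers who have not used AWS Batch, the following Python sketch shows roughly what submitting a containerized job looks like with the boto3 `submit_job` call. It is a sketch under stated assumptions, not the Quantori implementation: the job name, queue name, job definition name, the `FASTA_S3_URI` environment variable, and the S3 path are all hypothetical.

```python
from typing import Any, Dict


def build_batch_job_request(
    job_name: str,
    job_queue: str,
    job_definition: str,
    fasta_s3_uri: str,
    vcpus: int,
    memory_mib: int,
    gpus: int = 0,
) -> Dict[str, Any]:
    """Build keyword arguments for the AWS Batch SubmitJob API."""
    resources = [
        {"type": "VCPU", "value": str(vcpus)},
        {"type": "MEMORY", "value": str(memory_mib)},
    ]
    if gpus:
        # Only request GPUs for stages that need them.
        resources.append({"type": "GPU", "value": str(gpus)})
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # Hypothetical variable name; a real container defines its own inputs.
            "environment": [{"name": "FASTA_S3_URI", "value": fasta_s3_uri}],
            "resourceRequirements": resources,
        },
    }


def submit(request: Dict[str, Any]) -> str:
    """Submit the job and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request can be built without boto3

    response = boto3.client("batch").submit_job(**request)
    return response["jobId"]


# All names below are placeholders for illustration only.
request = build_batch_job_request(
    job_name="fold-demo",
    job_queue="example-gpu-queue",
    job_definition="example-alphafold2-job",
    fasta_s3_uri="s3://example-bucket/inputs/demo.fasta",
    vcpus=4,
    memory_mib=15000,
    gpus=1,
)
print(request["containerOverrides"]["resourceRequirements"])
```

Keeping the request builder separate from the `submit` call lets you inspect or log the request before anything runs in your account.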

Instances in the backend VPC share an Amazon FSx for Lustre file system, which can scale to hundreds of GB/s of throughput. The container images the solution uses reside in Amazon ECR and are sourced from an AWS CodePipeline pipeline.

Figure 6: The Quantori solution architecture includes two separate VPCs: a frontend VPC with EC2 instances, and a backend VPC with two AWS Batch compute environments. This is the same architecture as used in the protein folding on AWS Solution Guidance - available on GitHub.

Benchmarking

Let’s look at some results from tests Quantori performed across a range of proteins, instance types, and instance sizes.

The goal of these runs was to: (1) show expected performance; (2) demonstrate exemplar testing that you can easily run within the solution; and (3) provide guidance on the best instance types for jackhmmer (a tool that iteratively searches a protein sequence against a sequence database to find evolutionarily related sequences) and AlphaFold2 (an AI system that can predict 3D structures of proteins from amino acid sequences with atomic-level accuracy).

Figure 7 shows the scalability across a range of protein lengths. With the Quantori solution, you can easily run models across multiple proteins.

Figure 7: As expected, the runtime for AlphaFold2 scales with protein length. Quantori tested proteins ranging from 61 amino acids (PDB ID 1B7D) to 601 (4OYS). The solution user interface simplifies selecting and analyzing a wide range of proteins.

Quantori chose a single protein (PDB ID 7FCC) for their next set of tests and ran this on multiple instance types and sizes. Figure 8 shows jackhmmer runtime across two Amazon EC2 instance families (c6g and m7g). The m7g.4xlarge provided faster runtimes due to higher memory compared to c6g.4xlarge, at a similar cost.

Figure 8: Here Quantori compares jackhmmer runtime across two instance families (c6g and m7g) using the UniRef90 dataset. We also show runtime for HHsearch and HHblits for these two instance types. M7g instance types provide faster performance in general due to having higher memory. 

Finally, Quantori compared AlphaFold2 runtime across three sizes of g4dn instances. As shown in Figure 9, the g4dn.4xlarge provides slightly better performance, but the run costs roughly twice as much as on the smaller g4dn.xlarge. Because the larger sizes add vCPUs and memory without adding GPUs, this indicates that AlphaFold2 runtime does not scale linearly with vCPU count.

Figure 9: When comparing AlphaFold2 runtime and cost for three different instance sizes, we see slightly faster performance with the largest size tested (g4dn.4xlarge) but at nearly 2X the cost. The smaller g4dn.xlarge instance size is the best fit for this specific dataset and scenario.

For this specific dataset (UniRef90 with the 7FCC protein), the m7g.4xlarge instance type for jackhmmer and the g4dn.xlarge for AlphaFold2 provide the best price-performance among the instances Quantori tested. With the Quantori solution, you can configure multiple architectures (e.g., x86, aarch64, or GPUs) for a single workflow and compare multiple instance types and sizes.

While these specific instances provide a good starting point, Quantori recommends using the solution to test and optimize for your specific workflow and datasets.
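When you run your own comparison, the cost of a run is its runtime in hours multiplied by the hourly instance price. The Python sketch below shows that arithmetic. The instance labels, runtimes, and prices are placeholders chosen for illustration; they are not Quantori's measurements or current AWS pricing.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Run:
    instance_type: str
    runtime_seconds: float
    hourly_price_usd: float

    @property
    def cost_usd(self) -> float:
        # Cost of one run = runtime in hours x hourly price.
        return self.runtime_seconds / 3600.0 * self.hourly_price_usd


def cheapest(runs: Iterable[Run]) -> Run:
    """Return the run with the lowest cost."""
    return min(runs, key=lambda run: run.cost_usd)


# Placeholder numbers only: substitute your measured runtimes and the
# current price for each instance type in your Region.
runs = [
    Run("small-gpu-instance", runtime_seconds=3600, hourly_price_usd=1.00),
    Run("large-gpu-instance", runtime_seconds=3000, hourly_price_usd=2.00),
]

best = cheapest(runs)
for run in runs:
    ratio = run.cost_usd / best.cost_usd
    print(f"{run.instance_type}: ${run.cost_usd:.2f} per run ({ratio:.2f}x the cheapest)")
```

In this made-up case the larger instance finishes sooner but still costs more per run, which is the same kind of trade-off described above for the g4dn sizes.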

Conclusion

Scientists are using protein engineering models to accelerate drug discovery research.

The Quantori solution allows you to submit jobs with different input datasets and parameters easily. You can monitor jobs and visualize the results, all from a single easy-to-use interface. This solution allows scientists to focus more on discovery and less on maintaining software stacks and environments. And that’s an important way the cloud can contribute immediately to medical science.

For more information, contact Quantori at contact@quantori.com.

Mikhail Serkov

Mikhail Serkov is a senior director, High-Performance Computing (HPC) services at Quantori. He has been working in the HPC field for the last 15 years, focusing on helping life-science clients improve the performance and quality of service of their computational environments.

Konstantinos Tzouvanas

Konstantinos Tzouvanas is a senior enterprise architect on AWS, specializing in data science and AI/ML. He has extensive experience in optimizing real-time decision-making in High-Frequency Trading (HFT) and applying machine learning to genomics research. Known for leveraging generative AI and advanced analytics, he delivers practical, impactful solutions across industries.

Pavlos Kaimakis

Pavlos Kaimakis is a solutions architect at AWS who helps customers design and implement solutions that deliver value. Pavlos has spent most of his career in the product and customer support sector. He loves travelling and is always up for exploring new places in the world.

Marissa Powers

Marissa Powers is a specialist solutions architect at AWS focused on high performance computing and life sciences. She has a PhD in computational neuroscience and enjoys working with researchers and scientists to accelerate their drug discovery workloads. She lives in Boston with her family and is a big fan of winter sports and being outdoors.