AWS Public Sector Blog

The Institut Pasteur is creating a searchable DNA database of all life on Earth using AWS

Where will the next pandemic-causing virus come from? The answer to this pressing question is locked away in the immense diversity of DNA carried by life on Earth. A research team at the Institut Pasteur, a leading international research organization based in Paris, plans to break into that vault of knowledge with IndexThePlanet, an initiative backed by the European Union (EU). The project aims to index the DNA of all living organisms, identify previously unknown virus species, and create a DNA search engine. The Institut Pasteur research team has chosen Amazon Web Services (AWS) to help them investigate all of life’s DNA.

Cataloguing all the organisms you can see, and all that you can’t

Analyzing all of the DNA on the planet is a gargantuan task. The Sequence Read Archive (SRA) is an open repository of collected DNA samples stored on AWS and currently includes more than 19 petabytes of data. A petabyte is a gigantic volume of data, roughly equivalent to a million human genomes. The SRA expands every time scientists collect samples, from leaf litter in the Amazon rainforest, fungi in France, elephant seal dung in the Antarctic, and everywhere in between. Any DNA sequence could contain clues to the next virus to jump species and trigger a pandemic.

Indexing this mind-bending diversity of microscopic DNA poses a challenge for even today’s computers. Current methods of working with the SRA require “enormous informatics resources,” according to Rayan Chikhi, PhD, group leader in computational biology at the Institut Pasteur and IndexThePlanet project lead. “It’s inconceivable that any lab would download tens of petabytes of data,” he adds.

Releasing DNA data to detect virus threats

The solution involves finding and indexing regions of similarity between DNA sequences, which will enable researchers who have identified one characteristic of a virus to find other viruses with similar features. This makes searching Earth’s virome of potential pathogens far more feasible. However, the process of categorization itself demands large-scale compute capacity.
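To make the idea of indexing regions of similarity concrete, here is a minimal sketch in Python. It assumes a simple k-mer inverted index, a common textbook technique in which every substring of length k is mapped to the sequences that contain it. The article does not describe the project's actual algorithm, so this should not be read as the IndexThePlanet method. The function names, the toy sequences, and the choice of k = 5 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(sequences, k):
    """Map each k-mer to the set of sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in sequences.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def search(index, query, k):
    """Rank indexed sequences by how many distinct query k-mers they share."""
    hits = defaultdict(int)
    for kmer in set(kmers(query, k)):
        for name in index.get(kmer, ()):
            hits[name] += 1
    # Most shared k-mers first; ties broken alphabetically by name.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical, not real genomes).
K = 5
sequences = {
    "virus_a": "ATGCGTACGTTAGC",
    "virus_b": "TTGCGTACGAAGCT",
    "bacterium_c": "GGGCCCAAATTTGG",
}
index = build_index(sequences, K)
results = search(index, "GCGTACGTT", K)
print(results)
```

In this toy run the query shares the most k-mers with virus_a, fewer with virus_b, and none with bacterium_c, which therefore does not appear in the results. A researcher who knows one fragment of a virus can look it up in the index instead of scanning every stored sequence, which is the property that makes search feasible at scale.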

Chikhi selected AWS for Logan/IndexThePlanet, a large dataset of DNA and RNA sequences that forms the crucial initial phase of the indexing initiative. Logan involved transforming 19 petabytes of open genomics data on AWS into 2 petabytes of indexed, searchable data. This would have taken more than 3,400 years on a single conventional computer. By running 2.18 million virtual CPUs (vCPUs) on Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances using AWS Graviton, the teams at the Institut Pasteur and AWS processed the data in 30 hours.
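The scale of those figures is easier to appreciate with a little arithmetic. The sketch below uses only the numbers quoted above. It treats 2.18 million vCPUs as if that level were sustained for the full 30 hours, which is an assumption the article does not state, so the vCPU-hour figure should be read as an upper bound rather than a measured total.

```python
HOURS_PER_YEAR = 24 * 365.25  # average year length, including leap years

# Figures quoted in the article.
input_petabytes = 19
output_petabytes = 2
single_machine_years = 3_400
vcpus = 2_180_000
wall_clock_hours = 30

# How much smaller the indexed data is than the raw input.
reduction_factor = input_petabytes / output_petabytes

# Wall-clock speedup relative to the single-computer estimate.
speedup = single_machine_years * HOURS_PER_YEAR / wall_clock_hours

# Upper bound on compute consumed, ASSUMING the quoted vCPU count
# was held for the entire run (not stated in the article).
vcpu_hours_upper_bound = vcpus * wall_clock_hours
vcpu_years_upper_bound = vcpu_hours_upper_bound / HOURS_PER_YEAR

print(f"Data reduction: {reduction_factor:.1f}x")
print(f"Wall-clock speedup: about {speedup:,.0f}x")
print(f"Compute upper bound: {vcpu_hours_upper_bound:,} vCPU-hours "
      f"(about {vcpu_years_upper_bound:,.0f} vCPU-years)")
```

Under these assumptions, the run finished roughly a million times faster than the single-computer estimate and shrank the data by a factor of about 9.5.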

This solution provides the crucial foundation for the Institut Pasteur to develop the DNA equivalent of a modern search engine, which will transform researchers’ ability to tackle emerging virus threats. At this point, only about 0.01 percent of the Earth’s viruses are known to science, and this project aims to expand our knowledge of viruses by at least an order of magnitude.

AWS compute capacity is powering the Institut Pasteur’s mission to change that – and fast.

The results of this project are hosted on the Registry of Open Data on AWS and are now freely available to the entire scientific community.

Feeling inspired? Start exploring the Logan/IndexThePlanet genetic data today.