AWS Public Sector Blog
OpenFold, OpenAlex catalog of scholarly publications, and Capella Space satellite data: The latest open data on AWS
The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). We work with data providers to democratize access to data by making it available to the public for analysis on AWS; develop new cloud-native techniques, formats, and tools that lower the cost of working with data; and encourage the development of communities that benefit from access to shared datasets. Through this program, customers are making over 100 petabytes (PB) of high-value, cloud-optimized data available for public use.
Our full list of publicly available datasets is on the Registry of Open Data on AWS, and the datasets are now also discoverable on AWS Data Exchange. This quarter, we released 15 new or updated datasets, including OpenFold, OpenAlex, and radar data from Capella Space. Check out some highlights:
OpenFold training data for protein structure prediction
OpenFold, an Open Molecular Science Foundation project driven by a private-public consortium including Columbia University, Arzeda, and Cyrus Biotechnology, was developed as a trainable, fully open source improvement on AlphaFold2, which disrupted the protein structure prediction space with its debut in 2021. Its accompanying training dataset is a comprehensive, open source, machine-learning (ML) ready dataset for protein structure prediction.
OpenAlex, an index of the entire scholarly research ecosystem
Launched this quarter, OpenAlex is an open and comprehensive index of the entire scholarly research ecosystem. Named after the ancient Library of Alexandria, the dataset aims to discover, disambiguate, index, and document the connections between all the world’s scholarly papers, journals, authors, institutions, and concepts. In keeping with the theme of openness, the code behind it is all open source, and the data is all permissively licensed and designed to be easy to use in production workloads. Whether you want to understand the impact of a given research area, discover how ideas and authors are linked through time, or build a front-end to help researchers find papers, the data is all there. OpenAlex joins PubMed Central® and CORD-19 as textual repositories collecting scholarly articles across a number of domains.
Capella Space Open Data and Sentinel-1 Single Look Complex (SLC) data for Germany
Two new Synthetic Aperture Radar (SAR) datasets launched this quarter: Capella Space SAR Open Dataset and Sentinel-1 Single Look Complex (SLC) for Germany. Capella Space is providing a growing collection of radar products and formats from its constellation of very high resolution SAR satellites to help further its mission to make Earth observation (EO) an essential tool for problem solving. Sentinel-1 SLC data for Germany includes radar data processed in a format that enables a wide array of applications, including natural hazards and emergency response, oil spill monitoring, and monitoring sea-ice conditions. LiveEO has released the historical archive from 2014 to present from the Alaska Satellite Facility (ASF) DAAC as unzipped files, which makes processing this data substantially more efficient.
Here is the full list of datasets released this quarter, joining the more than 300 datasets already available:
Climate and weather:
- Wave Ensemble Reforecast from the US National Oceanic and Atmospheric Administration (NOAA)
- Unified Forecast System Short-Range Weather (UFS SRW) Application from NOAA
- CMAS Data Warehouse from Community Modeling and Analysis System
Geospatial:
- Synthetic Aperture Radar Data from Capella Space
- Sentinel-1 Single Look Complex (SLC) for Germany by LiveEO
Life sciences:
- OpenFold Training Data from Open Molecular Science Foundation
- Cell Painting Gallery from the Broad Institute
- The Protein Data Bank from the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) at Rutgers University
Machine learning:
- Learning to Rank and Filter – Community Question Answering from Amazon
- Humor patterns used for querying Alexa traffic from Amazon
- 2021 Amazon Last Mile Routing Research Challenge Dataset from Amazon
Renewable energy:
- Prediction of Worldwide Energy Resources (POWER) from the National Aeronautics and Space Administration (NASA)
Space:
- Defense Meteorology Satellite Program (DMSP) Auroral Particle Flux managed by the University of Colorado, Boulder
- Earth Radio Occultation managed by Atmospheric and Environmental Research, Inc.
Statistical and regulatory:
- OpenAlex dataset from OurResearch
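Many datasets in the Registry of Open Data on AWS are hosted in Amazon S3 buckets that allow anonymous reads. As a rough sketch of what a first look at one of them might involve, the Python below builds an unsigned S3 ListObjectsV2 request URL and parses the XML listing it returns, using only the standard library. The bucket name and prefix are placeholders, not names taken from this post; look up the actual bucket, Region, and access terms on each dataset's Registry page. The helper names (list_objects_url, parse_listing, fetch_listing) are made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListObjectsV2 responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_objects_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL for a publicly readable bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def fetch_listing(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of a listing (needs network access)."""
    url = list_objects_url(bucket, prefix, max_keys)
    with urllib.request.urlopen(url) as response:
        return parse_listing(response.read().decode("utf-8"))


# Hypothetical usage; "example-open-data-bucket" is a placeholder name:
# for key, size in fetch_listing("example-open-data-bucket", prefix="data/"):
#     print(key, size)
```

This only reads the first page of results and assumes the bucket permits anonymous listing. For larger jobs, the AWS CLI or an AWS SDK configured for unsigned requests is likely the more practical route.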
We’re excited to see how you can put these great datasets to work. If you have examples of tutorials, applications, tools, or publications that use these datasets, make sure to list them on the Registry of Open Data on AWS so the community can find them. Learn how to propose your dataset to the AWS Open Data Sponsorship Program and learn more about open data on AWS.
Read related stories about AWS and open data:
- Creating access control mechanisms for highly distributed datasets
- Downscaled CMIP5, 1950 US Census, and open genomics data for Galaxy: The latest open data on AWS
- Street-scale global maps, orca sounds, and COVID-19 detection data: The latest open data on AWS
- How to set up Galaxy for research on AWS using Amazon Lightsail
- AWS hosts new open dataset to help businesses identify climate finance risks and investments
- Introducing 10 minute cloud tutorials for research