AWS Open Source Blog
Enabling Scientists to Collaborate with Amazon EKS and Open Science Studio
To enable scientists from around the world to collaborate by sharing data and processes and by generating reproducible results, Navteca chose AWS and Amazon Elastic Kubernetes Service (Amazon EKS) as the foundation for their data platform. Navteca is a contractor supporting U.S. federal civilian agencies such as NASA, NOAA, and USGS, and has collaborated with AWS on this effort in support of the White House Office of Science and Technology Policy (OSTP) Year of Open Science initiative. Open science is a movement that aims to make scientific research more transparent, accessible, and collaborative. It attempts to address several problems in the current scientific research system, including the lack of reproducibility, the publication bias towards positive results, and limited access to research outputs. As part of this effort, AWS and Navteca used numerous open source technologies such as JupyterHub, Dask, Crossplane, and Flux CD.
Navteca has the ultimate goal of creating “scientific models as a service,” a way for researchers around the world to execute common scientific models directly from a familiar interface. “When scientists share their research or models with the community it is often hard to replicate the science because the underlying hardware and software requirements are complex to recreate,” said Ramon Ramirez-Linan, Navteca CTO. For example, to run a model created by another scientist, researchers need to download the dependent libraries, compile the code, and provision sufficient computing resources to run the model, all while meeting stringent security and governance requirements. For specialized IT professionals familiar with high-performance computing (HPC) workloads, this may be a straightforward task. However, scientists, researchers, and students trying to reproduce results may not have the expertise to reliably deploy the required infrastructure and software to the cloud. This makes results hard to reproduce, which can slow the overall progress of research.
As a first step towards this goal, Navteca wanted an open source solution to automate provisioning of DaskHub (JupyterHub with Dask) on demand. Prior to this solution, provisioning a new DaskHub installation could take up to a day and needed manual intervention to reach a working state. With this solution, provisioning all resources takes minutes and requires no manual intervention. This aligns with the ‘Data on EKS’ initiative at AWS, which acknowledges the importance of Big Data and Machine Learning (ML) on Kubernetes to global research agencies and industries, and strives to open source performant architectures that facilitate this work.
Open source components
The first implementation consists of several components: JupyterHub, Dask, Flux CD for GitOps, Crossplane, and Navteca's open source JupyterLab extensions, all hosted on Amazon EKS.
Combining these components creates a highly scalable, multi-tenant data analysis environment that can support many concurrent users. It is also easy to support and can be configured or modified in minutes. It offers a graphical interface that researchers can use to access compute and analytic libraries without having to understand the underlying infrastructure, use the command line, or SSH into an HPC cluster. In addition, this environment can be instantiated in any of the AWS Regions, allowing researchers who may not have access to an expensive on-premises HPC cluster to quickly create a multi-tenant research environment. This environment can then be used for collaborative work by thousands of data scientists and researchers.
If you want to try the Open Science Studio for free and do some data science of your own, you can visit NASA’s website to get started. Anyone can register using just an email address and spin up a notebook that can be used for many kinds of analysis.
Navteca is also developing additional JupyterLab extensions for the scientific community. These include Bucket Explorer (bexplorer), which allows users to browse private datasets in Amazon Simple Storage Service (Amazon S3) as well as Open Data on AWS, and API Baker, which uses Amazon API Gateway and AWS Lambda to turn any Jupyter notebook into a secure API endpoint.
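To make the idea of browsing a bucket concrete, here is a minimal sketch of how a flat list of object keys can be presented as folders and files, in the same spirit as delimiter-based listing in Amazon S3. It is not Navteca's Bucket Explorer code; the function name `list_level` and the sample keys are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from typing import Iterable


def list_level(
    keys: Iterable[str], prefix: str = "", delimiter: str = "/"
) -> tuple[list[str], list[str]]:
    """Return (folders, files) found directly under `prefix`.

    Object storage keys are flat strings; a browser shows them as a tree by
    grouping everything up to the next delimiter into a "folder".
    """
    folders: set[str] = set()
    files: list[str] = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # key lives outside the level being browsed
        rest = key[len(prefix):]
        if not rest:
            continue  # the key is the prefix itself (a folder marker)
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so `head` is a sub-folder.
            folders.add(prefix + head + delimiter)
        else:
            files.append(key)
    return sorted(folders), sorted(files)


if __name__ == "__main__":
    # Hypothetical object keys, for illustration only.
    sample_keys = ["models/run1/output.nc", "models/readme.md", "index.csv"]
    print(list_level(sample_keys))
    print(list_level(sample_keys, prefix="models/"))
```

A real extension would fetch the keys from Amazon S3 instead of a hard-coded list, but the grouping step would look much the same.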
Solution walkthrough
To deploy the solution in your own AWS account, follow these high-level steps:
- First, create an Amazon EKS cluster and deploy Crossplane with the AWS Crossplane providers and a GitOps engine (for example, Flux CD or Argo CD). You can find an example here.
- Deploy a Crossplane Composition that will reconcile a new instance of the scientific research solution. This includes an Amazon EKS cluster, the Helm chart for DaskHub (which includes JupyterHub), and an Amazon Cognito user pool. You can find Navteca’s composition, along with instructions on how to deploy it, here.
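Once a Composition is installed, a new environment is typically requested by creating a small Kubernetes resource (a Crossplane claim) that the Composition reconciles. The sketch below builds such a claim as JSON, which `kubectl` accepts alongside YAML. The API group `example.org/v1alpha1`, the kind `ScienceEnvironment`, the `parameters.region` field, and the label selector are all hypothetical placeholders. The real names are defined by Navteca's composition and are not given in this post.

```python
import json


def build_claim(name: str, region: str, namespace: str = "default") -> dict:
    """Build a Crossplane claim manifest as a plain dict.

    The apiVersion, kind, and spec.parameters values are hypothetical;
    substitute the ones defined by the composition you actually deploy.
    """
    return {
        "apiVersion": "example.org/v1alpha1",  # hypothetical API group/version
        "kind": "ScienceEnvironment",  # hypothetical claim kind
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Hypothetical parameter exposed by the composite resource definition.
            "parameters": {"region": region},
            # Selects which Composition satisfies the claim (labels are assumed).
            "compositionSelector": {"matchLabels": {"provider": "aws"}},
        },
    }


if __name__ == "__main__":
    # Print the manifest so it can be piped to kubectl or committed to Git.
    print(json.dumps(build_claim("team-a-daskhub", "us-east-1"), indent=2))
```

With a GitOps engine in place, a manifest like this would usually be committed to the repository that the engine watches, rather than applied by hand.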
Working together, AWS and Navteca used Crossplane running on Amazon EKS to rapidly create a shared DaskHub environment where users can collaborate on data science and research. In the DaskHub environment, end users can not only run analyses themselves but also share an analysis and its results so that it can be reproduced, verified, and understood by others.
Creating infrastructure through Kubernetes and Crossplane makes it possible to build a robust and performant shared services platform. This platform can be used for many workloads in addition to the DaskHub workload discussed in this blog post. It is our hope that the open source code used for this effort will not only allow organizations across the world to experience the positive outcomes from this specific workload, but also help promote the use of Amazon EKS and Crossplane in creating shared services platforms for a wide range of workloads.
Conclusion
We look forward to continuing to collaborate to help empower scientists, students, and researchers to do great things with open science. The work that can be done with these tools is important, and having the opportunity to contribute to that work is meaningful. We wish NASA and Navteca luck in their future pursuits and look forward to what cutting-edge infrastructure built on AWS will enable scientists to do.