AWS Public Sector Blog
How the Imaging Data Commons migrated 40 million medical images using AWS DataSync
The volume of medical imaging data that is accessible for research and analysis continues to expand at a rapid rate. Providing the medical imaging research community with simple, equitable access to the data they need accelerates research and speeds the time to actionable insights for patients. The scalable storage and compute capabilities of the cloud can provide researchers with a rich platform of opportunities for data analysis.
The mission of the National Cancer Institute (NCI) Imaging Data Commons (IDC) team is to simplify the discovery and analysis of relevant cancer imaging data using state-of-the-art tools in the cloud. The IDC team is responsible for providing scientists and researchers with a comprehensive collection of cancer imaging data and tooling for analysis.
To give users the flexibility to work with the data in whichever service they choose, the data should be available across multiple cloud providers. To maintain the integrity of data mirrors across cloud providers while keeping up with regular updates to the IDC data offering, we needed a way to transfer data easily between cloud providers. When we assessed the magnitude of this task, we immediately saw a blocker: the IDC currently holds more than 60 terabytes (TB) of data, composed of roughly 40 million objects. We needed to move the IDC dataset efficiently and verify its integrity without taking time away from our core development responsibilities. AWS DataSync from Amazon Web Services (AWS) is an online data movement service. When it launched the ability to transfer data across cloud environments, it gave us an opportunity to unblock our use case with minimal disruption to the core team.
In this blog post, learn how the IDC team migrated the IDC data to AWS using AWS DataSync. Plus, learn how to get started with IDC data, which is accessible at no cost through the AWS Open Data Sponsorship Program.
How the IDC is enabling the research community
IDC provides unparalleled access to a large and continuously growing collection of standardized cancer imaging data, co-located with powerful analysis tools and scalable compute resources.
Access to this large collection of data opens unique opportunities in scientific discovery, integrative analysis of imaging data in the context of other data types, benchmarking, and refinement of rapidly evolving artificial intelligence (AI) capabilities. The complexity and scale of cancer imaging data, however, can exceed the resources available at any single organization, both for efficiently searching and exploring large image collections and for analyzing and sharing the resulting artifacts. Alongside the growing image data content, IDC provisions and maintains tooling to navigate the complex metadata accompanying the images, visualize the images, and ease the learning curve and challenges of adopting the cloud for analysis tasks.
Transferring data across clouds with AWS DataSync
When AWS DataSync launched the ability to migrate data between Google Cloud Storage (GCS) and AWS storage services, the new capability met the needs of the IDC team. We outlined an architecture for moving data from GCS by deploying a DataSync agent on an Amazon Elastic Compute Cloud (Amazon EC2) instance. The source GCS bucket held roughly 40 million objects in a flat namespace, totaling roughly 63 TB. Based on this assessment, the team ran the agent on an m5.4xlarge EC2 instance and created a single DataSync task with a GCS source location and an Amazon Simple Storage Service (Amazon S3) bucket as the destination. The task was configured not to copy object tags from the source object storage location, because GCS does not support getting object tags through the XML API.
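The setup described above can be sketched with the AWS SDK for Python (boto3). This is a minimal illustration, not the IDC team's actual script. The helper names, the bucket and ARN arguments, and the use of a GCS HMAC key pair are assumptions. The DataSync client methods and the `ObjectTags` option are part of the boto3 DataSync API, and `storage.googleapis.com` is the GCS XML API endpoint.

```python
"""Sketch: one DataSync task copying a GCS bucket to Amazon S3 without object tags."""

GCS_XML_API_HOST = "storage.googleapis.com"  # GCS endpoint for its XML API


def gcs_source_location_params(bucket, agent_arn, hmac_access_key, hmac_secret):
    """Arguments for create_location_object_storage, treating GCS as object storage."""
    return {
        "ServerHostname": GCS_XML_API_HOST,
        "ServerProtocol": "HTTPS",
        "ServerPort": 443,
        "BucketName": bucket,
        "AccessKey": hmac_access_key,  # assumption: a GCS HMAC key is used
        "SecretKey": hmac_secret,
        "AgentArns": [agent_arn],      # the DataSync agent running on EC2
    }


def s3_destination_location_params(bucket_arn, access_role_arn):
    """Arguments for create_location_s3 (the destination bucket)."""
    return {
        "S3BucketArn": bucket_arn,
        "S3Config": {"BucketAccessRoleArn": access_role_arn},
    }


def task_params(source_location_arn, destination_location_arn, name):
    """Arguments for create_task; ObjectTags=NONE skips copying object tags."""
    return {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Name": name,
        "Options": {"ObjectTags": "NONE"},
    }


def create_and_start_transfer(datasync, source_params, destination_params, name):
    """Create both locations and the task, then start one execution.

    `datasync` is expected to be boto3.client("datasync").
    """
    src = datasync.create_location_object_storage(**source_params)["LocationArn"]
    dst = datasync.create_location_s3(**destination_params)["LocationArn"]
    task_arn = datasync.create_task(**task_params(src, dst, name))["TaskArn"]
    return datasync.start_task_execution(TaskArn=task_arn)["TaskExecutionArn"]
```

Keeping the request parameters in plain functions makes them easy to review before any API call is made. Every ARN, bucket name, and credential passed in would come from your own environment.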
The IDC team successfully completed the initial data transfer with DataSync in less than 41 hours. The total cost to transfer 63 TB of data using the DataSync service and run the supporting m5.4xlarge EC2 instance for 41 hours was about $810. DataSync provided the simplified solution we needed to help us move forward with the v14 launch of the IDC dataset on the Registry of Open Data on AWS.
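To put these figures in perspective, the short calculation below derives average rates from the numbers reported in this post. It is a back-of-the-envelope sketch that assumes decimal terabytes (1 TB = 10^12 bytes) and uses the full 41 hours as the duration. Because the transfer finished in less than 41 hours, the resulting rates are lower bounds.

```python
TOTAL_BYTES = 63 * 10**12    # 63 TB, assuming decimal terabytes
TOTAL_OBJECTS = 40_000_000   # roughly 40 million objects
DURATION_S = 41 * 3600       # upper bound: the transfer took less than 41 hours
TOTAL_COST_USD = 810         # approximate total reported in this post


def transfer_summary(total_bytes, total_objects, duration_s, total_cost_usd):
    """Derive average throughput, object rate, object size, and unit cost."""
    return {
        "throughput_mb_per_s": total_bytes / duration_s / 1e6,
        "throughput_gbit_per_s": total_bytes * 8 / duration_s / 1e9,
        "objects_per_s": total_objects / duration_s,
        "mean_object_mb": total_bytes / total_objects / 1e6,
        "usd_per_tb": total_cost_usd / (total_bytes / 1e12),
    }


if __name__ == "__main__":
    summary = transfer_summary(TOTAL_BYTES, TOTAL_OBJECTS, DURATION_S, TOTAL_COST_USD)
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

With these inputs, the averages come to roughly 427 MB/s (about 3.4 Gbit/s), about 271 objects per second, a mean object size near 1.6 MB, and about $12.86 per TB transferred.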
Get started with the Imaging Data Commons
To get started, explore the large, comprehensive, and expanding collection of cancer research data that the IDC offers by checking out the Imaging Data Commons user guide. There, you can learn to build a manifest for retrieving IDC data from Amazon S3. You can retrieve data through the AWS Command Line Interface (AWS CLI), the Amazon S3 console, or an Amazon S3 URL. Finally, start working with IDC data on AWS with this introductory tutorial, which shows you how to use IDC data with Amazon services such as AWS HealthImaging, Amazon SageMaker, Amazon Athena, and AWS Glue.
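As a small illustration of the URL-based retrieval option, the sketch below converts `s3://` URIs into virtual-hosted-style HTTPS URLs. It is not taken from the IDC user guide. The one-URI-per-line manifest layout and the bucket name `example-idc-bucket` are hypothetical, and fetching the resulting URLs without credentials assumes the bucket allows anonymous reads.

```python
from urllib.parse import quote


def s3_uri_to_https(s3_uri, region=None):
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {s3_uri!r}")
    host = f"{bucket}.s3.{region}.amazonaws.com" if region else f"{bucket}.s3.amazonaws.com"
    # Percent-encode the key but keep "/" so the object path stays readable.
    return f"https://{host}/{quote(key, safe='/')}"


def manifest_to_urls(manifest_text, region=None):
    """Read a hypothetical manifest with one s3:// URI per line.

    Blank lines and lines starting with '#' are skipped.
    """
    urls = []
    for line in manifest_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(s3_uri_to_https(line, region))
    return urls


if __name__ == "__main__":
    example = "# hypothetical manifest\ns3://example-idc-bucket/series-1/image-1.dcm\n"
    for url in manifest_to_urls(example):
        print(url)
```

For bulk retrieval of many objects, the AWS CLI route mentioned above is usually the more practical choice. A helper like this is mainly useful when a tool accepts only HTTPS URLs.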
Learn more about AWS DataSync and how you can enable your multi-cloud data movement.