AWS Storage Blog
King Hamad University Hospital and Bahrain Oncology Center use Amazon S3 to store millions of medical images
King Hamad University Hospital (KHUH) and the Bahrain Oncology Center together form a 600-bed hospital in Bahrain. Over a period of eleven years, KHUH accumulated around 1 million medical image studies, resulting in around 476 million files with a total data volume of 48 TB.
Read part 1 of the blog series to learn about the architecture King Hamad University Hospital and Bahrain Oncology Center implemented for long-term storage of medical image data. In this part 2 of the blog series, we dive into how KHUH built a cost estimate to understand how costs are expected to develop as the data grows.
In this blog, we start with a discussion of the approach taken for assessing the current on-premises storage system at KHUH. We then show how the data collected during the assessment allowed KHUH to build a detailed cost estimate. We conclude by presenting opportunities for cost optimizations that were identified during the storage assessment and cost estimate analysis. With these optimizations, KHUH was able to reduce archive storage cost by 33%.
How KHUH analyzed their storage and developed trends
During a proof of concept, KHUH evaluated two Amazon S3 storage classes for long-term storage: Amazon S3 Glacier Instant Retrieval and Amazon S3 Glacier Flexible Retrieval. Amazon S3 Glacier Instant Retrieval is an archive storage class that delivers the lowest-cost storage for long-lived data that is rarely accessed and requires retrieval in milliseconds. Amazon S3 Glacier Flexible Retrieval is suitable for archiving data that does not require immediate access but needs the flexibility to retrieve large sets of data at no cost.
As KHUH evaluated the storage classes, they learned S3 Glacier Instant Retrieval would be ideal for their use case as it would enable them to retrieve medical imaging data within milliseconds when needed. This would simplify patient consultations at KHUH as doctors would not need to retrieve patient data prior to appointments.
However, the two storage classes differ in their storage and retrieval costs, and KHUH wanted to better understand those differences. S3 Glacier Instant Retrieval has higher storage costs than S3 Glacier Flexible Retrieval in exchange for data retrieval in milliseconds. On the other hand, the costs for PUT, COPY, POST, LIST, and Lifecycle Transition requests are higher for S3 Glacier Flexible Retrieval.
Before committing to either storage class, KHUH wanted a complete view of the cost of the services used in the architecture, and of what influences this cost over time.
Initial data assessment
To estimate the cost of storing data in AWS, KHUH performed an assessment of the data stored on the file server used by their picture archiving and communication system (PACS) server. Information about file quantities and data volume was collected using a few Windows PowerShell commands.
It was a requirement to get a view of the total number and total size of files smaller and larger than 40 KB, as well as smaller and larger than 128 KB. The reasons for these thresholds are: 1/ Amazon S3 does not transition objects smaller than 128 KB from the S3 Standard to the S3 Glacier Instant Retrieval storage class (see the ‘Constraints’ section in Transitioning objects using Amazon S3 Lifecycle), and 2/ S3 Glacier Flexible Retrieval has a minimum billable object size of 40 KB; objects smaller than 40 KB may be stored but are charged for 40 KB of storage (see the S3 FAQs). To simplify the cost estimation, we assumed that files smaller than 40 KB will remain in the S3 Standard storage class.
To get this information, the commands below were run against each of the data volumes used by the PACS server.
Get total number of files smaller than 128 KB and their total storage:
Get-ChildItem . -recurse | where-object {$_.length -lt 128000} | Measure-Object -Property Length -Sum | Select-Object Sum, Count
Get total number of files smaller than 40 KB and their total storage:
Get-ChildItem . -recurse | where-object {$_.length -lt 40000} | Measure-Object -Property Length -Sum | Select-Object Sum, Count
Get total number of files larger than 128 KB and their total storage:
Get-ChildItem . -recurse | where-object {$_.length -gt 128000} | Measure-Object -Property Length -Sum | Select-Object Sum, Count
Get total number of files larger than 40 KB and their total storage:
Get-ChildItem . -recurse | where-object {$_.length -gt 40000} | Measure-Object -Property Length -Sum | Select-Object Sum, Count
The Get-ChildItem cmdlet is run at the root directory of the volume. It runs recursively through all directories (Get-ChildItem . -recurse) and looks for file objects of a certain size (where-object {$_.length -lt <size in bytes>}). The comparison operators -lt and -gt filter for files smaller than (“less than”) and larger than (“greater than”) a given size in bytes. The file sizes are then summed up and the number of files is counted (Measure-Object -Property Length -Sum | Select-Object Sum, Count).
A typical output looks as follows:
Get-ChildItem . -recurse | where-object {$_.length -lt 128000} | Measure-Object -Property Length -Sum | Select-Object Sum, Count

      Sum Count
      --- -----
173994230 14178
Note that the Get-ChildItem cmdlet consumes a large amount of RAM on the Windows file server when run against large volumes. For this reason, the command was run in batches. KHUH’s PACS system organizes image data in a folder hierarchy of <year>/<month>/<day>/<study ID>, so the Get-ChildItem cmdlet was run against each year folder to limit the amount of RAM consumed. To avoid performance impact for end users, all commands were run outside normal business hours.
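For reference, here is a minimal sketch of that batched approach. It assumes a hypothetical volume root of D:\PACS (not KHUH’s actual path) and the year-based folder layout described above, and writes one result row per year folder to a CSV file that can be consolidated in Excel:

# Count and sum all files below a size threshold, one year folder at a time
$volumeRoot = 'D:\PACS'      # hypothetical root of one PACS data volume
$threshold  = 128000         # size threshold in bytes (use 40000 for the 40 KB runs)
$results = foreach ($yearFolder in Get-ChildItem $volumeRoot -Directory) {
    $small = Get-ChildItem $yearFolder.FullName -Recurse -File |
        Where-Object { $_.Length -lt $threshold } |
        Measure-Object -Property Length -Sum
    [pscustomobject]@{
        Year       = $yearFolder.Name
        FileCount  = $small.Count
        TotalBytes = $small.Sum
    }
}
$results | Export-Csv -Path .\files-smaller-than-threshold.csv -NoTypeInformation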
The results of the batches were collected in Excel and provided the quantities required for the economic considerations discussed in the next section.
In summary, the KHUH PACS data comprised 48 TB across more than 476 million files.
These files represent around 1 million medical image studies KHUH has accumulated since 2011. Medical image studies are not generally accessed after a year of storage. Specifically, KHUH noted that only 10% of their older medical image studies (those produced before 2019) were accessed in 2021. The assessment demonstrated that the majority of medical images can be stored in a long-term archive.
Economic considerations
During the proof-of-concept phase, KHUH evaluated two architectures with different Amazon S3 storage classes. Although using the Amazon S3 Glacier Instant Retrieval storage class simplified the architecture, the storage cost is around 23% higher compared to Amazon S3 Glacier Flexible Retrieval (as of November 2022 for the Middle East (Bahrain) Region).
To make an informed decision on which architecture to implement in a production environment, KHUH built a cost estimate based on the actual access patterns of the existing medical image studies in their PACS system.
The diagram below depicts the different cost elements as a file passes through the lifecycle stages.
Figure 1: Lifecycle stage cost elements
The table below walks through the lifecycle stages shown in the diagram above, reviews the cost elements, and details the actual quantities that KHUH applied in their cost estimate. For a more detailed discussion of the architecture components, refer to part 1 of this blog series.
Lifecycle stage | Cost element and quantities |
1 | S3 File Gateway – data written: The Amazon S3 File Gateway cost is dependent on the amount of data written by the gateway. There is no charge for the S3 File Gateway itself. There is a charge only for the amount of data written to AWS. This is beneficial for the secondary File Gateway. As this is a server on standby, no cost is incurred for this second gateway. KHUH quantities: The total data volume written by the S3 File Gateway in year one is 48 TB. For subsequent years we assumed 10% data growth, meaning 10% of the previous year’s total volume is new data written by the S3 File Gateway. |
2 | S3 Standard – PUT requests: You pay for requests made against your S3 buckets and objects. S3 request costs are based on the request type and are charged on the quantity of requests. KHUH quantities: In year one, all 476 million files will be written to S3, resulting in an equivalent number of PUT requests. For subsequent years, similar to data volume, we assumed 10% growth in the number of files; only the new files will be written to S3 after year one. |
3A & 3B | S3 Standard – storage cost: All files replicated by S3 File Gateway to the S3 bucket will be stored in the S3 Standard storage class by default. A lifecycle policy will move the files to S3 Glacier Instant Retrieval (architecture option A) or S3 Glacier Flexible Retrieval (architecture option B). Not all files will be moved to an S3 Glacier storage class. Amazon S3 does not transition objects smaller than 128 KB from the S3 Standard storage class to the S3 Glacier Instant Retrieval storage class (see ‘Constraints’ section in Transitioning objects using Amazon S3 Lifecycle). For each object that is stored in S3 Glacier Flexible Retrieval, Amazon S3 adds 40 KB of chargeable overhead for metadata, with 8 KB charged at S3 Standard rates and 32 KB charged at S3 Glacier Flexible Retrieval rates. To simplify the cost estimation, we assumed, for architecture option B, that files smaller than 40 KB will remain in the S3 Standard storage class. The labels in the diagram represent the different amount of data that will be stored in the S3 Standard storage class depending on the architecture option: (3A) option A, all files smaller than 128 KB; (3B) option B, all files smaller than 40 KB. KHUH quantities: Option A (3A) – S3 Glacier Instant Retrieval: the total data volume for files smaller than 128 KB is 6 TB. Option B (3B) – S3 Glacier Flexible Retrieval: the total data volume for files smaller than 40 KB is 3.2 TB. |
4A & 4B | Lifecycle requests: When transitioning files from S3 Standard to the S3 Glacier storage classes, S3 Lifecycle Transition request fees apply. The labels in Figure 1 above indicate the different number of files that will be moved by S3 Lifecycle Transition depending on the architecture option: 4A – all files larger than 128 KB; 4B – all files larger than 40 KB. KHUH quantities: Option A (4A) – S3 Glacier Instant Retrieval: The total number of files larger than 128 KB that are transitioned is 89 million. Option B (4B) – S3 Glacier Flexible Retrieval: The total number of files larger than 40 KB that are transitioned is 125 million. |
5A | S3 Glacier Instant Retrieval – storage cost: The storage cost for architecture option A. KHUH quantities: The total data volume for files larger than 128 KB is 42 TB. |
5B | S3 Glacier Flexible Retrieval – storage cost: The storage cost for architecture option B. KHUH quantities: The total data volume for files larger than 40 KB is 44.8 TB. |
6A | S3 Glacier Instant Retrieval – retrieval cost: The charge per GB for the data volume returned in architecture option A. KHUH quantities: During the storage assessment, KHUH found that only 7% of historic PACS data is retrieved annually. Therefore, the number of files retrieved is 89 million files x 7% / 12 months = 519,167 files per month, which corresponds to about 0.24 TB per month. |
6B | S3 Glacier Flexible Retrieval – retrieval cost: The charge per GB for the data volume returned in architecture option B. For the discussion in this blog post, we estimated the cost for the Bulk retrieval option. This option comes with a 5–12 hour retrieval time, but has the benefit of being free of charge. KHUH quantities: The number of files retrieved from the archive is 125 million files x 7% / 12 months = 729,167 files per month, which corresponds to about 0.26 TB per month. |
7A | S3 Glacier Instant Retrieval – GET requests: You pay for requests made against your S3 buckets and objects. For the cost estimate, we assume one GET request per file. KHUH quantities: The number of files retrieved from the archive is 89 million files x 7% / 12 months = 519,167 files, which equals 519,167 GET requests per month. |
7B | S3 Standard – GET requests & storage: Objects restored from S3 Glacier Flexible Retrieval will be placed in the S3 Standard storage class. From S3 Standard, files are replicated back to the S3 File Gateway. KHUH quantities: The number of files retrieved from the archive is 125 million files x 7% / 12 months = 729,167 files, which equals 729,167 GET requests per month. The data volume retrieved every month is 44.8 TB x 7% / 12 months = 0.26 TB per month. |
8 | Retrieved data egress: Retrieved data is sent from AWS to the S3 File Gateway which incurs outbound data transfer fees. KHUH quantities: The data volume transferred from the archive in AWS back to the KHUH data center has been estimated as follows: 48 TB x 7% / 12 months = 0.28 TB retrieved per month. |
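The lifecycle transitions in stages 3 and 4 are driven by an S3 Lifecycle rule. As an illustration of option A, the following is a minimal sketch of such a rule with an object-size filter; the bucket name and the 30-day transition delay are assumptions for the example, not KHUH’s actual configuration:

# Hypothetical lifecycle rule for option A: move objects larger than 128 KB
# to S3 Glacier Instant Retrieval 30 days after creation
$lifecycleRule = @'
{
  "Rules": [
    {
      "ID": "archive-pacs-images",
      "Status": "Enabled",
      "Filter": { "ObjectSizeGreaterThan": 131072 },
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER_IR" } ]
    }
  ]
}
'@
Set-Content -Path .\lifecycle.json -Value $lifecycleRule
# Apply with the AWS CLI, for example:
# aws s3api put-bucket-lifecycle-configuration --bucket khuh-pacs-archive --lifecycle-configuration file://lifecycle.json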
AWS Pricing Calculator was used to calculate the price for each cost element.
It is important to note that AWS Pricing Calculator presents each line item as a monthly cost and calculates a yearly cost by simply multiplying the sum of all line items by 12. For the KHUH cost estimate, this calculation is only applicable to the first year, the year in which all data is migrated to AWS.
To extrapolate the cost for years two to five, a data growth of 10% per year was assumed.
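As a rough illustration of this extrapolation, the sketch below applies the assumed 10% annual growth to the year-one quantities from the assessment; the per-year cost figures themselves come from the pricing calculator, not from this script:

# Extrapolate data volume and file count for five years under the 10% growth assumption
$volumeTB = 48.0       # year-one data volume
$files    = 476e6      # year-one file count
for ($year = 1; $year -le 5; $year++) {
    "Year {0}: {1:N1} TB, {2:N0} files" -f $year, $volumeTB, $files
    $volumeTB *= 1.10
    $files    *= 1.10
}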
Figure 2: Cost projection for five years, comparing S3 Glacier Instant Retrieval vs. S3 Glacier Flexible Retrieval
The major reason for the cost difference between options A and B in the first year is the cost of lifecycle requests. In the first year, all PACS data is migrated to AWS and the whole dataset is subject to lifecycle transitions. The PACS system stores a large number of small files (the average size of the files transitioned to archive is around 350 KB). Since the per-object transition and metadata costs are higher for S3 Glacier Flexible Retrieval, S3 Glacier Instant Retrieval is more economical for small objects.
Note that the charge for writing data via the S3 File Gateway is capped at $125 per month. This is equivalent to 12.5 TB of data transferred through the S3 File Gateway; any data beyond 12.5 TB in a month is transferred free of charge. In the cost estimate, it has been assumed that the data is transferred evenly across 12 months. If the data were migrated to AWS within a single month, the cost would still be capped at $125.
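A quick way to see the effect of this cap, using the $0.01 per GB gateway write rate implied by the $125 and 12.5 TB figures above:

# Monthly S3 File Gateway write charge when 48 TB is spread evenly over 12 months
$ratePerGB   = 0.01                 # USD per GB written, implied by the cap above
$monthlyCap  = 125                  # USD per gateway per month
$monthlyGB   = 48TB / 12 / 1GB      # roughly 4,096 GB per month
$monthlyCost = [Math]::Min($monthlyGB * $ratePerGB, $monthlyCap)
"{0:N0} GB written per month results in a charge of about {1:C0}" -f $monthlyGB, $monthlyCost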
The number of files has a significant impact on the lifecycle transition costs, as well as on the additional storage costs for user-defined names and metadata in S3 Glacier Flexible Retrieval. With fewer objects, KHUH could reduce their request costs. This observation led KHUH to seek optimizations in their current PACS solution, which are presented in the next section.
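To see why the object count matters so much for option B, consider the 40 KB per-object overhead described in stage 3 above; a rough calculation for KHUH’s numbers:

# Per-object overhead for S3 Glacier Flexible Retrieval (option B)
$objectsOptionB = 125e6              # files larger than 40 KB transitioned to the archive
$overheadTB     = $objectsOptionB * 40KB / 1TB
"Metadata overhead for option B: about {0:N1} TB on top of the 44.8 TB of image data" -f $overheadTB
# Around 4.7 TB of chargeable overhead (8 KB per object at S3 Standard rates,
# 32 KB per object at S3 Glacier Flexible Retrieval rates), plus 125 million
# lifecycle transition requests in year one - both scale with object count, not data volume.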
Cost optimization
While conducting the storage assessment, KHUH found that the PACS system stores a large number of small files; around 74% of files are smaller than 40 KB.
After consulting the PACS vendor, KHUH determined that these small files exist to enhance the end-user experience: they are thumbnails that allow fast preview loads in the PACS viewer while the actual image series is retrieved in the background. These smaller files do not need to be transferred to AWS because the PACS recreates them when medical image studies are restored from the archive.
As a result of these findings, KHUH, with the support of the PACS vendor, reorganized the on-premises storage system. KHUH added additional storage volumes, dedicated to temporary files, to the local file server. In the PACS system, a rule was configured that places temporary files and files required for long-term storage on separate volumes. Only the long-term storage volumes are then replicated to AWS via the S3 File Gateway.
With the data collected during the storage assessment, it was possible to assess the cost impact of excluding the temporary files from long-term storage.
For the cost estimate discussed below, files smaller than 40 KB are considered temporary files and are excluded from long-term storage.
 | All files | Excluding temporary files |
Number of files: | 476,000,000 | 125,000,000 |
Data volume: | 48 TB | 44.8 TB |
Figure 3: Number of files and data volume with and without temporary files
Although the data volume was reduced by only 7%, the number of files was reduced by 74%.
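These percentages follow directly from the quantities in Figure 3:

# Reduction in file count and data volume when temporary files are excluded
$allFiles = 476e6; $keptFiles = 125e6
$allTB    = 48;    $keptTB    = 44.8
"File count reduced by {0:P0}"  -f (1 - $keptFiles / $allFiles)   # about 74%
"Data volume reduced by {0:P0}" -f (1 - $keptTB / $allTB)         # about 7%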
As a result, the total cost for five years has been reduced.
Figure 4: Optimized cost after excluding temporary files that don’t need to be archived
The reduction in the five-year cost is 28% for option A and 25% for option B, although the data volume was reduced by only 7%. The cost reduction mainly stems from removing the large number of small temporary files, which in turn reduced the S3 request and storage costs.
This evaluation shows that not only the total data volume, but also the number of files should be taken into consideration for file migrations to AWS. This is especially true for datasets with a large number of files, as in KHUH’s case.
Conclusion
With the results from the optimized cost estimate, KHUH was able to make an informed decision in favor of using the S3 Glacier Instant Retrieval storage class for long-term archive. Not only is this storage class the most cost-effective over a five-year period, but it also offers the benefit of retrieving medical imaging data within milliseconds when needed. This allows doctors to spend more time with their patients and use their time more efficiently, as they do not need to request retrievals of patient data from the archive prior to patient appointments.
While working on the cost estimate, KHUH learned that a storage assessment can uncover optimization potential for an existing on-premises installation. KHUH, for example, found that a large number of files are only temporary and as a result, reorganized their on-premises storage. Temporary files are kept on separate volumes and are excluded from the archive process.
We hope the approach outlined in this blog series provides you with guidance on how to use AWS for long-term archival of medical image data. Using Amazon S3 File Gateway enables a hybrid architecture that reduces the need for changes to an existing PACS. Assessing the on-premises storage systems and building a cost estimate ensures the right architectural decisions are made for a cost-effective solution. Lastly, involve your PACS vendor during the assessment to uncover potential optimizations that can be implemented before migrating to the cloud.
Thanks for reading this post! If you have any comments or questions, leave them in the comments section.