Automatically archive and restore data with Amazon S3 Intelligent-Tiering

Customers of all sizes, in all industries, are using data lakes to transform data from a cost that must be managed into a business asset. From time to time, data scientists and business analysts need to restore subsets of historical datasets for longitudinal studies, machine learning retraining, and more. However, users commonly write queries that don’t account for which objects are not immediately accessible, leading to unexpected behavior within their applications.

Today’s data lake applications generate massive quantities of data stored in Amazon S3, the largest and most performant object storage service of choice for over 200,000 data lakes. To control costs as data continues to grow, customers are using the Amazon S3 Intelligent-Tiering storage class by default for their data lake applications because it delivers automatic storage cost savings as access patterns change, with no impact on performance or operational overhead.

To save up to 95% on storage costs for data that is not accessed for months, or even years, at a time, customers are increasingly using the optional asynchronous Archive Access and Deep Archive Access tiers within the S3 Intelligent-Tiering storage class. At the same time, customers want a solution to automate data restores when they query objects that are not immediately accessible in the S3 Intelligent-Tiering Archive Access and Deep Archive Access tiers.

In this blog post, we share a solution to automate data restores in response to a GET call whenever the objects are not immediately accessible in the optional Archive and Deep Archive Access tiers. Before we get started with solution implementation, we provide you with a quick recap of S3 Intelligent-Tiering and the asynchronous archive access tiers.

Recap of how S3 Intelligent-Tiering works

The Amazon S3 Intelligent-Tiering storage class automatically stores objects in three access tiers: one tier optimized for frequent access, a lower-cost tier optimized for infrequent access, and a very-low-cost tier optimized for rarely accessed data. For a small monthly object monitoring and automation charge, S3 Intelligent-Tiering moves objects that have not been accessed for 30 consecutive days to the Infrequent Access tier for savings of 40%; and after 90 days of no access, they’re moved to the Archive Instant Access tier with savings of 68%. If the objects are accessed later, S3 Intelligent-Tiering moves the objects back to the Frequent Access tier.

How Amazon S3 Intelligent-Tiering works with the automatic access tiers

To save even more on data that doesn’t require immediate retrieval, you can activate the optional asynchronous Archive Access and Deep Archive Access tiers within S3 Intelligent-Tiering. When turned on, objects not accessed for 90 days are moved directly to the Archive Access tier for savings of 71%, and to the Deep Archive Access tier after 180 days with up to 95% in storage cost savings. Objects stored in the optional Archive Access or Deep Archive Access tiers are not immediately accessible: before you can retrieve them, you must first restore them using RestoreObject. Once the restore completes, S3 Intelligent-Tiering moves the objects back to the Frequent Access tier. Read more about restoring archived objects.

How Amazon S3 Intelligent-Tiering works with both opt-in asynchronous archive access tiers
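
As a rough illustration of the transitions described above, the following Python sketch maps the number of consecutive days without access to the tier an object would be expected to occupy. The function name and parameters are hypothetical; the sketch only encodes the thresholds stated in this post (30 and 90 days for the automatic tiers, configurable thresholds for the opt-in tiers) and does not call any AWS API.

```python
from typing import Optional


def expected_access_tier(
    idle_days: int,
    archive_after: Optional[int] = None,
    deep_archive_after: Optional[int] = None,
) -> str:
    """Return the tier expected after `idle_days` consecutive days without access.

    `archive_after` and `deep_archive_after` are the configured thresholds for
    the opt-in tiers; None means the corresponding tier is not activated.
    """
    if idle_days < 0:
        raise ValueError("idle_days must be non-negative")
    if deep_archive_after is not None and idle_days >= deep_archive_after:
        return "Deep Archive Access"
    if archive_after is not None and idle_days >= archive_after:
        return "Archive Access"
    if idle_days >= 90:
        return "Archive Instant Access"
    if idle_days >= 30:
        return "Infrequent Access"
    return "Frequent Access"
```

For example, with only the Deep Archive Access tier activated at 365 days, an object idle for 200 days would still be in the Archive Instant Access tier and remain retrievable in milliseconds.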

See the following table for information on S3 Intelligent-Tiering access tiers for automatic storage cost savings:

Access tier	Default behavior	Use case/benefits	Retrieval time
Frequent Access	Automatic	Suitable for frequently accessed objects; low latency and high throughput performance	Milliseconds
Infrequent Access	Automatic	Suitable for infrequently accessed objects; low latency and high throughput performance	Milliseconds
Archive Instant Access	Automatic	Suitable for rarely accessed objects; low latency and high throughput performance	Milliseconds
Archive Access	Optional	Suitable for data that can be accessed asynchronously; slightly lower storage cost; same performance as the S3 Glacier Flexible Retrieval storage class	Expedited (1-5 minutes); Standard (3-5 hours); Bulk (5-12 hours)
Deep Archive Access	Optional	Suitable for data that can be accessed asynchronously; same performance as the S3 Glacier Deep Archive storage class; lowest storage cost in the cloud	Standard (within 12 hours); Bulk (within 48 hours)
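
A restore from the opt-in tiers can be sketched with the AWS SDK for Python (boto3). The helper names below are hypothetical, `s3_client` is assumed to be a boto3 S3 client, and the retrieval options allowed per tier are taken from the table above. The sketch also assumes that a restore request for an S3 Intelligent-Tiering object carries only a retrieval tier and no Days value, because the object moves back to the Frequent Access tier.

```python
# Retrieval options per opt-in tier, as listed in the table above.
ALLOWED_RETRIEVAL_TIERS = {
    "ARCHIVE_ACCESS": ("Expedited", "Standard", "Bulk"),
    "DEEP_ARCHIVE_ACCESS": ("Standard", "Bulk"),
}


def build_restore_request(archive_status: str, retrieval_tier: str = "Standard") -> dict:
    """Build the RestoreRequest payload for an archived Intelligent-Tiering object."""
    allowed = ALLOWED_RETRIEVAL_TIERS.get(archive_status)
    if allowed is None:
        raise ValueError(f"unexpected archive status: {archive_status!r}")
    if retrieval_tier not in allowed:
        raise ValueError(f"{retrieval_tier!r} is not available for {archive_status}")
    return {"GlacierJobParameters": {"Tier": retrieval_tier}}


def request_restore(s3_client, bucket: str, key: str,
                    archive_status: str, retrieval_tier: str = "Standard"):
    """Initiate the restore; `s3_client` is assumed to be a boto3 S3 client."""
    return s3_client.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest=build_restore_request(archive_status, retrieval_tier),
    )
```

Validating the retrieval option before the call keeps an unsupported combination, such as Expedited for the Deep Archive Access tier, from reaching the service.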

Get started with the S3 Intelligent-Tiering asynchronous archive access tiers

To activate S3 Intelligent-Tiering automatic archiving using the S3 console, complete the following steps:

Sign in to the AWS Management Console and open the Amazon S3 console.
In the Buckets list, choose the name of the bucket that you want.
Choose Properties.
Navigate to the Intelligent-Tiering Archive configurations section and choose Create configuration.

Create S3 Intelligent-Tiering archive configuration

In the Archive configuration settings section, specify a descriptive configuration name for your S3 Intelligent-Tiering archive configuration.
Under Choose a configuration scope, select a configuration scope to use. Optionally, you can limit the configuration scope to specified objects within a bucket using a shared prefix, object tag, or a combination of the two.

i. To limit the scope of the configuration, select Limit the scope of this configuration using one or more filters.

ii. To limit the scope of the configuration using a single prefix, enter the prefix under Prefix.

iii. To limit the scope of the configuration using object tags, select Add tag and enter a value for Key.

Under Status, select Enable.

Archive configuration settings

Next, in the Archive rule actions section, select one or both of the archive access tiers to enable.

Note:

- When you enable one or both of the archive access tiers, you can define the number of days without access after which objects transition to each tier. For the Archive Access tier, this number must be at least 90 days, and for the Deep Archive Access tier, it must be at least 180 days.
- You can also extend the last access time for archiving objects in the optional asynchronous archive tiers to a maximum of two years. For example, if you have an auditing workflow where you need to maintain millisecond access to objects for one year (365 days), you might change the number of days since last access for archiving from the Archive Instant Access tier to the Deep Archive Access tier from 180 days to 365 days.

Select Create once all the desired configuration options are selected.

Archive rule actions
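
The same configuration can also be created programmatically. The following is a minimal sketch, assuming a boto3 S3 client is passed in as `s3_client`. The helper names are hypothetical, and the day limits encode the constraints from the note above (at least 90 and 180 days, up to two years, taken here as 730 days).

```python
from typing import Dict, Optional

MIN_ARCHIVE_DAYS = 90
MIN_DEEP_ARCHIVE_DAYS = 180
MAX_DAYS = 730  # "up to two years", taken here as 730 days


def _check_days(name: str, days: int, minimum: int) -> None:
    if not minimum <= days <= MAX_DAYS:
        raise ValueError(f"{name} must be between {minimum} and {MAX_DAYS} days")


def build_archive_configuration(
    config_id: str,
    archive_days: Optional[int] = None,
    deep_archive_days: Optional[int] = None,
    prefix: Optional[str] = None,
    tags: Optional[Dict[str, str]] = None,
) -> dict:
    """Build an enabled Intelligent-Tiering archive configuration."""
    tierings = []
    if archive_days is not None:
        _check_days("archive_days", archive_days, MIN_ARCHIVE_DAYS)
        tierings.append({"Days": archive_days, "AccessTier": "ARCHIVE_ACCESS"})
    if deep_archive_days is not None:
        _check_days("deep_archive_days", deep_archive_days, MIN_DEEP_ARCHIVE_DAYS)
        tierings.append({"Days": deep_archive_days, "AccessTier": "DEEP_ARCHIVE_ACCESS"})
    if not tierings:
        raise ValueError("enable at least one archive access tier")

    config = {"Id": config_id, "Status": "Enabled", "Tierings": tierings}

    # Optional scope: a single prefix, a single tag, or a combination of both.
    tag_list = [{"Key": k, "Value": v} for k, v in (tags or {}).items()]
    if (prefix and tag_list) or len(tag_list) > 1:
        combined = {"Tags": tag_list}
        if prefix:
            combined["Prefix"] = prefix
        config["Filter"] = {"And": combined}
    elif prefix:
        config["Filter"] = {"Prefix": prefix}
    elif tag_list:
        config["Filter"] = {"Tag": tag_list[0]}
    return config


def apply_archive_configuration(s3_client, bucket: str, config: dict) -> None:
    """Store the configuration on the bucket using a boto3 S3 client."""
    s3_client.put_bucket_intelligent_tiering_configuration(
        Bucket=bucket, Id=config["Id"], IntelligentTieringConfiguration=config
    )
```

Keeping the builder separate from the API call makes it possible to review or unit test the configuration before it is applied to a bucket.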

Once you have your S3 Intelligent-Tiering archiving configuration in place, you will automatically achieve the lowest storage cost for data that is not accessed for months at a time. If you need to access these objects in the future, you can leverage the solution we discuss in this blog post. This solution automates data restores from the optional asynchronous Archive Access and Deep Archive Access tiers so that you can retrieve data from these tiers with a GET call.

Automating restores from the S3 Intelligent-Tiering asynchronous access tiers

To get started, you can download the Python script and the associated requirements.txt, IAM policy, and SNS access policy from this GitHub link.

Prerequisites

To implement the solution, you need the following prerequisites:

Access to the AWS Management Console, AWS CLI, or AWS SDK.
IAM permissions for the IAM user running the solution, as provided here.
SNS access policy permissions, as provided here.
The code from the GitHub link above, with requirements.txt saved in the same directory as the Python script. If the required dependencies are not already installed, install them from requirements.txt with the following command:

pip3 install -r requirements.txt

Implementation details

To retrieve the object(s) in the Archive Access tier or Deep Archive Access tier, run the restoreS3IntArchive.py script as follows:

Input variables needed from users:

i. S3 bucket name

ii. S3 prefix/key

iii. SNS ARN

The SNS topic ARN must match the following format:

arn:aws:sns:[a-z0-9\-]+:[a-z0-9\-]+:[a-z0-9\-]*

Example: arn:aws:sns:<Region>:<AccountId>:<SNSTopicName>
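
To catch a malformed ARN before running the script, the pattern above can be checked in Python. This is a minimal sketch: the function name is hypothetical, and matching the whole string with `re.fullmatch` is an assumption about how the pattern is meant to be applied.

```python
import re

# Pattern copied from the format shown above.
SNS_ARN_PATTERN = re.compile(r"arn:aws:sns:[a-z0-9\-]+:[a-z0-9\-]+:[a-z0-9\-]*")


def is_valid_sns_arn(arn: str) -> bool:
    """Return True when the whole string matches the SNS ARN format above."""
    return SNS_ARN_PATTERN.fullmatch(arn) is not None
```

Note that the pattern as written accepts only lowercase letters, digits, and hyphens, so a topic name containing uppercase letters or underscores would be rejected by this check even though Amazon SNS allows such names.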

Command to run the restoreS3IntArchive.py script:

python3 restoreS3IntArchive.py '<BucketName>' '<Prefix/Key>' '<SNSArn>'

Example: python3 restoreS3IntArchive.py 's3bucket' 'nothingtoseehere/nothing.csv0030part00' 'arn:aws:sns:us-east-1:123456789123:restore'

This restoreS3IntArchive.py script does the following:

Whenever customers make a GetObject call to access the object:

If an InvalidObjectState error is returned in the response, the script makes a HeadObject call to check ArchiveStatus.
If ‘ArchiveStatus’ is “DEEP_ARCHIVE_ACCESS” or “ARCHIVE_ACCESS”, the script checks the x-amz-restore header.

i. If the x-amz-restore header does not exist, the script initiates the object restore.

ii. If the x-amz-restore header reports ongoing-request="true", the script reports that a restore is in progress.

iii. Otherwise, the object is already restored.

Before restoring the object, the code checks if any other S3 event is configured on the bucket. The S3 event configuration could be null or have pre-existing events, such as PUT, COPY, or even RESTORE.
To execute the restore, the code adds the ‘s3:ObjectRestore:Post’ and ‘s3:ObjectRestore:Completed’ events to the bucket’s S3 event configuration.

i. If no S3 event is configured, add the S3 events for restore initiation and restore completion.

ii. If ‘s3:ObjectRestore:Post’, ‘s3:ObjectRestore:Completed’, or both are not configured on the bucket, create a list of the missing S3 restore events.

a. If a Topic Configuration exists, append the missing S3 restore events from the list to the existing Topic Configuration and keep the rest of the S3 event policy unchanged.

b. Otherwise, add a new Topic Configuration with the missing S3 restore events from the list and keep the rest of the S3 event policy unchanged.

iii. Check the archive status.

a. If the archive status is “DEEP_ARCHIVE_ACCESS”, restore the object {key} from the Deep Archive Access tier to the S3 Intelligent-Tiering Frequent Access tier within 12 hours.

b. Otherwise, restore the object {key} from the Archive Access tier to the S3 Intelligent-Tiering Frequent Access tier within 3-5 hours.
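
The event-configuration step above can be sketched as a function that merges the restore events into an existing notification configuration. This is not the code from the GitHub repository: the function names are hypothetical, `s3_client` is assumed to be a boto3 S3 client, and the sketch assumes that "an existing Topic Configuration" means one that already targets the same SNS topic ARN.

```python
import copy

RESTORE_EVENTS = ("s3:ObjectRestore:Post", "s3:ObjectRestore:Completed")
DESTINATION_KEYS = (
    "TopicConfigurations",
    "QueueConfigurations",
    "LambdaFunctionConfigurations",
)


def add_restore_events(notification_config: dict, topic_arn: str) -> dict:
    """Return a copy of the configuration that includes both restore events."""
    config = copy.deepcopy(notification_config)
    config.pop("ResponseMetadata", None)  # returned by boto3, not part of the config

    # Collect every event already configured on the bucket.
    configured = {
        event
        for key in DESTINATION_KEYS
        for entry in config.get(key, [])
        for event in entry.get("Events", [])
    }
    if "s3:ObjectRestore:*" in configured:
        return config  # the wildcard already covers both restore events
    missing = [event for event in RESTORE_EVENTS if event not in configured]
    if not missing:
        return config

    topic_configs = config.setdefault("TopicConfigurations", [])
    for entry in topic_configs:
        if entry.get("TopicArn") == topic_arn:
            # Append to the existing topic configuration; the rest is untouched.
            entry["Events"] = list(entry.get("Events", [])) + missing
            return config
    topic_configs.append({"TopicArn": topic_arn, "Events": missing})
    return config


def ensure_restore_notifications(s3_client, bucket: str, topic_arn: str) -> dict:
    """Read, merge, and write back the bucket notification configuration."""
    current = s3_client.get_bucket_notification_configuration(Bucket=bucket)
    updated = add_restore_events(current, topic_arn)
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=updated
    )
    return updated
```

Working on a deep copy leaves the configuration read from the bucket untouched, so the original can still be logged or compared with the updated version before it is written back.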

Note: You can incorporate the Python script restoreS3IntArchive.py from the GitHub link into your existing application code that performs GET operations to retrieve objects in the optional asynchronous archive tiers.
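
As an illustration of that kind of integration, the sketch below wraps a GET so that an archived object triggers the status checks described earlier. It is not the restoreS3IntArchive.py code. The function names and returned state strings are hypothetical, `s3_client` is assumed to be a boto3 S3 client, and the restore is assumed to use the Standard retrieval option. boto3 reports the x-amz-restore header as the `Restore` field of the HeadObject response.

```python
from typing import Optional, Tuple

ARCHIVE_STATUSES = ("ARCHIVE_ACCESS", "DEEP_ARCHIVE_ACCESS")


def classify_head_response(head: dict) -> str:
    """Map a HeadObject response to one of the states described above."""
    if head.get("ArchiveStatus") not in ARCHIVE_STATUSES:
        return "accessible"
    restore_header = head.get("Restore")  # boto3 name for x-amz-restore
    if restore_header is None:
        return "restore_needed"
    if 'ongoing-request="true"' in restore_header:
        return "restore_in_progress"
    return "restored"


def get_or_restore(s3_client, bucket: str, key: str) -> Tuple[str, Optional[dict]]:
    """Try a GET; if the object is archived, report or start a restore.

    Returns (state, response); response is None unless the GET succeeded.
    """
    try:
        return "accessible", s3_client.get_object(Bucket=bucket, Key=key)
    except Exception as err:
        # boto3 raises botocore.exceptions.ClientError, which carries the
        # service error code in err.response["Error"]["Code"]. Inspecting the
        # attribute keeps this sketch free of a botocore import.
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code != "InvalidObjectState":
            raise

    state = classify_head_response(s3_client.head_object(Bucket=bucket, Key=key))
    if state == "restore_needed":
        s3_client.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={"GlacierJobParameters": {"Tier": "Standard"}},
        )
        return "restore_requested", None
    return state, None
```

An application could call `get_or_restore` in place of a direct GetObject call and retry later when the returned state is `restore_requested` or `restore_in_progress`.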

Conclusion

In this blog post, we showed you how to activate the optional asynchronous archive access tiers within the S3 Intelligent-Tiering storage class to achieve the lowest storage cost for data that is rarely accessed. We are increasingly seeing customers activate the S3 Intelligent-Tiering Archive Access and Deep Archive Access tiers to automatically save up to 95% in storage costs when data is not accessed for months at a time, or longer. We also provided an easy-to-use solution that automates data restores when a request targets data that is not immediately accessible in the S3 Intelligent-Tiering Archive Access and Deep Archive Access tiers. You can use the solution as a framework for automatic restores in your existing application, and refer to the provided IAM permissions policy to manage your IAM permissions as needed.

If you have feedback or questions about this post, please submit your comments in the comments section.

AWS Storage Blog