AWS Storage Blog
Automatically sync files from Amazon S3 to Amazon WorkDocs
UPDATE: A companion blog post for this solution, detailing how to automatically sync files from Amazon WorkDocs to Amazon S3, was published on 9/13/2021.
Today, many customers use Amazon S3 as their primary storage service for a range of use cases, including data lakes, websites, mobile applications, backup and restore, archive, big data analytics, and more. Versatile, scalable, secure, and highly available all over the world, S3 serves as a cost-effective data storage foundation for countless application architectures. Often, customers want to share and collaborate on files and documents contained in their S3 datasets using Amazon WorkDocs. WorkDocs provides secure cloud storage and allows users to share and collaborate on content with other internal and external users easily. Additionally, Amazon WorkDocs Drive enables users to launch content directly from Windows File Explorer, macOS Finder, or Amazon WorkSpaces without consuming local disk space. Amazon S3 and Amazon WorkDocs both support rich APIs to exchange files.
Manually moving individual objects from S3 to WorkDocs for collaboration can become tedious. Many customers are looking for a way to automate the process, enabling them to have their files available for sharing and collaboration automatically.
In this post, we walk you through setting up an auto sync mechanism for synchronizing files from Amazon S3 to Amazon WorkDocs using AWS Lambda. With this solution, users in an organization can collaborate on objects in shared datasets. AWS Lambda lets you run code without provisioning or managing servers, so you stay flexible and pay only for the compute time you consume, without needing to pre-plan capacity. This tool lets end users focus on collaboration and avoid manually moving files from Amazon S3 to Amazon WorkDocs, saving them time and improving overall productivity and efficiency.
Solution overview
A common approach to automatically syncing files from Amazon S3 to Amazon WorkDocs is to set up an auto sync tool using a Python module in AWS Lambda. We show you how to create this solution in the following steps. The following diagram shows each of the steps covered in this post:
The scope of this post is limited to the following steps:
- Creating an Amazon WorkDocs folder and subfolder.
- Creating parameters in an AWS Systems Manager Parameter Store.
- Building an AWS Lambda function with Python.
- Setting up an Amazon SQS queue.
- Setting up this solution’s Amazon S3 components.
As a first step, we create the Amazon WorkDocs folder and subfolders, which generate WorkDocs folder IDs. We use AWS Systems Manager Parameter Store to capture the Amazon WorkDocs folder IDs and folder names. AWS Lambda uses the AWS Systems Manager Parameter Store to retrieve the WorkDocs folder IDs and folder names at runtime. We use an Amazon SQS queue to enable reprocessing of Amazon S3 events in case of failure while syncing Amazon S3 files to Amazon WorkDocs, ensuring that no event goes unprocessed. Amazon SQS queues the Amazon S3 events and triggers AWS Lambda. We must also create the S3 bucket and configure the events we want to receive.
Prerequisites
For the following example walkthrough, you need access to an AWS account with admin access in the us-east-1 Region.
Creating an Amazon WorkDocs folder and subfolders
We use the Amazon WorkDocs folders created in this section to sync up with Amazon S3.
If your organization has not used Amazon WorkDocs before, follow the steps to create an Amazon WorkDocs site, which generates a site URL as shown in the following screenshot.
Click the URL and log in to the site. Then, create a folder named “datalake-final-reports” by choosing Create and selecting Folder.
Once you have created the folder, it appears in WorkDocs:
Create two subfolders named “user-group-1-reports” and “user-group-2-reports” inside the “datalake-final-reports” folder.
Note the folder ID of the folder you created, along with the IDs of the two subfolders. You can find each folder and subfolder ID in the URL of its page (the string after “folder/” in the URL).
The main folder (“datalake-final-reports”) folder ID
The subfolder (“user-group-1-reports”) folder ID
The subfolder (“user-group-2-reports”) folder ID
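If you prefer to look up the IDs programmatically instead of copying them from the URLs, the WorkDocs API can list a folder’s contents. The following is a minimal sketch, assuming you already have the parent folder ID of “datalake-final-reports” from the site URL and that your IAM credentials are allowed to call Amazon WorkDocs; the placeholder ID is an assumption you replace with your own value.

import boto3

workdocs = boto3.client('workdocs', region_name='us-east-1')

# Parent folder ID of "datalake-final-reports", taken from the WorkDocs page URL (placeholder: replace with your own)
parent_folder_id = '<datalake-final-reports-folder-id>'

# List the subfolders and print their names and IDs
response = workdocs.describe_folder_contents(FolderId=parent_folder_id, Type='FOLDER')
for folder in response.get('Folders', []):
    print(folder['Name'], folder['Id'])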
Creating an AWS Systems Manager Parameter Store
You must store the Amazon WorkDocs folder IDs and folder names as a JSON object (the following code snippet) in an AWS Systems Manager Parameter Store parameter (“/dl/workdocs/folderids”). You can use a Parameter Store parameter to map Amazon S3 folders (in buckets) to corresponding Amazon WorkDocs folder IDs, as we do later in this example.
To create the parameter in the Parameter store, first go to AWS Systems Manager in the AWS Management Console, then choose Parameter Store.
Click on Create parameter, and enter the name of the parameter as “/dl/workdocs/folderids”. Then, use the following JSON string as a template for the JSON string to paste in the Value section of the parameter. This JSON value stores the Amazon WorkDocs folder IDs. Remember to replace the subfolder IDs with the ones you noted for your account in the preceding step.
{ "user-group-1-reports": "64d739bc5f2fefc21ef19a57d0f29bc8a98142993127984ab923dcd0c7bccf46",
"user-group-2-reports": "4c0986d8e12278a0cc242751b38c875d7c199906a87d5cfee87d64d741a770df" }
Keep the rest of the parameter details as is. This is what it should look like (with your unique JSON string) if the parameter is configured correctly:
Now, we filter on specific file extensions that sync up to Amazon WorkDocs by creating the following parameter. In this example, we only sync files ending with .pdf, .xlsx, and .csv from Amazon S3 to Amazon WorkDocs by storing the file extensions in a new parameter. Configure this parameter the same way as the preceding one, but name it “/dl/workdocs/fileext”, and use this JSON string as the Value:
{"file_ext":".pdf,.xlsx,.csv"}
Building an AWS Lambda function with Python
Create an AWS Lambda function named “s3-to-workdocs” using the following code block, selecting the Python 3.8 runtime. We use this Lambda code to sync files from Amazon S3 to Amazon WorkDocs.
import boto3
import requests
import os
import logging
import json

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')
ssm_client = boto3.client('ssm')
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Event
    # The Amazon S3 event received from Amazon SQS
    logger.info("Event received : " + str(event))

    # SSM parameter code
    # Reading the Amazon S3 prefixes to Amazon WorkDocs folder ID mapping
    try:
        ssm_response_folder_id = ssm_client.get_parameter(
            Name='/dl/workdocs/folderids'
        )
        ssm_param_string_folder_id = ssm_response_folder_id['Parameter']['Value']
        ssm_param_dict_folder_id = json.loads(ssm_param_string_folder_id)
        logger.info("ssm_param_dict_folder_id for workdocs configs : ")
        logger.info(ssm_param_dict_folder_id)
    except Exception as e:
        logger.error("Error with Event : " + str(event) + " Exception Stacktrace : " + str(e))
        # Raising an exception fails the AWS Lambda function and the event is retried.
        # One mechanism to handle retries is to configure a dead-letter queue (https://docs.thinkwithwp.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) as part of the Amazon SQS service.
        # Another mechanism is to skip raising the error and use Amazon CloudWatch to detect logged error messages, collect error metrics, and trigger a corresponding retry process.
        raise Exception("Error reading Amazon S3 prefixes to Amazon WorkDocs folder ID mapping from AWS Systems Manager Parameter Store : /dl/workdocs/folderids.")

    # SSM parameter for configured file extensions
    # Reading the Amazon S3 object extensions configured to be synced with Amazon WorkDocs
    try:
        ssm_response_file_ext = ssm_client.get_parameter(
            Name='/dl/workdocs/fileext'
        )
        ssm_param_string_file_ext = ssm_response_file_ext['Parameter']['Value']
        ssm_param_file_ext = str(json.loads(ssm_param_string_file_ext)['file_ext']).split(",")
        logger.info("ssm_param_file_ext for allowed file extensions is ")
        logger.info(ssm_param_file_ext)
    except Exception as e:
        logger.error("Error with Event : " + str(event) + " Exception Stacktrace : " + str(e))
        # Same retry considerations as above: a dead-letter queue or a CloudWatch-driven retry process.
        raise Exception("Error reading Amazon S3 object extensions configured to be synced with Amazon WorkDocs from AWS Systems Manager Parameter Store : /dl/workdocs/fileext.")

    # The event below is processed assuming the Amazon SQS trigger's batch size = 1.
    # If the batch size is increased beyond 1, extend this section to process all the events in a batch using a for-loop,
    # and consider how to handle individual event errors within a batch.
    try:
        evnt = event['Records'][0]
        source_bucket = json.loads(evnt['body'])['Records'][0]['s3']['bucket']['name']
        s3_object_path = json.loads(evnt['body'])['Records'][0]['s3']['object']['key']
        # If the Amazon S3 object key has special characters or spaces, handle them accordingly.
        s3_object_name = os.path.basename(s3_object_path)
        s3_obj = s3_resource.Object(source_bucket, s3_object_path)
        s3_obj_response = s3_obj.get()
        s3_obj_response_body = s3_obj_response['Body'].read()
        obj_type = s3_obj_response['ContentType']
        workdocs_client = boto3.client('workdocs')
        # Get a signed upload URL from WorkDocs by passing the folder ID from SSM
        skipped_flg = 1
        for s3_prefix, workdocid in ssm_param_dict_folder_id.items():
            if ((s3_prefix + '/') in s3_object_path) and s3_object_path.endswith(tuple(ssm_param_file_ext)):
                workdocs_response = workdocs_client.initiate_document_version_upload(
                    Name=s3_object_name,
                    ContentType=obj_type,
                    ParentFolderId=workdocid
                )
                # Upload the file into the WorkDocs folder
                URL = workdocs_response['UploadMetadata']['UploadUrl']
                workdocs_upload = requests.put(URL, headers=workdocs_response['UploadMetadata']['SignedHeaders'], data=s3_obj_response_body)
                logger.info(f'File upload HTTP status code: {workdocs_upload.status_code}')
                # Mark the uploaded document version as the latest active version
                update_response = workdocs_client.update_document_version(
                    DocumentId=workdocs_response['Metadata']['Id'],
                    VersionId=workdocs_response['Metadata']['LatestVersionMetadata']['Id'],
                    VersionStatus='ACTIVE'
                )
                logger.info('Successfully transferred object {} from bucket {} '.format(s3_object_path, source_bucket))
                skipped_flg = 0
        if skipped_flg == 1:
            logger.info('Skipped transferring object {} from bucket {} '.format(s3_object_path, source_bucket))
    except Exception as e:
        logger.error('Error transferring object {} from bucket {}. Exception is {} '.format(s3_object_path, source_bucket, str(e)))
        logger.error("Error with Event : " + str(event) + " Exception Stacktrace : " + str(e))
        # Same retry considerations as above: a dead-letter queue or a CloudWatch-driven retry process.
        raise Exception("The event processing ran into issues")
Create an AWS Lambda layer for Python’s requests library (2.2.4) and its dependencies, compatible with Python 3.8. This is necessary because the function uses the requests library to upload files to the Amazon WorkDocs signed URL.
The AWS Lambda layer created based on Python’s requests library (2.2.4).
Add the AWS Lambda layer to the AWS Lambda function. For more details, refer to the documentation on configuring a function to use layers.
Update the AWS Lambda function “s3-to-workdocs” Timeout and Memory (MB) settings as shown in the following screenshot (15 minutes 0 seconds and 3008 MB, respectively). This is the maximum timeout value and sufficient memory to support files up to 2 GB, helping avoid failures due to timeouts. For more details, refer to the documentation on configuring Lambda function memory.
Update the AWS Identity and Access Management (IAM) policies of the AWS Lambda function “s3-to-workdocs” in the Permissions tab of the function. This grants the function the permissions it needs to access the necessary resources at each step. For more details, refer to the documentation on the AWS Lambda execution role.
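If you prefer to apply these settings from code rather than the console, a minimal boto3 sketch might look like the following, assuming the function name “s3-to-workdocs” used in this post.

import boto3

lambda_client = boto3.client('lambda', region_name='us-east-1')

# 900 seconds (15 minutes) timeout and 3008 MB of memory, matching the console settings above
lambda_client.update_function_configuration(
    FunctionName='s3-to-workdocs',
    Timeout=900,
    MemorySize=3008
)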
In this example, we add the following AWS IAM policies (For more details, refer to the documentation on adding IAM identity permissions):
- SecretsManagerReadWrite
- AmazonSQSFullAccess
- AmazonS3FullAccess
- AmazonWorkDocsFullAccess
Note: In this example, the IAM policies provide the AWS Lambda function with full access to the concerned AWS services, for simplicity. We recommend expanding the AWS Lambda function’s IAM roles to provide a production environment with access that is more granular. For more details, please refer to the documentation on policies and permissions in IAM.
We also add an inline policy (“ssm-get_parameters-policy”) using the following snippet:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:DescribeParameters"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameters"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter"
      ],
      "Resource": "*"
    }
  ]
}
The following screenshot shows the updated AWS Lambda function, after adding the IAM policies.
Note: The IAM role name could be different in your case.
Setting up an Amazon SQS queue
As mentioned, we use an Amazon SQS queue to ensure that Amazon S3 event notifications can be retried on failure and that all events are processed. Amazon SQS also triggers our AWS Lambda function, facilitating the next step in the solution.
Create an Amazon SQS queue named “s3_work_doc_queue” and set the visibility timeout to 900 seconds. Keep the rest of the parameters at their defaults.
Now, add the Amazon SQS queue (“s3_work_doc_queue”) as a trigger to the AWS Lambda function. Go to the AWS Lambda service, select the AWS Lambda function (“s3-to-workdocs”), and click on Add trigger. For more details, please refer to Using AWS Lambda with Amazon SQS.
Set the Batch size of the Amazon SQS trigger to 1, because we designed the Lambda function in this example to process each Amazon S3 event independently. Refer to the “Lambda batch size” section to see how you can enhance this solution for batch processing of Amazon S3 events.
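As an alternative to the console, the trigger can also be wired up with boto3. This sketch assumes the queue and function names used in this post, with your own account ID substituted into the queue ARN.

import boto3

lambda_client = boto3.client('lambda', region_name='us-east-1')

# Add the SQS queue as an event source for the Lambda function, with a batch size of 1
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:sqs:us-east-1:<AccountID>:s3_work_doc_queue',
    FunctionName='s3-to-workdocs',
    BatchSize=1
)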
Setting up this solution’s Amazon S3 components
Amazon S3 now supports strong read-after-write consistency. This ensures that the AWS Lambda function (in this example) reads the latest write of the object in Amazon S3. You can also maintain multiple versions of Amazon S3 objects by enabling Amazon S3 versioning. In this example, we use the default setting (versioning disabled), but you can enable it based on your business needs.
Create an Amazon S3 bucket, meeting the bucket naming standards. Select the bucket and create folders within the bucket using the same name as Amazon WorkDocs subfolders (“user-group-1-reports” and “user-group-2-reports”) as shown in the following screenshot:
Update the Amazon SQS access policy
Go to the Amazon SQS service and select the Amazon SQS queue (“s3_work_doc_queue”). Navigate to the Access Policy tab, and click Edit to update the Amazon SQS queue access policy.
Replace the existing policy JSON statement with the following one. This access policy gives the Amazon S3 service permission to send Amazon S3 events to this queue. For more details, please refer to the documentation on creating an Amazon SQS queue.
Note: Replace the <AccountID> with your AWS account ID and the <bucket_name> with the bucket name created in the preceding step.
{
  "Version": "2008-10-17",
  "Id": "__default_policy_ID",
  "Statement": [
    {
      "Sid": "__owner_statement",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<AccountID>:root"
      },
      "Action": "SQS:*",
      "Resource": "arn:aws:sqs:us-east-1:<AccountID>:s3_work_doc_queue"
    },
    {
      "Sid": "s3_event-access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "SQS:SendMessage",
      "Resource": "arn:aws:sqs:us-east-1:<AccountID>:s3_work_doc_queue",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "<AccountID>"
        },
        "ArnLike": {
          "aws:SourceArn": "arn:aws:s3:*:*:<bucket_name>"
        }
      }
    }
  ]
}
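If you would rather apply this access policy from code instead of the console, a sketch along these lines should work; it assumes the policy JSON above is saved locally in a file (the file name is a placeholder of our choosing) with your account ID and bucket name substituted.

import boto3
import json

sqs = boto3.client('sqs', region_name='us-east-1')

queue_url = sqs.get_queue_url(QueueName='s3_work_doc_queue')['QueueUrl']

# Load the access policy JSON shown above (hypothetical local file name)
with open('s3_work_doc_queue_policy.json') as f:
    access_policy = json.load(f)

# Attach the policy to the queue
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={'Policy': json.dumps(access_policy)}
)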
Setting up Amazon S3 event notifications to send events to Amazon SQS
Go to Amazon S3 in the AWS Management Console and select the Amazon S3 bucket created in the preceding step. Then, select the Properties tab, and click on Create event notification.
Provide a name in the Event name section.
Under Event Types select All object create events.
Configure the Destination as an Amazon SQS queue (“s3_work_doc_queue”), then choose Save changes.
After configuring the Amazon S3 events (All object create events) to send the events to an Amazon SQS (“s3_work_doc_queue”), you can see Event Types and Destination types on the Event notifications page, as shown in the following screenshot:
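The same event notification can also be configured programmatically. The following sketch assumes your bucket name and account ID as placeholders; it sends all object-create events from the bucket to the queue.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# Send all object-create events from the bucket to the s3_work_doc_queue SQS queue
s3.put_bucket_notification_configuration(
    Bucket='<bucket_name>',
    NotificationConfiguration={
        'QueueConfigurations': [
            {
                'QueueArn': 'arn:aws:sqs:us-east-1:<AccountID>:s3_work_doc_queue',
                'Events': ['s3:ObjectCreated:*']
            }
        ]
    }
)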
Testing the solution
As a first step, verify that the “user-group-1-reports” folder in the Amazon S3 bucket is empty.
Also verify that the Amazon WorkDocs subfolder “user-group-1-reports” is empty.
For testing, upload two sample files “weekly_ext_vendor_sales_report.csv” and “not_needed.msid” into the Amazon S3 bucket’s folder “user-group-1-reports.”
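If you want to script this test, the two sample files can be uploaded with boto3; the bucket name is a placeholder, and the file names are the example values used in this post.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# Only the .csv file should sync to WorkDocs; the .msid file should be skipped by the extension filter
for file_name in ['weekly_ext_vendor_sales_report.csv', 'not_needed.msid']:
    s3.upload_file(file_name, '<bucket_name>', 'user-group-1-reports/' + file_name)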
Verify the synchronization by going to the Amazon WorkDocs folder “user-group-1-reports”. Only the “weekly_ext_vendor_sales_report.csv” file is synced from Amazon S3 to Amazon WorkDocs, as the configured file extensions in AWS System Manager’s Parameter Store (/dl/workdocs/fileext) are .pdf, .xlsx, and .csv.
The Lambda function also publishes logs and metrics to Amazon CloudWatch, which can monitor the sync activity between Amazon S3 and Amazon WorkDocs.
Cleaning up and pricing
To avoid incurring future charges, delete the resources set up as part of this post:
- Amazon WorkDocs Site URL
- Amazon S3 bucket
- AWS Systems Manager parameters
- Amazon SQS queue
- AWS Lambda function
For the cost details, please refer to the following service pages:
- Amazon S3 pricing
- AWS Lambda pricing
- Amazon SQS pricing
- AWS Systems Manager pricing
- Amazon WorkDocs pricing
Things to consider
This solution should help you set up an auto sync mechanism for files from Amazon S3 to Amazon WorkDocs. For more ways to expand this solution, consider the following factors.
Note: AWS services generate events that invoke Lambda functions, and Lambda functions can send messages to AWS services. To avoid infinite loops, take care that Lambda functions do not invoke services or APIs in a way that triggers another invocation of that function.
File size
This solution is designed to handle files in the range of a few MBs to 2 GB. For bulk loads and large files, follow the steps in the documentation on migrating files to WorkDocs.
Monitoring
Monitoring can be done using Amazon CloudWatch, which acts as a centralized logging service for all AWS services. You can configure Amazon CloudWatch to trigger alarms for AWS Lambda failures, and you can further configure those alarms to trigger processes that re-upload or copy the failed Amazon S3 objects. Another approach is to configure an Amazon SQS dead-letter queue, which captures failed messages after the configured number of retries so that a retry process can be invoked.
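As one possibility, a minimal sketch of a CloudWatch alarm on the function’s Errors metric might look like the following; the alarm name and SNS topic ARN are hypothetical placeholders, not part of the solution above.

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Alarm when the s3-to-workdocs function reports any errors within a 5-minute period
cloudwatch.put_metric_alarm(
    AlarmName='s3-to-workdocs-errors',
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Dimensions=[{'Name': 'FunctionName', 'Value': 's3-to-workdocs'}],
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:<AccountID>:workdocs-sync-alerts']
)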
IAM policy
We recommend you turn on S3 Block Public Access to ensure that your data remains private. To ensure that public access to all your S3 buckets and objects is blocked, turn on block all public access at the account level. These settings apply account-wide for all current and future buckets. If you require some level of public access to your buckets or objects, you can customize the individual settings to suit your specific storage use cases. Also, update AWS Lambda execution IAM role policy and Amazon SQS access policy to follow the standard security advice of granting least privilege, or granting only the permissions required to perform a task.
Lambda batch size
For our example in this blog post, we used a batch size of 1 for the AWS Lambda function’s Amazon SQS trigger. This can be modified, as shown in the following screenshot, to process multiple events in a single batch. In addition, you can extend the AWS Lambda function code to process multiple events and handle partial failures in a particular batch.
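One way to extend the handler for larger batches is to loop over the SQS records and report partial failures back to the trigger using the SQS ReportBatchItemFailures feature, which is an addition beyond this post’s code. A minimal sketch, where process_s3_event is a hypothetical helper wrapping the per-object logic shown earlier:

def lambda_handler(event, context):
    # Process each SQS record independently and report only the failed ones
    batch_item_failures = []
    for record in event['Records']:
        try:
            process_s3_event(record)  # hypothetical helper wrapping the per-object sync logic above
        except Exception:
            batch_item_failures.append({'itemIdentifier': record['messageId']})
    # Returned to Lambda so only failed messages are retried (requires ReportBatchItemFailures on the trigger)
    return {'batchItemFailures': batch_item_failures}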
Conclusion
In this post, we demonstrated a solution for setting up an auto sync mechanism for synchronizing files from Amazon S3 to Amazon WorkDocs in near-real time. Using Amazon S3, Amazon WorkDocs, AWS Lambda, and Amazon SQS, you can set up a simple architecture to have a live and continuously updated Amazon WorkDocs destination for files. This eliminates the tedious manual activity of moving files from Amazon S3 to Amazon WorkDocs, so you can focus on collaboration and core competencies.
Thanks for reading this post on automatically syncing files from Amazon S3 to Amazon WorkDocs. If you have any feedback or questions, please leave them in the comments section. You can also start a new thread on the Amazon WorkDocs forum.