AWS Storage Blog
Automatically sync files from Amazon WorkDocs to Amazon S3
Today, many customers use Amazon S3 as their primary storage service for various use cases, including data lakes, websites, mobile applications, backup and restore, archive, big data analytics, and more. Versatile, scalable, secure, and highly available worldwide, S3 serves as a cost-effective data storage foundation for countless application architectures. Often, customers want to exchange files and documents between Amazon WorkDocs and Amazon S3. In our previous blog, we covered the process to auto-sync files from Amazon S3 to Amazon WorkDocs. In this blog post, we cover the sync process from Amazon WorkDocs to Amazon S3.
WorkDocs provides secure cloud storage and allows users to share and collaborate on content with other internal and external users easily. Additionally, Amazon WorkDocs Drive enables users to launch content directly from Windows File Explorer, macOS Finder, or Amazon WorkSpaces without consuming local disk space. Amazon S3 and Amazon WorkDocs both support rich API operations to exchange files.
Manually moving individual objects from WorkDocs to Amazon S3 can become tedious. Many customers are looking for a way to automate the process, enabling them to have their files available in S3 for further processing.
In this post, we walk you through setting up an auto-sync mechanism for synchronizing files from Amazon WorkDocs to Amazon S3 using Amazon API Gateway and AWS Lambda. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. AWS Lambda lets you run code without provisioning or managing servers, so you stay flexible and pay only for the compute time you consume without needing to pre-plan capacity. This tool enables end users to focus on analyzing data and avoid manually moving files from Amazon WorkDocs to Amazon S3, saving them time and improving overall productivity and efficiency.
Solution overview
A common approach to automatically syncing files from Amazon WorkDocs to Amazon S3 is to set up an auto-sync tool using a Python module in AWS Lambda. We show you how to create this solution in the following steps. The following diagram shows each of the steps covered in this post:
The scope of this post is limited to the following steps:
- Creating Amazon WorkDocs folders
- Setting up this solution’s Amazon S3 components
- Creating parameters in AWS Systems Manager Parameter Store
- Setting up an Amazon SQS queue
- Setting up Amazon API Gateway
- Building AWS Lambda code with Python
- Setting up the WorkDocs notification
- Testing the Solution
As a first step, we create the Amazon WorkDocs folders, which generate WorkDocs folder IDs. We also set up an Amazon S3 bucket to receive the files. We use AWS Systems Manager Parameter Store to capture the Amazon S3 bucket name, WorkDocs folder IDs, folder names, and file extensions that need to sync. AWS Lambda uses the AWS Systems Manager Parameter Store to retrieve the information stored. We use Amazon API Gateway to integrate with Amazon SQS. We use an Amazon SQS queue to reprocess API events in case of a failure while syncing Amazon WorkDocs files to Amazon S3. Amazon SQS queues the Amazon API Gateway events and triggers AWS Lambda. As part of the process, we also enable WorkDocs notifications and subscribe to it using API Gateway to process the events generated from Amazon WorkDocs.
Prerequisites
For the following example walkthrough, you need access to an AWS account with admin access in the us-east-1 Region.
1. Creating Amazon WorkDocs folders
We use the Amazon WorkDocs folders created in this section to sync up with Amazon S3.
If your organization has not used Amazon WorkDocs before, follow the steps to create an Amazon WorkDocs site, which generates a site URL as shown in the following screenshot. Then, select the site URL and log in to the site.
Then, create a folder named “test_user_1_reports” by choosing Create and selecting Folder.
Once you have created the folder, it appears in WorkDocs.
Note the folder ID for the folder you created. Find the folder ID in the URL of each page (after the word “folder/” in the URL).
The “test_user_1_reports” folder ID
2. Setting up this solution’s Amazon S3 components
Create an Amazon S3 bucket with public access blocked and with the default encryption of SSE-S3. This configuration is for this sample solution; when configuring an Amazon S3 bucket for your own use, follow your organization's compliance requirements.
3. Creating parameters in AWS Systems Manager Parameter Store
1. Create a parameter named "/dl/workdocstos3/bucketname" for storing the Amazon S3 bucket name.
2. Create a parameter named "/dl/workdocstos3/folderids" for storing the mapping between your Amazon WorkDocs folder IDs and Amazon S3 prefixes.
- Sample value: {"7532e719cd8f28088c920cc1816506389a4deb9db1b50c3e6dc70af665ed6dec":"test_user_1_reports"}
3. Create a parameter named "/dl/workdocstos3/fileext" for storing the file extensions that should be synced from Amazon WorkDocs to Amazon S3.
- Sample value: {"file_ext":".pdf,.xlsx,.csv"}
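If you prefer to script this step, here is a minimal sketch using boto3 that creates the three parameters; the bucket name and folder ID values are placeholders that you should replace with your own.

import boto3

ssm = boto3.client('ssm')

# Placeholder values; substitute your bucket name and WorkDocs folder ID.
params = {
    '/dl/workdocstos3/bucketname': 'my-workdocs-sync-bucket',
    '/dl/workdocstos3/folderids': '{"<workdocs folder id>":"test_user_1_reports"}',
    '/dl/workdocstos3/fileext': '{"file_ext":".pdf,.xlsx,.csv"}',
}
for name, value in params.items():
    ssm.put_parameter(Name=name, Value=value, Type='String', Overwrite=True)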
4. Setting up an Amazon SQS queue
Create an Amazon SQS queue with the Default visibility timeout set to 15 minutes.
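If you prefer to create the queue programmatically, a minimal boto3 sketch follows; the queue name is a placeholder, and 900 seconds corresponds to the 15-minute visibility timeout.

import boto3

sqs = boto3.client('sqs')

# Queue name is a placeholder; 900 seconds matches the 15-minute Lambda timeout used later.
response = sqs.create_queue(
    QueueName='workdocs-to-s3-queue',
    Attributes={'VisibilityTimeout': '900'},
)
print(response['QueueUrl'])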
Create an IAM role to integrate Amazon SQS with Amazon API Gateway. Choose API Gateway as a use case and create the role.
Use the default policy as shown in the following screenshot and create the role.
Once the role is created, attach the additional policy "AmazonSQSFullAccess" to the same role.
As shown in the following screenshot, you should have both policies attached to the IAM role.
5. Setting up Amazon API Gateway
Create an API in API Gateway with REST API as the API type.
Choose REST as your protocol and select New API. Then, select Edge optimized as your Endpoint Type.
Once the API is created, select Create Method.
Create a POST method, as shown in the following screenshot.
Once you select the POST method, select the checkmark icon as shown in the following screenshot:
Fill in the details per the following screenshot and Save.
- Path override should have the value <AWS account#>/<SQS queue name>
- Execution Role should have the value of the IAM role ARN created in the preceding section.
Select Integration Request, as shown in the following screenshot.
Fill in the HTTP Headers and Mapping Templates sections, as shown in the following screenshot.
- Under HTTP Headers:
  - Name: Content-Type
  - Mapped from: 'application/x-www-form-urlencoded'
- To integrate API Gateway with Amazon SQS, we need to map the incoming message body to the MessageBody of the Amazon SQS service and set the Action to SendMessage. For details, refer to "How do I use API Gateway as a proxy for another AWS service?" For this solution's walkthrough, under Mapping Templates choose text/plain as the Content-Type, and under Generate template provide the following value, then save it:

Action=SendMessage&MessageBody=$util.urlEncode($input.body)
Once it's saved, deploy the API by choosing Deploy API under Actions, as shown in the following screenshot.
Under the Deploy API prompt, fill in the details as shown in the following screenshot, and then Deploy.
Also, capture the API endpoint URL from the Stages tab, as shown in the following screenshot.
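Optionally, you can smoke-test the integration at this point by posting a small body to the endpoint and confirming that a message lands in the SQS queue. A minimal sketch, assuming the placeholder invoke URL below is replaced with the one you captured:

import requests

# Placeholder endpoint; copy the invoke URL from the Stages tab.
api_url = 'https://<api-id>.execute-api.us-east-1.amazonaws.com/<stage-name>'

resp = requests.post(api_url, data='hello from smoke test',
                     headers={'Content-Type': 'text/plain'})
print(resp.status_code)  # 200 indicates API Gateway forwarded the body to the SQS queue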
6. Building AWS Lambda code with Python
Create an AWS Lambda function with the name “workdocs_to_s3” using the following function code. Select the Python runtime version 3.8.
Also, create an AWS Lambda layer compatible with Python 3.8 for Python's Requests library (2.2.4) and its dependencies.
import json
import boto3
import requests
import logging

sns_client = boto3.client('sns')
ssm_client = boto3.client('ssm')
workdocs_client = boto3.client('workdocs')
s3_client = boto3.client('s3')

logger = logging.getLogger()
logger.setLevel(logging.INFO)

## Confirm the subscription request sent by Amazon WorkDocs
def confirmsubscription(topicArn, subToken):
    try:
        response = sns_client.confirm_subscription(
            TopicArn=topicArn,
            Token=subToken
        )
        logger.info("Amazon WorkDocs Subscription Confirmation Message : " + str(response))
    except Exception as e:
        logger.error("Error with subscription confirmation : Exception Stacktrace : " + str(e))
        # Raising here fails the AWS Lambda function so that the event is retried.
        # One mechanism to handle retries is to configure a dead-letter queue (https://docs.thinkwithwp.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) on the Amazon SQS queue.
        # Another mechanism is to skip raising the error and instead use Amazon CloudWatch to detect logged error messages, collect error metrics, and trigger a corresponding retry process.
        raise Exception("Error Confirming Subscription from Amazon WorkDocs")

def copyFileworkdocstos3(documentid):
    # Read the Amazon S3 bucket name, the Amazon WorkDocs folder ID to Amazon S3 prefix mapping,
    # and the configured file extensions from AWS Systems Manager Parameter Store.
    try:
        bucketnm = str(ssm_client.get_parameter(Name='/dl/workdocstos3/bucketname')['Parameter']['Value'])
        folder_ids = json.loads(ssm_client.get_parameter(Name='/dl/workdocstos3/folderids')['Parameter']['Value'])
        file_exts = str(json.loads(ssm_client.get_parameter(Name='/dl/workdocstos3/fileext')['Parameter']['Value'])['file_ext']).split(",")
        logger.info("Configured Amazon S3 Bucket Name : " + bucketnm)
        logger.info("Configured Folder Ids to be synced : " + str(folder_ids))
        logger.info("Configured Supported File Extensions : " + str(file_exts))
        resp_doc = workdocs_client.get_document(DocumentId=documentid)
        logger.info("Amazon WorkDocs Metadata Response : " + str(resp_doc))
        # Retrieve the Amazon WorkDocs document metadata
        parentfolderid = str(resp_doc['Metadata']['ParentFolderId'])
        docversionid = str(resp_doc['Metadata']['LatestVersionMetadata']['Id'])
        docname = str(resp_doc['Metadata']['LatestVersionMetadata']['Name'])
        logger.info("Amazon WorkDocs Parent Folder Id : " + parentfolderid)
        logger.info("Amazon WorkDocs Document Version Id : " + docversionid)
        logger.info("Amazon WorkDocs Document Name : " + docname)
        prefix_path = folder_ids.get(parentfolderid, None)
        logger.info("Retrieved Amazon S3 Prefix Path : " + str(prefix_path))
        ## This sample code syncs documents for the Amazon WorkDocs folder IDs configured in AWS Systems Manager, not for their sub-folders.
        ## It can be extended to support syncing documents from sub-folders.
        if (prefix_path is not None) and docname.endswith(tuple(file_exts)):
            resp_doc_version = workdocs_client.get_document_version(
                DocumentId=documentid,
                VersionId=docversionid,
                Fields='SOURCE'
            )
            logger.info("Retrieved Amazon WorkDocs Document Latest Version Details : " + str(resp_doc_version))
            ## Retrieve the Amazon WorkDocs download URL
            url = resp_doc_version["Metadata"]["Source"]["ORIGINAL"]
            logger.info("Amazon WorkDocs Download url : " + url)
            ## Retrieve the Amazon WorkDocs document contents.
            ## This sample code reads the document into memory, but it can be enhanced to stream the document in chunks to Amazon S3 to improve memory utilization.
            workdocs_resp = requests.get(url)
            ## Upload the Amazon WorkDocs document to Amazon S3
            response = s3_client.put_object(
                Body=bytes(workdocs_resp.content),
                Bucket=bucketnm,
                Key=f'{prefix_path}/{docname}',
            )
            logger.info("Amazon S3 upload response : " + str(response))
        else:
            logger.info("Unsupported File type")
    except Exception as e:
        logger.error("Error with processing Document : " + str(documentid) + " Exception Stacktrace : " + str(e))
        # Raising here fails the AWS Lambda function so that the event is retried.
        # One mechanism to handle retries is to configure a dead-letter queue (https://docs.thinkwithwp.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) on the Amazon SQS queue.
        # Another mechanism is to skip raising the error and instead use Amazon CloudWatch to detect logged error messages, collect error metrics, and trigger a corresponding retry process.
        raise Exception("Error Processing Amazon WorkDocs Events.")

def lambda_handler(event, context):
    logger.info("Event Received from Amazon WorkDocs : " + str(event))
    msg_body = json.loads(str(event['Records'][0]['body']))
    ## Process the Amazon WorkDocs subscription confirmation event
    if msg_body['Type'] == 'SubscriptionConfirmation':
        confirmsubscription(msg_body['TopicArn'], msg_body['Token'])
    ## Process Amazon WorkDocs notifications
    elif msg_body['Type'] == 'Notification':
        event_msg = json.loads(msg_body['Message'])
        ## Process the Amazon WorkDocs move document event
        if event_msg['action'] == 'move_document':
            copyFileworkdocstos3(event_msg['entityId'])
        ## Process the Amazon WorkDocs upload document event, emitted when a new version of the document is uploaded
        elif event_msg['action'] == 'upload_document_version':
            copyFileworkdocstos3(event_msg['parentEntityId'])
        else:
            ## This sample code handles two Amazon WorkDocs events but can be extended to process others.
            ## Refer to https://docs.thinkwithwp.com/workdocs/latest/developerguide/subscribe-notifications.html for details on other supported Amazon WorkDocs events.
            logger.info("Unsupported Action Type")
    else:
        ## This sample code handles two Amazon WorkDocs events but can be extended to process others.
        ## Refer to https://docs.thinkwithwp.com/workdocs/latest/developerguide/subscribe-notifications.html for details on other supported Amazon WorkDocs events.
        logger.info("Unsupported Event Type")
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Amazon WorkDocs sync to Amazon S3 Lambda!')
    }
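To sanity-check the handler before wiring up WorkDocs notifications, you can invoke it with a test event shaped like the SQS-wrapped SNS notification it receives. A minimal sketch with placeholder IDs:

import json

# Hypothetical test event; entityId and parentEntityId are placeholder WorkDocs IDs.
test_event = {
    'Records': [{
        'body': json.dumps({
            'Type': 'Notification',
            'Message': json.dumps({
                'action': 'upload_document_version',
                'entityId': '<document version id>',
                'parentEntityId': '<document id>',
            })
        })
    }]
}
# lambda_handler(test_event, None)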
The following screenshot shows the AWS Lambda layer created based on Python's Requests library (2.2.4):
Add the AWS Lambda layer to the AWS Lambda function. For more details, refer to the documentation on configuring a function to use layers.
Update the AWS Lambda function "workdocs_to_s3" Timeout and Memory (MB) settings as shown in the following screenshot (15 min 0 sec and 3008 MB, respectively). For more details, refer to the documentation on configuring Lambda function memory.
Update the IAM execution role of the AWS Lambda function "workdocs_to_s3" by selecting the function and navigating to the Permissions tab. For more details, refer to the documentation on the AWS Lambda execution role.
In this example, we add the following AWS managed policies:
- AmazonSQSFullAccess
- AmazonS3FullAccess
- AmazonSSMFullAccess
- AmazonSNSFullAccess
- AmazonWorkDocsFullAccess
Note: For simplicity, in this example the AWS Lambda IAM execution role is granted full access to the concerned AWS services. For a production environment, we recommend restricting the AWS Lambda function's IAM execution role to more granular access. For more details, refer to the documentation on policies and permissions in IAM.
Attach all the required policies, as shown in the following screenshot.
Add a trigger to AWS Lambda by using the SQS Queue that was created. Change the Batch size to 1.
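If you prefer to script the trigger, a minimal boto3 sketch follows; the queue ARN, account ID, and queue name are placeholders.

import boto3

lambda_client = boto3.client('lambda')

# The queue ARN below is a placeholder for the queue created earlier.
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:sqs:us-east-1:<account-id>:workdocs-to-s3-queue',
    FunctionName='workdocs_to_s3',
    BatchSize=1,
)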
7. Setting up the WorkDocs notification
You need an IAM role to set up WorkDocs notifications. For the purposes of this blog, we use an admin role. Refer to the Amazon WorkDocs documentation on notifications for more details.
In the WorkDocs console, access WorkDocs notifications by selecting Manage Notifications under Actions, as shown in the following screenshot.
Select Enable Notification, as shown in the following screenshot:
Provide the ARN from the preceding section and select Enable.
Access AWS CloudShell from the AWS Management Console. Run the following command to subscribe to the notification. The organization-id value is your directory ID from AWS Directory Service.
aws workdocs create-notification-subscription \
    --organization-id <directory id from Directory Service> \
    --protocol HTTPS \
    --subscription-type ALL \
    --notification-endpoint <API endpoint from the Setting up Amazon API Gateway step>
8. Testing the Solution
First, verify that the WorkDocs folder and Amazon S3 bucket are empty. Then, upload a file into the WorkDocs folder.
Next, you should see that the file is available in Amazon S3.
Things to consider
This solution should help you set up an auto-sync mechanism for files from Amazon WorkDocs to Amazon S3. For more ways to expand this solution, consider the following factors.
File size
This solution is designed to handle files in the range of a few MBs to 2 GB. As part of the solution, the file is read in memory before syncing it to Amazon S3, but the Lambda code can be enhanced to stream the file in chunks to improve memory utilization and handle large files.
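As one possible enhancement, the upload step could be reworked along the following lines, passing the HTTP response stream to Amazon S3's managed transfer (which performs multipart uploads) instead of buffering the whole file in memory. This is a sketch, not a drop-in replacement for the sample function:

import boto3
import requests

s3_client = boto3.client('s3')

def stream_workdocs_to_s3(url, bucket, key):
    # Stream the WorkDocs download and let upload_fileobj chunk it into a
    # multipart upload, instead of holding the entire file in memory.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        s3_client.upload_fileobj(resp.raw, bucket, key)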
Monitoring
Monitoring can be done using Amazon CloudWatch, which acts as a centralized logging service for AWS services. You can configure Amazon CloudWatch to trigger alarms on AWS Lambda failures, and further configure those alarms to trigger processes that re-upload or copy the failed Amazon S3 objects. Another approach is to configure an Amazon SQS dead-letter queue, which captures messages that fail after the configured number of retries so that a retry process can be invoked.
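For example, the following sketch creates a CloudWatch alarm on the function's Errors metric; the alarm name is hypothetical, and you would add alarm actions (such as an SNS topic ARN) to notify or kick off a retry process:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm name is hypothetical; add AlarmActions (for example, an SNS topic ARN) to notify on failures.
cloudwatch.put_metric_alarm(
    AlarmName='workdocs-to-s3-lambda-errors',
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'workdocs_to_s3'}],
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
)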
IAM policy
We recommend you turn on S3 Block Public Access to ensure that your data remains private. To ensure that public access to all your S3 buckets and objects is blocked, turn on block all public access at the account level. These settings apply account-wide for all current and future buckets. If you require some level of public access to your buckets or objects, you can customize the individual settings to suit your specific storage use cases. Also, update your AWS Lambda execution IAM role policy, Amazon WorkDocs enable notification role, and Amazon SQS access policy to follow the standard security advice of granting least privilege or granting only the permissions required to perform a task.
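As an illustration of least privilege, the following sketch attaches an inline policy to the Lambda execution role scoped to the specific resources this solution touches; all resource names are placeholders, and you should adapt the statements to your environment:

import json
import boto3

iam = boto3.client('iam')

# All resource names below are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::<bucket-name>/*"},
        {"Effect": "Allow", "Action": "ssm:GetParameter",
         "Resource": "arn:aws:ssm:us-east-1:<account-id>:parameter/dl/workdocstos3/*"},
        {"Effect": "Allow",
         "Action": ["workdocs:GetDocument", "workdocs:GetDocumentVersion"],
         "Resource": "*"},
        {"Effect": "Allow", "Action": "sns:ConfirmSubscription", "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"],
         "Resource": "arn:aws:sqs:us-east-1:<account-id>:<queue-name>"},
    ],
}
iam.put_role_policy(
    RoleName='<lambda-execution-role>',
    PolicyName='workdocs-to-s3-scoped',
    PolicyDocument=json.dumps(policy),
)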
Amazon WorkDocs document locked
If a WorkDocs document is locked for collaboration, it syncs to Amazon S3 only after the document is unlocked or released.
Lambda batch size
For our example in this blog post, we used a batch size of 1 for the AWS Lambda function’s Amazon SQS trigger. As shown in the following screenshot, this can be modified to process multiple events in a single batch. In addition, you can extend the AWS Lambda function code to process multiple events and handle partial failures in a particular batch.
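With a batch size greater than 1, one way to handle partial failures is SQS partial batch responses: enable ReportBatchItemFailures on the event source mapping and return the IDs of the messages that failed, so that only those are retried. A minimal sketch of the handler shape, where process_record is a hypothetical per-message handler:

def lambda_handler(event, context):
    # Report only the failed messages back to SQS so successful ones are not retried.
    failures = []
    for record in event['Records']:
        try:
            process_record(record)  # hypothetical per-message handler
        except Exception:
            failures.append({'itemIdentifier': record['messageId']})
    return {'batchItemFailures': failures}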
Note: AWS services generate events that invoke Lambda functions, and Lambda functions can send messages to AWS services. To avoid infinite loops, take care that your Lambda functions do not invoke services or APIs in a way that triggers another invocation of the same function.
Cleaning up and pricing
To avoid incurring future charges, delete the resources set up as part of this post:
- Amazon WorkDocs
- API Gateway
- Amazon SQS
- Systems Manager parameters
- AWS Lambda
- S3 bucket
- IAM roles
For the cost details, please refer to the service pages: Amazon S3 pricing, Amazon API Gateway pricing, Lambda pricing, Amazon SQS pricing, AWS Systems Manager pricing, and Amazon WorkDocs pricing.
Conclusion
This post demonstrated a solution for setting up an auto-sync mechanism that synchronizes files from Amazon WorkDocs to Amazon S3 in near real time using Amazon API Gateway and AWS Lambda. This avoids the tedious manual activity of moving files from Amazon WorkDocs to Amazon S3 and lets customers focus on data analysis.
Thanks for reading this post on automatically syncing files from Amazon WorkDocs to Amazon S3. If you have any feedback or questions, feel free to leave them in the comments section. You can also start a new thread on the Amazon WorkDocs forum.