AWS for M&E Blog
Analyzing media files using ffprobe in AWS Lambda
Description
Managing an ever-changing movie library can be difficult, especially when many assets are added in a short period of time.
When dealing with a large number of video assets, it is very efficient to use a NoSQL database that provides centralized access to asset details like title, location, and metadata. Amazon DynamoDB—a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale—is a great match for this requirement.
To avoid manual analysis of each asset, this how-to guide describes the steps to invoke an automatic extraction of media asset metadata through ffprobe (part of the FFmpeg project) using the following AWS services:
- Amazon DynamoDB to store asset details
- AWS Lambda—a serverless, event-driven compute service—to run ffprobe on the media file and update the Amazon DynamoDB entry
- Amazon Simple Storage Service (Amazon S3)—an object storage service that offers industry-leading scalability, data availability, security, and performance—to store the asset file
Each asset detail—like identifier (ID), title, or location—is written to an Amazon DynamoDB table. Inserting a new asset into the Amazon DynamoDB table initiates a Lambda function that reads the media file, captures file and video metadata (the analysis), and writes the results back into the Amazon DynamoDB table.
The Lambda function uses the Python 3.8 runtime.
IMPORTANT LEGAL NOTICE: Before we start, make sure you are familiar with the terms of the FFmpeg license and legal considerations as listed here. In addition, the FFmpeg static build used in this demo is licensed under the GNU General Public License version 3 (GPLv3) as mentioned here.
Prerequisites
To complete this how-to guide, you will need access to the following:
- A Linux system to type commands (Shell and Python)
- AWS Lambda
- Amazon DynamoDB
- Amazon S3
- AWS Identity and Access Management (AWS IAM), which lets you manage access to AWS services and resources securely
Get started
FFmpeg
We need to download the FFmpeg project. We chose the static build so that we do not miss any libraries. Prepare a ZIP file containing the ffprobe binary using these commands:
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz.md5
md5sum -c ffmpeg-release-amd64-static.tar.xz.md5 && \
mkdir ffmpeg-release-amd64-static
tar xvf ffmpeg-release-amd64-static.tar.xz -C ffmpeg-release-amd64-static
mkdir -p ffprobe/bin
cp ffmpeg-release-amd64-static/*/ffprobe ffprobe/bin/
cd ffprobe
zip -9 -r ../ffprobe.zip .
This can be done from a Linux system, using either your local computer or an Amazon Elastic Compute Cloud (Amazon EC2) instance. Amazon EC2 is a web service that provides secure, resizable compute capacity in the cloud.
We will use this ZIP archive to create an AWS Lambda layer.
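If you prefer to create the layer from the command line instead of the console, the following boto3 sketch shows the idea; the layer name ffprobe is our choice, and it assumes ffprobe.zip is small enough for a direct upload (otherwise, upload it to Amazon S3 first and reference it through S3Bucket and S3Key in the Content parameter):

import boto3

lambda_client = boto3.client('lambda')

# Publish the ZIP archive as a new layer version ("ffprobe" is our chosen name)
with open('ffprobe.zip', 'rb') as f:
    response = lambda_client.publish_layer_version(
        LayerName='ffprobe',
        Description='Static ffprobe binary under bin/',
        Content={'ZipFile': f.read()},
        CompatibleRuntimes=['python3.8']
    )

# Keep the layer version ARN; it identifies the layer you attach to the function
print(response['LayerVersionArn'])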
AWS IAM
We need an AWS Identity and Access Management (IAM) user that can insert our asset movie ID and title, Amazon S3 bucket, and object name into Amazon DynamoDB. Adding an entry in Amazon DynamoDB can be done using this example. The analysis itself is performed through our Lambda function, which needs its own execution role.
Our Lambda function requires the following permissions to manage resources related to our Amazon DynamoDB stream, create log groups and log streams and push log events into Amazon CloudWatch (a monitoring and observability service), and get the Amazon S3 object.
If you require further guidance to create policies or roles, please read this documentation:
- https://docs.thinkwithwp.com/IAM/latest/UserGuide/access_policies_create-console.html
- https://docs.thinkwithwp.com/IAM/latest/UserGuide/id_roles_create_for-service.html
Add these to your function’s execution role (called “lambda_media_ddb” in this example); a boto3 sketch that attaches them follows the list:
- dynamodb:DescribeStream
- dynamodb:GetRecords
- dynamodb:GetShardIterator
- dynamodb:ListStreams
- dynamodb:UpdateItem
- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents
- s3:GetObject
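As a rough illustration, the following boto3 sketch attaches these permissions to the role as an inline policy. The policy name is a placeholder, and the wildcard resource is only for brevity; scope it down to your own stream, log group, and bucket ARNs:

import json
import boto3

iam_client = boto3.client('iam')

# Inline policy combining the permissions listed above.
# 'Resource': '*' is a shortcut for this sketch; restrict it to your own
# DynamoDB stream, CloudWatch Logs group, and S3 bucket ARNs.
policy_document = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': [
            'dynamodb:DescribeStream',
            'dynamodb:GetRecords',
            'dynamodb:GetShardIterator',
            'dynamodb:ListStreams',
            'dynamodb:UpdateItem',
            'logs:CreateLogGroup',
            'logs:CreateLogStream',
            'logs:PutLogEvents',
            's3:GetObject'
        ],
        'Resource': '*'
    }]
}

iam_client.put_role_policy(
    RoleName='lambda_media_ddb',
    PolicyName='ffprobe-analysis-permissions',  # placeholder name
    PolicyDocument=json.dumps(policy_document)
)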
Amazon DynamoDB
- Create an Amazon DynamoDB table, which we will name “my_movies” in this example.
- Choose a primary key. In this example, we have chosen “movie_id” and set the type to Number.
Amazon DynamoDB Streams captures a time-ordered sequence of item-level modifications in any Amazon DynamoDB table. Applications can access this log and—in near real time—view the data items as they appeared before and after they were modified. We will use this stream to initiate our Lambda function and use the stream records as input to the Lambda function.
- Activate the stream on the table, and select the type New and old images in the stream management menu.
- From the DynamoDB stream details section, copy the latest stream ARN (Amazon Resource Name), which is required to initiate the Lambda function; the boto3 sketch after these steps also shows how to retrieve it.
- When you upload an asset to your Amazon S3 bucket, manually insert a new row in your Amazon DynamoDB table, my_movies, following this example.
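If you prefer to script the table creation, here is a minimal boto3 sketch; the on-demand (PAY_PER_REQUEST) capacity mode is an assumption, so pick whatever mode fits your workload. The response also contains the latest stream ARN mentioned above:

import boto3

dynamodb_client = boto3.client('dynamodb')

# Create the table with a numeric movie_id partition key and a stream
# capturing both new and old images. PAY_PER_REQUEST is an assumption.
response = dynamodb_client.create_table(
    TableName='my_movies',
    AttributeDefinitions=[{'AttributeName': 'movie_id', 'AttributeType': 'N'}],
    KeySchema=[{'AttributeName': 'movie_id', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST',
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'
    }
)

# The latest stream ARN, required to initiate the Lambda function
print(response['TableDescription']['LatestStreamArn'])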
Lambda
- Create a Lambda layer, and upload ffprobe.zip into it.
- Create a Lambda function using the Python 3.8 runtime.
In this example, our test asset is 34 MB, so it is not necessary to provide a large amount of memory to our function. However, more memory could be required for large assets.
- Attach the correct role (lambda_media_ddb).
Once you’ve created the function, in the Designer section, follow these steps:
- Click Layers.
- Then click Add a layer.
- Choose Custom layers, and select ffprobe and the correct version; then click Add.
- In Designer, click Add trigger.
- Then select DynamoDB, and paste the latest stream ARN into the DynamoDB table field. Validate by clicking Add.
We are finished with the designer part.
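The same trigger can also be configured programmatically. Here is a minimal boto3 sketch; ffprobe_analysis is a placeholder function name, and the stream ARN is the one copied earlier:

import boto3

lambda_client = boto3.client('lambda')

# Map the DynamoDB stream to the Lambda function.
# Replace the placeholder ARN and function name with your own values.
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:dynamodb:eu-west-1:111122223333:table/my_movies/stream/2021-02-09T13:37:00.000',
    FunctionName='ffprobe_analysis',
    StartingPosition='LATEST',
    BatchSize=100
)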
- In the Edit basic settings menu, define a relevant timeout to give ffprobe enough time to retrieve the file, analyze it, and update the Amazon DynamoDB table entry.
- For file sizes less than 790 MB, set the timeout to 1 second and the memory to 200 MB.
- For larger files, confirm whether these values are sufficient and adapt them to your use case.
As a note, our tests (which included files less than 790 MB) took less than 1 second and used less than 200 MB of memory. You can find out more about memory and duration considerations later in this post.
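These settings can also be applied programmatically; a minimal boto3 sketch, again with ffprobe_analysis as a placeholder function name:

import boto3

lambda_client = boto3.client('lambda')

# 1 second and 200 MB were enough for our sub-790 MB test files;
# raise both values for larger assets.
lambda_client.update_function_configuration(
    FunctionName='ffprobe_analysis',  # placeholder name
    Timeout=1,
    MemorySize=200
)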
- Now copy and paste the following Python code into your Lambda function, and deploy it:
import json
import subprocess
import boto3

SIGNED_URL_TIMEOUT = 60

def lambda_handler(event, context):
    error = False
    dynamodb_client = boto3.client('dynamodb')
    s3_client = boto3.client('s3')
    steps_messages = dict()

    for record in event['Records']:
        # Only process newly inserted items; the update this function makes
        # later produces a MODIFY event, which is skipped here
        if not record['eventName'] == 'INSERT':
            print('Not an insert, skipping')
            continue

        movie_id = record['dynamodb']['NewImage']['movie_id']['N']
        s3_source_bucket = record['dynamodb']['NewImage']['S3_bucket']['S']
        s3_source_key = record['dynamodb']['NewImage']['S3_object']['S']
        steps_messages[movie_id] = dict()

        # Generate a short-lived presigned URL so ffprobe can read the object
        try:
            s3_source_signed_url = s3_client.generate_presigned_url(
                'get_object',
                Params={'Bucket': s3_source_bucket, 'Key': s3_source_key},
                ExpiresIn=SIGNED_URL_TIMEOUT)
            message = 'Success'
        except Exception:
            message = 'Failure'
            error = True
        steps_messages[movie_id]['S3_signed_URL'] = message
        if message == 'Failure':
            # No URL to analyze, move on to the next record
            continue

        # Run ffprobe from the layer (layer content is mounted under /opt)
        ffprobe = subprocess.run(
            ['/opt/bin/ffprobe', '-loglevel', 'error', '-show_streams',
             s3_source_signed_url, '-print_format', 'json'],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE)

        if ffprobe.returncode == 0:
            steps_messages[movie_id]['FFProbe_analysis'] = 'Success'
            # Store the ffprobe JSON output in the table entry
            try:
                dynamodb_client.update_item(
                    TableName='my_movies',
                    Key={
                        'movie_id': {
                            'N': movie_id
                        }
                    },
                    UpdateExpression="set ffprobe_output=:f",
                    ExpressionAttributeValues={
                        ':f': {'S': ffprobe.stdout.decode('utf-8')}
                    }
                )
                message = 'Success'
            except Exception:
                message = 'Failure'
                error = True
            steps_messages[movie_id]['DDB_insert'] = message
        else:
            error = True
            steps_messages[movie_id]['FFProbe_analysis'] = 'Failure'

    statusCode = 500 if error else 200
    return {
        'statusCode': statusCode,
        'body': json.dumps(steps_messages)
    }
Let’s test
- Insert the movie details: a unique ID, a title, an Amazon S3 bucket used to store the asset, and the Amazon S3 object name.
In the following example, we used the Python boto3 module to connect to Amazon DynamoDB and insert the data:
import boto3

dynamodb = boto3.resource('dynamodb')

def put_movie(movie_id, movie_title, S3_bucket, S3_object):
    table = dynamodb.Table('my_movies')
    response = table.put_item(
        Item={
            'movie_id': movie_id,
            'movie_title': movie_title,
            'S3_bucket': S3_bucket,
            'S3_object': S3_object
        }
    )
    return response

if __name__ == '__main__':
    response = put_movie(1337, "Agent 327, Operation Barbershop - Trailer",
                         "my-movies-bucket", "Agent_327-Operation_Barbershop-Trailer.webm")
    print(response)
You may receive something like this:
{'ResponseMetadata': {'RequestId': 'AREQUESTID',
                      'HTTPStatusCode': 200,
                      'HTTPHeaders': {'server': 'Server',
                                      'date': 'Tue, 09 Feb 2021 13:37:00 GMT',
                                      'content-type': 'application/x-amz-json-1.0',
                                      'content-length': '2',
                                      'connection': 'keep-alive',
                                      'x-amzn-requestid': 'ANAMAZINGREQUESTID',
                                      'x-amz-crc32': '1234567890'},
                      'RetryAttempts': 0}}
- Check whether your asset is present in the Amazon DynamoDB table:
After a few seconds (depending on asset size), refresh the table view: a new ffprobe_output column is available.
Our Lambda function did the job! The asset details are now available in the ffprobe_output column in JSON format.
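You can also verify the result programmatically. Here is a minimal boto3 sketch that reads back the item inserted in the example above:

import boto3

table = boto3.resource('dynamodb').Table('my_movies')

# Fetch the item and print the ffprobe analysis written by the Lambda function
item = table.get_item(Key={'movie_id': 1337}).get('Item', {})
print(item.get('ffprobe_output', 'ffprobe_output not present yet'))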
Memory and duration considerations
We made some tests with four different files, ranging from about 20 MB to about 790 MB, to find out the Lambda memory usage:
Asset size | Lambda duration | Lambda memory usage |
18.8 MB | 186 ms | 117 MB |
109.5 MB | 260 ms | 122 MB |
643.2 MB | 816 ms | 146 MB |
793.3 MB | 224 ms | 146 MB |
According to our tests, memory usage does not follow file size. This is because ffprobe reads the metadata at the beginning of the file and does not need to read the whole file. As a result, the maximum source file size is not limited by the Lambda function memory limit (currently 10 GB).
The amount of data downloaded is limited to the source file metadata, so the duration depends only on the transfer rate from the Amazon S3 bucket plus the ffprobe runtime, which is very short.
We advise that you use gateway virtual private cloud (VPC) endpoints for Amazon S3 and Amazon DynamoDB to lower the latency. This will also increase the reliability because the data will travel only on the AWS backbone. By keeping your asset content in the AWS network, you can lower the risk of your data being intercepted. Additionally, there are no data transfer charges between Amazon S3 and any AWS service for transfers that occur in the same AWS Region.
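Note that gateway VPC endpoints only apply if you attach the Lambda function to a VPC. Here is a minimal boto3 sketch with placeholder VPC and route table IDs, using the eu-west-1 service names as an example:

import boto3

ec2_client = boto3.client('ec2')

# Placeholder IDs; replace with your own VPC and route tables
vpc_id = 'vpc-0123456789abcdef0'
route_table_ids = ['rtb-0123456789abcdef0']

# One gateway endpoint per service, in the Region where the VPC lives
for service in ('s3', 'dynamodb'):
    ec2_client.create_vpc_endpoint(
        VpcEndpointType='Gateway',
        VpcId=vpc_id,
        ServiceName='com.amazonaws.eu-west-1.' + service,
        RouteTableIds=route_table_ids
    )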
You can check your Amazon CloudWatch logs to confirm the real duration and memory consumption of your assets and adjust them to your needs.
These logs also show a second invocation of the Lambda function for each asset. Our Lambda function updates the Amazon DynamoDB table, causing the Amazon DynamoDB stream to send a new notification to the Lambda function. Because our Lambda function filters on the event name (only handling INSERT events), this second invocation is expected, is very limited in cost (less than 10 ms in duration), and does not cause an infinite loop.
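If you want to pull the duration and memory figures programmatically rather than through the console, here is a minimal boto3 sketch; the log group follows the standard /aws/lambda/<function-name> pattern, and ffprobe_analysis is again a placeholder function name:

import boto3

logs_client = boto3.client('logs')

# REPORT lines contain the billed duration and max memory used per invocation
response = logs_client.filter_log_events(
    logGroupName='/aws/lambda/ffprobe_analysis',  # placeholder function name
    filterPattern='REPORT'
)

for event in response['events']:
    print(event['message'].strip())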
These tests were performed in the AWS Europe (Ireland) Region and the AWS Europe (Paris) Region (eu-west-1 and eu-west-3). The required services (Amazon S3, AWS Lambda, and Amazon DynamoDB) are deployed in all Regions, so these instructions should work in any Region.
Conclusion
We created an automated workflow that is invoked dynamically by adding a new asset into an Amazon DynamoDB table. This workflow will read the asset from its Amazon S3 location, extract metadata, format the metadata into a JSON structure, and update the asset Amazon DynamoDB entry with this JSON structure.
You can now deploy this workflow into your AWS account and start scanning your assets.
Find out how to make Amazon DynamoDB work for you!