AWS for Industries
Accelerate data access for your AV/ADAS applications using Mountpoint for Amazon S3
In the realm of autonomous vehicles and advanced driver assistance systems (AV/ADAS), automotive original equipment manufacturers (OEMs), Tier 1 suppliers, and independent software vendors (ISVs) all process vast amounts of data recorded by test vehicles in the field. That recorded data can reach up to 300 TB per vehicle per day. When the recorded data is uploaded to Amazon Simple Storage Service (Amazon S3), object storage built to retrieve any amount of data from anywhere, it is processed by downstream applications. One of the key steps in that process is the extraction of metadata from the recordings. Examples of recording metadata are identified objects such as cars, pedestrians, and traffic signs, which result from image extraction jobs run over each recorded video frame. This recording metadata makes it simple for engineers to search, test, validate, and re-simulate the recorded data while focusing on the most relevant parts. It also includes time stamp markers that help developers navigate to specific sections of the recordings, enabling more efficient analysis and development of these advanced systems.
Developers accessing those recordings usually store copies on their local machines, virtual machine drives, or other remote compute environments. Local copies are often needed because processing applications are usually not designed for direct use of block storage or interaction with the Amazon S3 API. Although that approach works, it has a disadvantage: developers must fully download data snippets, often adding up to dozens of gigabytes or even terabytes, to their machines or servers, so time is spent before the processing of the data even begins. From a developer's perspective, visualizing a recording with that approach means waiting for the full download to finish before playback can start. Another disadvantage is the need for local drives with sufficient capacity for temporary storage. Customers may therefore need to overprovision local storage and network bandwidth to handle the large amount of data, which can result in additional costs. In sum, developers must invest time and resources in receiving the data before doing any of the actual processing.
Mountpoint for Amazon S3, a generally available solution backed by support from Amazon Web Services (AWS), helps customers address some of the above challenges. Mountpoint for Amazon S3 empowers your developers to focus on algorithm development rather than data access. It simplifies the developer experience and eliminates the need for downloading local copies by translating local file system operations into REST API calls on Amazon S3.
In this blog post, we present two examples that illustrate the use of Mountpoint for Amazon S3 alongside ready-to-run code that you can test yourself. In each example, we run the code both with and without Mountpoint for Amazon S3 and compare the results. The code modifications necessary for Mountpoint for Amazon S3 are minimal and involve just a few lines. Finally, we discuss performance metrics that showcase the improvements you may be able to achieve from using Mountpoint for Amazon S3.
Mountpoint for Amazon S3 in action
Mountpoint for Amazon S3 is an open-source file client that delivers high-throughput access to Amazon S3, which can help customers lower their processing times and compute costs for data lake applications. Mountpoint for Amazon S3 translates local file system API calls to Amazon S3 object API calls such as GET and PUT. Mountpoint for Amazon S3 is ideal for workloads that read large datasets (terabytes to petabytes in size) and require the elasticity and high throughput of Amazon S3. Mountpoint for Amazon S3 is backed by AWS support, which means that customers with AWS Business Support or AWS Enterprise Support are able to get 24/7 access to cloud support engineers. For more details, see the Mountpoint for Amazon S3 documentation.
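To give a sense of how this translation works in practice, the following illustrative commands show how everyday file operations on a mounted bucket map to Amazon S3 object API calls. The mount directory and file names here are hypothetical; the mount procedure itself is covered step by step later in this post.

```bash
# Illustrative only; assumes a bucket has already been mounted at ~/mnt-s3
# (the mount procedure is shown later in this post). File names are hypothetical.
cat ~/mnt-s3/recordings/file-16gb.bag > /dev/null   # reads become S3 GET requests (ranged reads for large objects)
cp ./results.json ~/mnt-s3/output/results.json      # sequential writes to new objects become S3 PUT requests
ls -lh ~/mnt-s3/recordings/                         # directory listings become S3 LIST requests
```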
Let’s discuss two examples that show the potential benefits of Mountpoint for Amazon S3 in real-world scenarios:
1. For a first example, we calculate the SHA-1 hashes of large files stored in Amazon S3. Although not automotive-specific, this straightforward use case involves a process that performs a read operation on a large file. We first use the Amazon S3 copy and hash approach. In that approach, we copy the file from the Amazon S3 bucket to our local file system and then calculate the hash using the local copy. Then we use Mountpoint for Amazon S3 to mount the same Amazon S3 bucket to the local file system. Finally, we calculate the same file hash directly from the mounted Amazon S3 bucket and compare the run times of the two approaches. Using Mountpoint for Amazon S3, we observed the hash calculation completing up to 5 times faster.
2. In the second example, we consider an AV/ADAS-specific use case: extracting images from large files produced by a vehicle video recording. The recordings were created using the dSPACE AUTERA logger and stored in the dSPACE RTMaps format. We show how using Mountpoint for Amazon S3 can speed up image extraction in this use case by a factor of 17.
The following steps walk you through how to explore both scenarios using your own AWS account.
Example 1: Calculation of an SHA-1 hash for a file stored in Amazon S3
To familiarize you with how to use Mountpoint for Amazon S3, we start with a simple example. We calculate the SHA-1 hash of large rosbag files (with a file size larger than 5 GB) that are stored in Amazon S3. (Note that you could use any large file for this example.) The following two approaches are used:
1. Approach – Amazon S3 copy and hash:
We calculate the SHA-1 hash for a file that resides on the local storage of our compute environment. That means that we first download the file to the local storage of our run environment and then calculate the SHA-1 hash. We call this the Amazon S3 copy and hash approach. The following diagram shows the architecture:
Figure 1. The Amazon S3 copy and hash approach
2. Approach – Amazon S3 mount and hash:
We calculate the SHA-1 hash for the same file using Mountpoint for Amazon S3. That means that we mount the Amazon S3 bucket on our local file system and access the file through the mount. We call this the Amazon S3 mount and hash approach. The following diagram shows the architecture:
Figure 2. The Amazon S3 mount and hash approach
We then compare the run time of the two different approaches and show the performance gain achieved by using Mountpoint for Amazon S3.
Prerequisites
To replicate this walk-through, you will need the following:
- An AWS account with administrator access.
- An Amazon S3 bucket in that AWS account with a file larger than 5 GB. In our demo setup, we are using the Amazon S3 bucket shown below.
Figure 3. The sample Amazon S3 bucket used in this blog post
- A Linux environment in which to run Mountpoint for Amazon S3. For the purpose of this demo, we use a Linux instance of Amazon Elastic Compute Cloud (Amazon EC2), secure and resizable compute capacity for virtually any workload.
Make sure that your Amazon EC2 instance has an IAM role attached with sufficient Amazon S3 permissions so that the command aws s3 cp can run successfully.
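As a quick check, you can verify from the instance that the attached role can read your bucket before moving on. The commands below use the same placeholders as the rest of this post.

```bash
# Confirm which role the instance is using and that it can read the bucket.
aws sts get-caller-identity
aws s3 ls s3://[YOUR_S3_BUCKET_NAME]

# Dry run of the copy used in the first approach (no data is transferred).
aws s3 cp s3://[YOUR_S3_BUCKET_NAME]/[YOUR_FILE_NAME] ./ --dryrun
```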
Approach 1: Calculating SHA-1 hash of a large file using the Amazon S3 copy and hash approach
The first approach involves two steps:
1. Copy the file from Amazon S3 to the local file system of your Amazon EC2 instance.
2. Calculate the SHA-1 hash on the local file copy.
We use the following bash script to calculate the SHA-1 hash for our demo file, file-16gb.bag. Copy and paste the following script and save it as 'sha1-hash.sh' on your Amazon EC2 machine. Make sure to replace the placeholders [YOUR_S3_BUCKET_NAME] and [YOUR_FILE_NAME] with your actual bucket and file names.
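A minimal sketch of such a script, using the placeholders above, is shown below; it copies the object with aws s3 cp, hashes the local copy with sha1sum, and prints the elapsed time in milliseconds.

```bash
#!/bin/bash
# sha1-hash.sh - minimal sketch of the Amazon S3 copy and hash approach.
# Replace the placeholders with your actual bucket and file names before running.
S3_BUCKET="[YOUR_S3_BUCKET_NAME]"
FILE_NAME="[YOUR_FILE_NAME]"

START=$(date +%s%3N)   # start time in milliseconds

# Step 1: copy the object from Amazon S3 to the local file system.
aws s3 cp "s3://${S3_BUCKET}/${FILE_NAME}" "./${FILE_NAME}"

# Step 2: calculate the SHA-1 hash on the local copy.
sha1sum "./${FILE_NAME}"

END=$(date +%s%3N)
echo "Amazon S3 file download and SHA-1 hash calculation took $((END - START)) milliseconds"
```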
Once the placeholders are populated with the right values, we run the script for the file file-16gb.bag. The output is shown below:
As shown, the Amazon S3 file download and SHA-1 hash calculation took 82194 milliseconds (82.2 seconds). Repeat the process for as many files of different sizes as you like. Note the run time for each file. In our example Amazon S3 bucket, the results are as follows:
Table 1. SHA-1 hash calculation times using the Amazon S3 copy and hash approach
Approach 2: Calculating SHA-1 hash of a large file using Mountpoint for Amazon S3
Now we want to calculate the same SHA-1 hashes, but this time using Mountpoint for Amazon S3. The process will include the following two steps:
1. Mount the Amazon S3 bucket in which the file is stored using the Mountpoint for Amazon S3 client.
2. Calculate the SHA-1 hash directly on the mounted object without locally copying/downloading the object.
Before we can start, we need to install Mountpoint for Amazon S3 on our system by following the Mountpoint for Amazon S3 README.md. We recommend that you use an Amazon EC2 Linux instance (x86_64) for a standardized run environment. Once you have access to the console of your Amazon EC2 instance, use the following command to install:
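On an x86_64 Amazon Linux instance, one way to install the client at the time of writing is to download and install the RPM package referenced in the README. Package URLs and names may change, so check the README for your distribution and architecture.

```bash
# Download the latest x86_64 RPM package and install it (Amazon Linux / RPM-based distributions).
wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.rpm
sudo yum install -y ./mount-s3.rpm
```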
Test whether the installation has been successful by running mount-s3 --help. You should see the following output:
Figure 4. Mountpoint for Amazon S3 after successful installation
Now we can mount our Amazon S3 bucket to our file system using the CLI:
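A sketch of the commands, assuming a mount directory of ~/mnt-s3 and the bucket name placeholder used earlier:

```bash
# Create a local directory and mount the bucket with Mountpoint for Amazon S3.
mkdir -p ~/mnt-s3
mount-s3 [YOUR_S3_BUCKET_NAME] ~/mnt-s3

# List the mounted objects to verify that the bucket contents are visible.
ls -lh ~/mnt-s3
```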
Figure 5. Amazon S3 files mounted on the local file system
Now that we have the file(s) from Amazon S3 mounted into our file system, we can calculate the SHA-1 hash(es) without having to download anything.
We can start the hash calculation by using the following bash script to calculate the SHA-1 hash for our demo file, file-16gb.bag. Copy and paste the following script and save it as “sha1-hash-with-mountpoint.sh” in your environment. Make sure to replace the placeholder [YOUR_FILE_NAME] with your actual file name.
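A minimal sketch of such a script is shown below; it assumes the bucket is mounted at ~/mnt-s3 as in the previous step and hashes the file directly through the mount.

```bash
#!/bin/bash
# sha1-hash-with-mountpoint.sh - minimal sketch of the Amazon S3 mount and hash approach.
# Assumes the bucket is already mounted at ~/mnt-s3; replace the file name placeholder.
MOUNT_DIR="${HOME}/mnt-s3"
FILE_NAME="[YOUR_FILE_NAME]"

START=$(date +%s%3N)   # start time in milliseconds

# Calculate the SHA-1 hash directly on the mounted object; no local copy is created.
sha1sum "${MOUNT_DIR}/${FILE_NAME}"

END=$(date +%s%3N)
echo "SHA-1 hash calculation took $((END - START)) milliseconds"
```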
Once the placeholders are populated with the right values, we run the script for the file file-16gb.bag. The output is shown below:
As shown, the SHA-1 hash is calculated directly on the mounted file, with no download step. Repeat the process for as many files of different sizes as you like. Note the run time for each file. For our sample Amazon S3 bucket, the results are as follows:
Table 2. SHA-1 hash calculation times using Mountpoint for Amazon S3
Comparing results
The comparison of the results of the two different SHA-1 hash calculation approaches is shown in the following table:
Table 3. Performance increase in SHA-1 hash calculation using Mountpoint for Amazon S3
Table 3 shows that using Mountpoint for Amazon S3 can provide a significant performance improvement, with speedups of up to 5 times. It also helps save local disk space because files don't need to be downloaded. The following graph visualizes the results:
Figure 6. Performance increase in SHA-1 hash calculation speed using Mountpoint for Amazon S3
Cleaning up
To avoid incurring future charges, delete the Amazon S3 bucket you created for this example. If you used an Amazon EC2 instance as your compute environment, make sure to terminate it.
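For example, assuming the mount directory used in this walk-through, the cleanup can look like the following sketch. The instance ID placeholder is hypothetical; replace it and the bucket placeholder with your own values.

```bash
# Unmount the bucket from the local file system.
umount ~/mnt-s3

# Delete the demo objects and the bucket (only if the bucket was created just for this example).
aws s3 rb s3://[YOUR_S3_BUCKET_NAME] --force

# Terminate the Amazon EC2 instance used as the compute environment.
aws ec2 terminate-instances --instance-ids [YOUR_INSTANCE_ID]
```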
Example 2: Image extraction from a large recording file stored in Amazon S3
In the context of ADAS, extracting images from large recording files is a common task. In fact, such image extraction is often the first step in a complex ADAS workflow. The following figure shows a typical ADAS workflow, with image extraction emphasized in the dotted orange box.
Figure 7. Simplified sample ADAS workflow
The simplified workflow shown in figure 7 involves the extraction of images, image anonymization, the creation of video snippets for visualization by users, and the extraction of recording metadata such as GPS data. As a further step in the workflow, inferences are run on those images to identify objects, which can later be aggregated on a timeline so that users can identify scenarios. Those scenarios can then be run in a simulation engine, and the results are stored in Amazon S3.
One way of implementing this image extraction process is shown in the following figure, using a scalable AWS-native approach. The source code of this implementation can be found in the repository of this blog post, and deployment instructions are located in the README.md file. To deploy it, you need administrative access to an AWS account.
Figure 8. Detailed view of the image extraction process
The raw recordings are stored in an Amazon S3 bucket. As soon as a recording is uploaded to the Amazon S3 bucket, an event is triggered through Amazon EventBridge, a service that makes it simpler to build event-driven applications at scale. That event in turn starts a step function in AWS Step Functions, visual workflows for distributed applications. The step function then triggers a batch job in AWS Batch—a fully managed batch computing service—which does the image extraction frame by frame. Each extracted image is stored in the output Amazon S3 bucket.
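The full infrastructure is defined in the repository, but as a rough sketch of how the trigger fits together, the AWS CLI commands below enable EventBridge notifications on the input bucket and route Object Created events to a Step Functions state machine. The bucket name, rule name, state machine ARN, and role ARN are hypothetical placeholders for illustration.

```bash
# Hypothetical names and ARNs for illustration only.
RAW_BUCKET="my-raw-recordings-bucket"
STATE_MACHINE_ARN="arn:aws:states:eu-central-1:111122223333:stateMachine:image-extraction"
EVENTS_ROLE_ARN="arn:aws:iam::111122223333:role/eventbridge-start-stepfunctions"

# 1. Send the bucket's object-level events to Amazon EventBridge.
aws s3api put-bucket-notification-configuration \
  --bucket "${RAW_BUCKET}" \
  --notification-configuration '{"EventBridgeConfiguration": {}}'

# 2. Create a rule that matches newly uploaded recordings in the bucket.
aws events put-rule \
  --name new-recording-uploaded \
  --event-pattern "{\"source\":[\"aws.s3\"],\"detail-type\":[\"Object Created\"],\"detail\":{\"bucket\":{\"name\":[\"${RAW_BUCKET}\"]}}}"

# 3. Start the image extraction state machine for every matching event.
aws events put-targets \
  --rule new-recording-uploaded \
  --targets "Id=start-image-extraction,Arn=${STATE_MACHINE_ARN},RoleArn=${EVENTS_ROLE_ARN}"
```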
In this blog post, we use raw dSPACE RTMaps recordings (IDX/JSQ files). A high-level overview of the dSPACE RTMaps recording format is shown in the following figure. The JSQ file stores the individual images in one large binary, and the IDX file stores pointers to the individual images within the JSQ file. An image extraction process first reads the IDX file and then extracts each image from the JSQ file by reading from the pointer location stored in the IDX file. A sample IDX/JSQ file pair can be found in the sample data folder of this blog post's repository.
Figure 9. High-level overview of the dSPACE RTMaps recordings format
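To make the pointer-based read concrete, the sketch below reads the bytes of a single image from a JSQ file accessed through a Mountpoint for Amazon S3 mount. The path, offset, and length are hypothetical placeholders standing in for values that would normally come from the IDX file; because only the requested byte range is read, Mountpoint for Amazon S3 fetches just those bytes from Amazon S3 instead of the whole recording.

```bash
#!/bin/bash
# Illustrative only: read one image's bytes from a mounted JSQ file.
# OFFSET and LENGTH are placeholders for values taken from the IDX file.
JSQ_FILE="${HOME}/mnt-s3/recordings/sample.jsq"   # hypothetical path under the mount
OFFSET=104857600                                  # byte offset of the image inside the JSQ file
LENGTH=2097152                                    # size of the image in bytes

# Seek to OFFSET and read LENGTH bytes; through the mount this results in a ranged
# GET request, so only the requested image bytes are transferred from Amazon S3.
dd if="${JSQ_FILE}" of=extracted-image.bin bs=1M \
   iflag=skip_bytes,count_bytes skip="${OFFSET}" count="${LENGTH}" status=none
```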
In the context of AV/ADAS, there are usually two image extraction patterns:
1. Access pattern – Linear full image extraction: Linearly extract all images from index 0 until the final index value.
2. Access pattern – Partial image extraction: Extract only individual images or a certain range of images stored in the JSQ file. This pattern is especially useful if you want to extract specific images or image ranges from a large recording for further investigation because you avoid extraction of the entire recording. For example, when doing a video playback, the user often knows the time stamp they want to start playback from, and it is advantageous to jump directly to a specific part of the recording.
The two access patterns are shown in the following figure for an example JSQ file with 100,000 stored images:
Figure 10. Full and partial image extraction approaches
In the following section, we measure the image extraction time for multiple dSPACE RTMaps recordings of different sizes for the partial image extraction approach.
Benchmarking the image extraction process using the Amazon S3 copy and extract and Amazon S3 mount and extract approaches
Now let’s assume that we want to do linear extraction for 10 percent of all images starting from the middle index of the JSQ file. Once again, we have two approaches at hand that we compare for performance at the end:
1. Approach – Amazon S3 copy and extract:
To extract the images, we first download the file to the local file system of our run environment. We then start the image extraction process. We call this the Amazon S3 copy and extract approach.
2. Approach – Amazon S3 mount and extract:
We mount the Amazon S3 bucket into the local file system of our run environment and access the file via the mount. When extracting the images, all file operations are performed on the mounted Amazon S3 file. At no point is the file stored on the local file system. We call this the Amazon S3 mount and extract approach. The image extraction script for this approach can be viewed in the ADDF image extraction module.
We ran both approaches for JSQ files of size 0.35 GB, 13 GB, 21 GB, and 35 GB. You can run the image extraction yourself for the Amazon S3 mount and extract approach using the sample JSQ file provided in the GitHub repository of this blog post. Just follow the detailed instructions in the README.md file. All that you need for deployment is administrator access to an AWS account.
Comparing results
The comparison of the results for different image extraction runs on JSQ files of different sizes is shown in the following table.
Table 4. Performance increase in image extraction using Mountpoint for Amazon S3
Table 4 shows that using Mountpoint for Amazon S3 provides a significant performance improvement, with speedups of up to 17 times. It also saves local disk space because no files need to be downloaded. The larger the file to be processed, the larger the performance gain, because a larger file requires more time for the initial download. When a user wants to play back data with a visualization tool, the data is accessed much more quickly, resulting in a better user experience. The following graph visualizes the results:
Figure 11. Performance increase in image extraction using Mountpoint for Amazon S3
Cleaning up
To avoid incurring future charges, delete your deployed image extraction solution. For details, refer to the cleanup section in the README.md.
Conclusion
In this blog post, we showcased how to use Mountpoint for Amazon S3 to process large files without the need to download them. The first example involved a simple file hash calculation. The second example showed how to extract images from large recording files in a typical AV/ADAS use case. By comparing the version that uses Mountpoint for Amazon S3 with the version that doesn't, we demonstrated performance gains of up to 17 times, achieved without any additional optimizations. We hope you will explore this new approach in your code and invite you to share the results with us. Mountpoint for Amazon S3 can be applied to a wide range of applications and industries that involve file processing tasks.
We welcome your feedback on Mountpoint for Amazon S3 and any ideas you might have for features you would like to see in the future. For more information, refer to the Mountpoint for Amazon S3 GitHub repository.