Unlock the value of Media Archives with AWS
Efficiently managing vast media archives remains a significant challenge for content owners, particularly when insufficient metadata hinders quick asset discovery and reuse. Traditional approaches to archive enrichment, whether through manual effort, machine learning (ML), or both, often prove prohibitively expensive or time-consuming.
In this post, we describe an innovative approach that combines generative AI with ML to significantly reduce costs and enhance the efficiency of media asset enrichment and management. While our examples focus on news archives, the same strategies can be applied to other content types.
Introduction
The wide range of events a news organization covers is reflected in its content library: political events, wars, interviews, accidents, crimes, segments about the economy, health, celebrities, weather, sports teams…a growing list of events and stories that capture our history and inform the present day. News organizations rely on their content libraries to provide context to current events:
- The new political candidate: What did (s)he say about immigration in previous speeches and interviews?
- The incoming “storm of the decade”: How was the community impacted during the last major storm?
- The athlete that was just drafted to a top team: Do we have good footage from his or her high school games?
Unfortunately, for every recording of a pivotal speech, there’s likely footage of an empty podium. The incredible storm footage from twenty years ago may be submerged in a sea of mundane weather B-roll. For every incredible shot of a high school star making a goal, there are multiple nondescript wide-angle clips.
Metadata is an essential pillar of content search and discoverability, yet the fast-paced nature of news production often leads to sparse metadata. It’s not unusual to see XDCAM discs labeled “Storm 07/03/04”, files named “CRASH_REC”, and legacy database exports with timecodes and cryptic comments. Some news organizations have staff dedicated to wrangling metadata from various sources and logging clips; others are now leveraging AI/ML to help automatically generate metadata. These approaches may provide great ROI for high-value content, but how do we sort through all the chaff so we aren’t wasting time and resources logging empty podium shots or unlikely-to-be-used B-roll?
The following strategy will help optimize costs while enriching archived and future content within your content libraries.
Strategy
We’ll focus on optimizing costs for both the metadata enrichment and ongoing storage of assets. We will minimize metadata enrichment costs by leveraging existing metadata and applying the services that will provide the best value for a particular asset. Furthermore, we will discuss how the anticipated usage determines the storage tier.
Figure A illustrates an example workflow where:
- Content is digitized and the video file is uploaded to Amazon Simple Storage Service (Amazon S3).
- A composite image grid is generated from the video. If a transcript doesn’t already exist, one is generated with Amazon Transcribe. Both the image grid and transcript are sent to Anthropic Claude in Amazon Bedrock—using generative AI to provide a low-cost contextual analysis. Later we’ll demonstrate the tremendous amount of metadata that Anthropic Claude Haiku in Amazon Bedrock can cost-effectively provide.
- Business logic is applied to the contextual analysis to classify assets into one of four tiers. Appropriate enrichment (performed with Amazon Rekognition) and storage policies are applied to each tier of content.
In this example, gold- and silver-tiered content has additional enrichment applied and is stored in more readily accessible storage. We’re simplifying this example by having the classification tier determine both the additional enrichment and the storage class to be used.
A customer’s environment can use different tier classifications for enrichment and storage, which better accommodates the range of assets involved. Assets that are deemed valuable enough to justify additional enrichment may not need to be available for retrieval within milliseconds. Additional details about enrichment and storage strategies are provided in the following sections.
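As a minimal sketch of what this business logic might look like, the following Python maps a contextual analysis to a tier. The tier names beyond gold and silver, the analysis fields, and the rules are illustrative assumptions rather than a prescribed schema.

```python
# Hypothetical tier definitions: which tiers receive additional
# Amazon Rekognition enrichment, and which S3 storage class they use.
TIERS = {
    "gold":    {"enrich": True,  "storage_class": "STANDARD"},
    "silver":  {"enrich": True,  "storage_class": "STANDARD_IA"},
    "bronze":  {"enrich": False, "storage_class": "GLACIER_IR"},
    "archive": {"enrich": False, "storage_class": "DEEP_ARCHIVE"},
}

def classify_asset(analysis: dict) -> str:
    """Apply simple business rules to the LLM's contextual analysis.

    The field names and rules below are assumptions for illustration.
    """
    if analysis.get("notable_quotes") or "breaking news" in analysis.get("topics", []):
        return "gold"
    if analysis.get("on_screen_personalities"):
        return "silver"
    if analysis.get("broadcast_classification") == "b-roll":
        return "archive"
    return "bronze"
```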
Operational Design
Varying levels of metadata can be extracted from on-screen elements such as people, scenes, on-screen text, and graphics. It is also important to leverage existing metadata, including labels on film canisters or tape, segment guides, and transcripts, which often have a high information density. The video and audio of the asset are analyzed using Amazon Bedrock large language models (LLMs) for near real-time evaluation.
Analyzing Audio
Amazon Transcribe is an automatic speech recognition (ASR) service that generates a transcript from an asset’s audio dialogue, such as the news broadcast in our example. If a transcript already exists, this step can be skipped and the existing transcript leveraged for analysis.
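As a hypothetical sketch, starting an asynchronous transcription job for an asset already in Amazon S3 might look like the following; the bucket, keys, and job name are placeholders.

```python
import boto3

transcribe = boto3.client("transcribe")

# Start an asynchronous transcription job for a video already in S3.
transcribe.start_transcription_job(
    TranscriptionJobName="news-segment-0001",  # placeholder job name
    Media={"MediaFileUri": "s3://my-archive-bucket/ingest/news-segment-0001.mp4"},
    MediaFormat="mp4",
    IdentifyLanguage=True,  # let the service detect the spoken language
    OutputBucketName="my-archive-bucket",
    OutputKey="transcripts/news-segment-0001.json",
)
```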
Analyzing Video
To generate a contextual response using foundation models (FMs) from Amazon Bedrock, it is important to align the video and audio data before sending it to the FM for analysis. Using the AWS Elemental MediaConvert frame capture feature, frames are extracted from the video and assembled into a composite image grid to prepare the input for analysis. Because the grid presents the extracted frames in sequence, we can instruct the FM to reason about the temporal information of the video.
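A sketch of these two steps follows, assuming placeholder bucket and role names and an illustrative capture rate of one frame every two seconds; the MediaConvert job settings are abbreviated to the fields relevant to frame capture.

```python
import boto3
from PIL import Image

# MediaConvert requires an account-specific endpoint.
endpoint = boto3.client("mediaconvert").describe_endpoints()["Endpoints"][0]["Url"]
mediaconvert = boto3.client("mediaconvert", endpoint_url=endpoint)

# Step 1: extract frames with MediaConvert frame capture
# (one frame every two seconds in this illustration).
mediaconvert.create_job(
    Role="arn:aws:iam::123456789012:role/MediaConvertRole",  # placeholder
    Settings={
        "Inputs": [{"FileInput": "s3://my-archive-bucket/ingest/news-segment.mp4"}],
        "OutputGroups": [{
            "OutputGroupSettings": {
                "Type": "FILE_GROUP_SETTINGS",
                "FileGroupSettings": {"Destination": "s3://my-archive-bucket/frames/"},
            },
            "Outputs": [{
                "ContainerSettings": {"Container": "RAW"},
                "VideoDescription": {
                    "CodecSettings": {
                        "Codec": "FRAME_CAPTURE",
                        "FrameCaptureSettings": {
                            "FramerateNumerator": 1,
                            "FramerateDenominator": 2,
                            "Quality": 80,
                        },
                    }
                },
            }],
        }],
    },
)

# Step 2: assemble the downloaded frames into a composite grid,
# preserving their temporal order left-to-right, top-to-bottom.
def build_grid(frame_paths, cols=5, tile=(320, 180)):
    rows = -(-len(frame_paths) // cols)  # ceiling division
    grid = Image.new("RGB", (cols * tile[0], rows * tile[1]))
    for i, path in enumerate(frame_paths):
        grid.paste(Image.open(path).resize(tile),
                   ((i % cols) * tile[0], (i // cols) * tile[1]))
    return grid
```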
The composite images, transcript, taxonomy definitions, and other relevant information (such as news and broadcast classification) are then presented as a single query to Anthropic Claude 3 Haiku in Amazon Bedrock. LLMs can cost-effectively analyze and summarize content and generate new metadata that can be used for classifications.
Submitting a single prompt with multiple questions to the LLM allows it to summarize the video into a concise description based on the asset’s audio and video. For news assets we have also used the Interactive Advertising Bureau (IAB) classification to gain additional contextual information. This enhances understanding, improves search and discovery, and provides the necessary context for media management. A sketch of such a request is shown after the following list.
Contextual information that can be gained for news assets includes, but is not limited to:
- Description
- Anchors and reporters
- Broadcast date
- Show
- Topics
- Statistics
- Themes
- Notable quotes
- On-screen celebrities and personalities
- News classification
- Broadcast classification
- Technical cues
- Language
- Brands and logos
- Relevant tags
This approach can be adapted based on the content type.
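As referenced above, that single multimodal request might look like the following sketch. The prompt wording, file names, and requested JSON fields are assumptions derived from the list above; the model ID is the public identifier for Anthropic Claude 3 Haiku in Amazon Bedrock.

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Load the composite image grid and the transcript (placeholder paths).
with open("grid.jpg", "rb") as f:
    grid_b64 = base64.b64encode(f.read()).decode("utf-8")
with open("transcript.txt") as f:
    transcript = f.read()

# One prompt carrying multiple questions; the field list mirrors the
# contextual information described above and is illustrative only.
prompt = (
    "The image is a grid of frames sampled in order from a news video. "
    "Using the frames and the transcript below, return JSON with: "
    "description, anchors_and_reporters, broadcast_date, show, topics, "
    "themes, notable_quotes, on_screen_personalities, news_classification, "
    "broadcast_classification, language, brands_and_logos, relevant_tags.\n\n"
    f"Transcript:\n{transcript}"
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64", "media_type": "image/jpeg",
                    "data": grid_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    }),
)
analysis = json.loads(response["body"].read())["content"][0]["text"]
```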
As shown in the output of the generative AI contextual analysis (Figure C), using the LLM to analyze media assets provides detailed descriptive metadata which can be used to enhance search and discovery and provide the necessary information for asset tiering.
Video Classification
Based on the output of the generative AI contextual analysis and business logic, further content analysis may or may not be necessary to extract time-series based metadata.
This strategy can enhance the search and discovery of content and also drive efficiencies within a growing content repository. As the repository grows, we recommend implementing a media management strategy so content can be stored across various tiers of storage.
Based on a news organization’s business requirements, there may be a need to keep content generated after a certain date, or aligned to current events, stored in Amazon S3 Glacier Instant Retrieval. This storage tier is designed for rarely accessed data that still needs immediate access with retrievals in milliseconds. In contrast, B-roll footage or segment content from over 20 years ago can be stored in lower-cost Amazon S3 Glacier Deep Archive, where retrieval time is within 12 hours.
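One way to implement this, sketched below, is to tag each object with its classification tier and let tag-filtered S3 Lifecycle rules perform the transitions; the bucket name, tag values, and rule IDs are assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Lifecycle rules transition objects by tier tag: silver-tagged assets
# move to S3 Glacier Instant Retrieval, archive-tagged assets to
# S3 Glacier Deep Archive.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-glacier-ir",
                "Filter": {"Tag": {"Key": "tier", "Value": "silver"}},
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "GLACIER_IR"}],
            },
            {
                "ID": "tier-deep-archive",
                "Filter": {"Tag": {"Key": "tier", "Value": "archive"}},
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "DEEP_ARCHIVE"}],
            },
        ]
    },
)

# Tag an asset with its classification tier so the rules above apply.
s3.put_object_tagging(
    Bucket="my-archive-bucket",
    Key="ingest/news-segment-0001.mp4",
    Tagging={"TagSet": [{"Key": "tier", "Value": "archive"}]},
)
```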
Identifying People in News
Although the LLM can identify celebrities, such as nationally known broadcast reporters, there are a number of people it may not identify. Two approaches can be used to identify people in the news: the Amazon Rekognition RecognizeCelebrities API, and the IndexFaces and SearchFaces APIs.
The Amazon Rekognition celebrity recognition API is designed to automatically recognize celebrities and well-known personalities in images and videos using machine learning. However, there are often cases where local celebrities (such as news anchors, meteorologists, and field correspondents) are not identified by the celebrity recognition API.
In such cases, Amazon Rekognition offers face “Collections”, which are used for managing information related to faces. A custom collection can be created to store a face embedding for each unique face. Each unknown face is compared against the existing collection using the SearchFaces or SearchFacesByImage API; if it has not been captured previously, it is added to the collection using the IndexFaces API. This custom face collection then acts as a discoverable database of face vectors.
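A minimal sketch of that flow follows, assuming a pre-created collection named news-faces and an illustrative match threshold of 90. Note that SearchFacesByImage matches only the largest face in the supplied image, so a production workflow would crop individual faces first.

```python
import boto3

rekognition = boto3.client("rekognition")
COLLECTION_ID = "news-faces"  # assumed, created once with create_collection

def identify_faces(image_bytes: bytes) -> list:
    """Recognize celebrities first, then resolve unknown faces against
    a custom collection, indexing any face seen for the first time."""
    identities = []
    celebs = rekognition.recognize_celebrities(Image={"Bytes": image_bytes})
    identities += [c["Name"] for c in celebs["CelebrityFaces"]]

    if celebs["UnrecognizedFaces"]:
        match = rekognition.search_faces_by_image(
            CollectionId=COLLECTION_ID,
            Image={"Bytes": image_bytes},
            FaceMatchThreshold=90,  # illustrative threshold
            MaxFaces=5,
        )
        if match["FaceMatches"]:
            identities += [m["Face"]["ExternalImageId"]
                           for m in match["FaceMatches"]]
        else:
            # First sighting: add the face embedding to the collection.
            rekognition.index_faces(
                CollectionId=COLLECTION_ID,
                Image={"Bytes": image_bytes},
                ExternalImageId="unknown-face",  # placeholder label
            )
    return identities
```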
In testing news assets, roughly 15 distinct faces were observed per 30-minute news program.
Uncovering Efficiencies
Processing at Fixed Versus Dynamic Intervals
Processing at Fixed Intervals
Processing at a fixed frame rate (for example, one frame per second or one frame every two seconds) can certainly capture a large amount of metadata. However, it comes with trade-offs in the cost of computer vision API calls. The advantages of this approach include comprehensive metadata capture, more on-screen text captured, and full correlation between visual metadata and the transcript. The following image (Figure D) is an example of processing at a fixed frame rate.
Dynamic Frame Analysis
Configuring the metadata enrichment framework to process news media assets dynamically reduces cost by limiting the number of frames sent to Amazon Bedrock and Amazon Rekognition. The solution measures the Hamming distance between perceptual hashes created for each extracted image frame to decide when a frame has changed significantly. This approach only calls Amazon Rekognition APIs when visual frame changes indicate the need for new analysis.
While this method reduces the cost of computer vision APIs, detecting text with this approach may result in missed on-screen text between API calls.
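A minimal sketch of the frame selection step, using the open-source imagehash library; the Hamming-distance threshold is an assumption that would be tuned per content type.

```python
import imagehash
from PIL import Image

HAMMING_THRESHOLD = 12  # illustrative; tune per content type

def select_frames(frame_paths: list) -> list:
    """Keep only frames whose perceptual hash differs significantly
    from the last selected frame."""
    selected, last_hash = [], None
    for path in frame_paths:
        phash = imagehash.phash(Image.open(path))
        # Subtracting two ImageHash objects yields the Hamming distance.
        if last_hash is None or phash - last_hash > HAMMING_THRESHOLD:
            selected.append(path)
            last_hash = phash
    return selected
```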
To provide a general reference point, we conducted tests on news segments ranging from five minutes to one hour. In testing, this approach reduced API calls by an average of 83%.
Additional testing results:
- High-density content, such as sports highlights and war footage, saw roughly a 70% reduction in Amazon Rekognition Image API calls.
- Low-density content, such as press conferences and one-on-one interviews, saw roughly a 90% reduction in Amazon Rekognition Image API calls.
Media Management and Storage Tiering
Further efficiencies are realized by tiering assets. Using the contextual information derived from the asset, we can implement a more efficient media management and enrichment strategy.
Pricing Breakdown
To provide a general reference point, we will use the referenced 5-minute, 36-second news segment clip (shown in Figures D and E) to map out pricing for each classification tier. The size of the video used is 12.3 GB. It is important to note that dynamic frame analysis was used during the metadata enrichment stage: 668 frames were extracted from the video, and only 147 of them were used in the analysis. This resulted in a 78% reduction in Amazon Rekognition Image API calls (1 - 147/668 ≈ 0.78).
As shown by the example cost scenarios, the proposed approach provides cost savings for both metadata enrichment and ongoing storage. Depending on the content type, the storage tiers and the level of analysis may vary, all affecting the total cost. It is important to note that we do not recommend extrapolating the analysis cost based on content length.
For current AWS product and service pricing, see https://thinkwithwp.com/pricing/.
Conclusion
Organizations want to ensure that archived assets are not only preserved, but also utilized to their full potential. By following the strategy we’ve described, organizations can significantly reduce the costs associated with enriching and storing archival content, while maintaining high standards of accessibility. This optimizes the management of vast content repositories, and also empowers organizations to uncover new opportunities for content discovery, reuse, and potential monetization.
Contact an AWS Representative to learn how we can help accelerate your business.
Visit the following links to learn more about additional media and entertainment industry use cases: