AWS for M&E Blog
Media content localization with AWS AI services and Amazon Bedrock
This blog post walks through a content localization approach where English is translated to Korean.
In media production, subtitling plays a crucial role in viewer comprehension by providing transcriptions that can be translated from one language to another. This capability has helped expedite the globalization of media content. For example, the popular Netflix show Squid Game became accessible to a broad demographic through effective subtitling. Subtitles are vital in fostering a connection between viewers and media, bridging linguistic barriers.
This blog post describes two methods to translate subtitles: the sequential process of translation and refinement using artificial intelligence (AI) services from Amazon Web Services (AWS), and direct translation through prompt engineering with Amazon Bedrock. In addition, we explain the automated process of generating short video clips based on subtitle files.
Introduction and high-level flow
- Amazon Transcribe is a speech-to-text (STT) service from AWS that generates a WebVTT file from a sample MP4 file (What is Amazon Transcribe_.mp4). For the purposes of this blog post, it is not necessary to use the same MP4 file.
- Amazon Translate supports the following file formats for input data: Plain text (.txt); HTML (.html); Microsoft Word (.docx), Excel (.xlsx), PowerPoint (.pptx); and XLIFF 1.2 (.xlf). In this case, WebVTT subtitle files generated from Amazon Transcribe must be converted into plain text (.txt) format for translation.
- Because it is not feasible to directly translate SRT (SubRip) or WebVTT files with Amazon Translate, preprocessing steps are implemented to optimize translation outcomes.
- The steps for processing WebVTT subtitle files to enhance readability are executed on an Amazon SageMaker Studio notebook instance, as depicted in Figure 1.
- Use Amazon Bedrock to directly translate subtitles with prompt engineering techniques.
- Use Amazon Bedrock to generate concise video clips incorporating the WebVTT file.
Summary of steps
[Subtitle translation with AWS AI services]
- Step 1: Create WebVTT file using Amazon Transcribe
- Step 2: Process the WebVTT file with Amazon SageMaker Studio
- Convert the WebVTT file to a text file and perform basic processing
- Initiate an Amazon Translate job
- Step 3: Convert the translated text file back into WebVTT format to enhance readability
[Utilizing Amazon Bedrock]
- Step 4: Use Amazon Bedrock to directly translate the WebVTT file through prompt engineering
- Step 5: Delineate a methodology to generate a concise video clip using Amazon Bedrock
Prerequisites
The following prerequisites are needed for this walkthrough:
- An AWS account
- Amazon SageMaker Studio
- Grant the Execution Role access to services such as Amazon S3, Amazon Polly, and Amazon Bedrock
- A new Amazon S3 bucket
- Amazon Bedrock access to Claude 3
Step 1: Create a WebVTT file using Amazon Transcribe
Start by using Amazon Transcribe to generate a WebVTT file from a sample video file.
- Within the newly created Amazon S3 bucket, create a folder and upload the sample video file to that location. (What is Amazon Transcribe_.mp4)
- Within the Amazon Transcribe service, initiate the ‘Create Job’ process and specify the Input and Output path as depicted in Figure 3. For the Output path, select the ‘Customer specified’ option and designate a new folder to store the generated WebVTT file. In this case, the folder ‘TranscribeVTT’ is designated.
- Retain the default settings for the remaining configuration options and then create the job.
- When the job Status is ‘Complete,’ validate that the WebVTT file has been successfully generated and stored in the designated Amazon S3 location.
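The console flow above can also be scripted. The following is a minimal boto3 sketch of the equivalent API call, offered as an alternative to the console steps; the job name, bucket name, and file name are placeholders to replace with your own.

```python
import boto3

transcribe = boto3.client("transcribe")

# Start a transcription job that also emits a WebVTT subtitle file.
transcribe.start_transcription_job(
    TranscriptionJobName="what-is-amazon-transcribe",  # placeholder job name
    Media={"MediaFileUri": "s3://YOUR_BUCKET/input/What is Amazon Transcribe_.mp4"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    OutputBucketName="YOUR_BUCKET",
    OutputKey="TranscribeVTT/",      # the 'Customer specified' output folder
    Subtitles={"Formats": ["vtt"]},  # request WebVTT subtitle output
)
```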
Step 2: Process the WebVTT file with Amazon SageMaker Studio
Access the Amazon SageMaker Studio service and instantiate a new notebook instance. Specify ‘Python 3’ as the kernel, retain the default instance type configuration, and proceed with the creation process.
To facilitate an uninterrupted workflow, execute the following code snippet.
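The following is a minimal sketch of such a setup cell; the webvtt-py and moviepy packages and the file names are assumptions for this walkthrough:

```python
# Install helper packages used in the rest of this walkthrough (assumed choices).
%pip install -q webvtt-py moviepy

import boto3

# Placeholder bucket and key; use the output location from Step 1.
bucket = "YOUR_BUCKET"
s3 = boto3.client("s3")
s3.download_file(bucket, "TranscribeVTT/What is Amazon Transcribe_.vtt", "subtitle.vtt")
```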
The subsequent code snippet facilitates an understanding of the structural composition of the WebVTT file.
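A sketch of such an inspection cell, using the webvtt-py package (an assumed choice) to print each cue's timestamps and text:

```python
import webvtt

# Each WebVTT cue consists of a start timestamp, an end timestamp, and text.
for caption in webvtt.read("subtitle.vtt"):
    print(caption.start, "-->", caption.end)
    print(caption.text)
    print()
```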
The WebVTT file is converted to one of the file formats supported by Amazon Translate, specifically the text (.txt) format in this case. The modified text file is saved as ‘textconverted.txt’ within a newly created directory named ‘txt’.
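A minimal sketch of that conversion, which drops the timestamps and keeps only the spoken text (file and directory names follow the ones given above):

```python
import os
import webvtt

os.makedirs("txt", exist_ok=True)

# Keep only the cue text, one line per cue, so Amazon Translate sees plain text.
with open("txt/textconverted.txt", "w", encoding="utf-8") as f:
    for caption in webvtt.read("subtitle.vtt"):
        f.write(caption.text.replace("\n", " ") + "\n")
```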
As shown in Figure 5, if translation is performed directly on the current WebVTT file structure, the meaning could be diluted or incorrectly conveyed. This is a common occurrence due to structural differences in sentence placement across languages. In the example of Figure 6, the Korean text is translated incorrectly.
After the following processing steps, the completed sentences are stored in an Amazon S3 bucket.
[Example]
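A simplified sketch of such a processing step; it joins the cue fragments into period-terminated sentences and uploads the result to Amazon S3 (the bucket and key names are placeholders):

```python
import boto3
import webvtt

# Join the cue fragments, then split on periods so every line is one
# complete sentence rather than a timeline-sized fragment.
full_text = " ".join(c.text.replace("\n", " ") for c in webvtt.read("subtitle.vtt"))
sentences = [s.strip() + "." for s in full_text.split(".") if s.strip()]

with open("txt/sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))

# Placeholder bucket/prefix; Amazon Translate batch jobs read from an S3 folder.
boto3.client("s3").upload_file("txt/sentences.txt", "YOUR_BUCKET", "txt/sentences.txt")
```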
Perform the translation of the text file using Amazon Translate by executing the following statement.
[Example]
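A sketch of starting the batch translation job with boto3; the S3 URIs and the data access role ARN are placeholders for your own resources:

```python
import boto3

translate = boto3.client("translate")

translate.start_text_translation_job(
    JobName="subtitle-en-to-ko",
    InputDataConfig={
        "S3Uri": "s3://YOUR_BUCKET/txt/",  # folder holding the .txt input
        "ContentType": "text/plain",
    },
    OutputDataConfig={"S3Uri": "s3://YOUR_BUCKET/translated/"},
    # IAM role that grants Amazon Translate access to the buckets above.
    DataAccessRoleArn="arn:aws:iam::ACCOUNT_ID:role/TranslateDataAccessRole",
    SourceLanguageCode="en",
    TargetLanguageCodes=["ko"],
)
```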
Ultimately, you can verify the translation results to ensure no loss of meaning or mistranslations occurred. Next, let’s explore the process of converting this translated text file back into WebVTT format.
Step 3: Convert the translated text file back into WebVTT format to enhance readability
While the previous step involved preprocessing the English WebVTT file into a text file for optimal translation, this step focuses on properly placing the translated sentences back into the original WebVTT timeline. This blog post introduces two methods for creating the translated WebVTT file.
Method 1: Tokenize the entire sentence iteratively
The first method involves recognizing sentences in the original English text that end with a period as single sentences. The translated sentence is then repeatedly displayed for the duration of that sentence’s timeline. This allows viewers to understand the context by seeing the translated subtitles while the original video plays.
[Example]
00:00:01.970 --> 00:00:06.119
Transcribing audio can be complex, time consuming and expensive.

00:00:06.329 --> 00:00:10.800
You either need to hire someone to do it manually implement applications that are

00:00:10.810 --> 00:00:15.710
difficult to maintain or use hard to integrate services that yield poor results.
As depicted in the previous example, the text from 00:00:06.329 to 00:00:15.710 forms one complete sentence, so the entire translated sentence is displayed repeatedly across both cues during that timeline. However, the flow of the sentences may not look natural.
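A simplified sketch of Method 1, assuming the Korean sentence list produced by Amazon Translate is available; a period at the end of an English cue marks the sentence boundary:

```python
import webvtt

def repeat_sentence_subtitles(vtt_path, ko_sentences, out_path):
    """Write a Korean VTT where every cue repeats the full translated
    sentence that its English fragment belongs to (Method 1)."""
    vtt = webvtt.read(vtt_path)
    idx = 0
    for caption in vtt:
        original = caption.text.replace("\n", " ").rstrip()
        caption.text = ko_sentences[min(idx, len(ko_sentences) - 1)]
        # A sentence-ending period means the next cue starts a new sentence.
        if original.endswith("."):
            idx += 1
    vtt.save(out_path)

repeat_sentence_subtitles("subtitle.vtt", ko_sentences, "subtitle_ko_method1.vtt")
```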
Method 2: Redistribute the entire sentence
The second method involves redistributing sentences as needed to maintain the overall flow and context.
The following are three approaches for redistributing the subtitle file based on different conditions:
- When a single sentence fits into a single timeline, translate the entire subtitle and position the translated sentence accordingly.
- If an entire sentence is divided across multiple subtitle timelines, increase the lag value to adjust the index of the translated sentence to match the timeline. Then, use the ‘divide_sentence()’ function (sketched after the following data list) to split the sentence according to the size of each subtitle (buffer) and position the subtitles accordingly.
- If a single subtitle contains multiple sentences, increase the lead value to adjust the index of the translated sentences. Then, position the translated multiple sentences within their respective subtitles.
To conduct Method 2, three pieces of data are required:
- Original (English) VTT data
- Original (English) text list
- Translated (Korean) text list (translated using Amazon Translate)
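The following is a simplified sketch of the redistribution idea, including an assumed implementation of the ‘divide_sentence()’ helper; it splits a translated sentence into chunks roughly proportional to the lengths of the original cue fragments (the buffer sizes):

```python
def divide_sentence(sentence, weights):
    """Split `sentence` into len(weights) chunks whose word counts are
    roughly proportional to `weights` (e.g. original cue text lengths)."""
    words = sentence.split()
    total, chunks, start = sum(weights), [], 0
    for i, w in enumerate(weights):
        end = len(words) if i == len(weights) - 1 else start + max(1, round(len(words) * w / total))
        chunks.append(" ".join(words[start:end]))
        start = min(end, len(words))
    return chunks

# One English sentence spread across two cue timelines: split its Korean
# translation in proportion to the two English fragment lengths.
weights = [84, 80]  # character counts of the two English fragments above
ko = "유지 관리가 어려운 애플리케이션을 수동으로 구현할 사람을 고용하거나, 통합이 어려운 서비스를 사용하여 결과가 좋지 않을 수 있습니다"
print(divide_sentence(ko, weights))
```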
Result example
With this method, the translated WebVTT subtitles can be positioned appropriately. Following is an example of a Korean VTT subtitle.
00:00:01.970 --> 00:00:06.119
오디오 텍스트 변환은 복잡하고 시간이 많이 걸리며 비용이 많이 들 수 있습니다

00:00:06.329 --> 00:00:10.800
유지 관리가 어려운 애플리케이션을 수동으로 구현할 사람을 고용하거나, 통합이

00:00:10.810 --> 00:00:15.710
어려운 서비스를 사용하여 결과가 좋지 않을 수 있습니다
Step 4: Use Amazon Bedrock to directly translate the WebVTT file through prompt engineering
As the structure of subjects, objects, and verbs varies across languages, directly translating the original text can lead to subtitles that lack fluency or fail to capture the viewer’s attention. In such cases, long context prompt engineering techniques can be employed to directly translate subtitle files, like WebVTT and SRT.
- Provide the large language model (LLM) with a persona and conditions that it must adhere to
- Then, place the target subtitle file before the conditions and the query
- Target data must be in XML tags so it’s clearly separated from the instructions
[Query example]
You are a master of subtitle translation. Below is a webVTT formatted subtitle that you must work on.
<subtitle>
00:00:01.970 --> 00:00:06.119
Transcribing audio can be complex, time consuming and expensive.

00:00:06.329 --> 00:00:10.800
You either need to hire someone to do it manually implement applications that are

00:00:10.810 --> 00:00:15.710
difficult to maintain or use hard to integrate services that yield poor results.
</subtitle>
Condition:
1. Leave the timeline and translate English part to Korean.
2. Make it smooth from the audience’s perspective.
3. Do not provide extra information besides subtitle.
[Result example]
<subtitle>
00:00:01.970 --> 00:00:06.119
오디오 전사 작업은 복잡하고 시간이 많이 걸리며 비용이 많이 듭니다.

00:00:06.329 --> 00:00:10.800
전사를 수작업으로 맡기거나 유지보수가 어려운 애플리케이션을 개발하거나

00:00:10.810 --> 00:00:15.710
통합하기 어렵고 결과물 품질이 낮은 서비스를 이용해야 합니다.
</subtitle>
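A minimal sketch of sending this query to Claude 3 on Amazon Bedrock with boto3; the region and model ID are assumptions, and the prompt is the query text shown above:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed region

with open("subtitle.vtt", encoding="utf-8") as f:
    subtitle = f.read()

# Persona, then the target subtitle in XML tags, then the conditions.
prompt = (
    "You are a master of subtitle translation. Below is a webVTT formatted "
    f"subtitle that you must work on.\n<subtitle>\n{subtitle}\n</subtitle>\n"
    "Condition: 1. Leave the timeline and translate English part to Korean. "
    "2. Make it smooth from the audience's perspective. "
    "3. Do not provide extra information besides subtitle."
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed Claude 3 model ID
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
print(json.loads(response["body"].read())["content"][0]["text"])
```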
Step 5: A methodology to generate a concise video clip using Amazon Bedrock
When using Amazon Transcribe, subtitle files ranging from a few dozen to several hundred timeline entries can be generated. By leveraging Claude 3, Anthropic’s large language model (LLM), on Amazon Bedrock, you can extract the subtitle entries that specifically highlight key moments throughout the entire video.
Following are the steps to create a video summary or short clips:
- Use prompt engineering to extract key moments from the WebVTT files.
a. Documentary example: “Extract the important parts from the video.”
b. e-commerce example: Persona: You are a professional short-clip editor. Task: “Select the highlight where the speaker accentuates the advantages of the product.”
- Use the ‘re’ module to extract the time ranges from the summarized WebVTT file.
- Use the ‘MoviePy’ library to concatenate the video clips and build an MP4 file (see the sketch after this list).
- Upload the MP4 file to Amazon S3 if it needs to be further transcoded using AWS Elemental MediaConvert.
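The following is a simplified sketch of steps 2 through 4, assuming the summarized WebVTT text returned by the LLM is already held in a string; the file names are placeholders:

```python
import re
from moviepy.editor import VideoFileClip, concatenate_videoclips

def to_seconds(ts):
    """Convert an 'HH:MM:SS.mmm' WebVTT timestamp to seconds."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

# Extract the (start, end) time ranges the LLM kept in its summary.
pattern = r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})"
ranges = re.findall(pattern, summarized_vtt)  # summarized_vtt: the LLM's output

# Cut each highlighted range out of the source video and concatenate them.
video = VideoFileClip("What is Amazon Transcribe_.mp4")
clips = [video.subclip(to_seconds(start), to_seconds(end)) for start, end in ranges]
concatenate_videoclips(clips).write_videofile("highlight.mp4")
```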
Conclusion
This blog post demonstrated how to use AI services from AWS to automate and streamline media content localization. Builders can use Amazon Transcribe and Amazon Translate, adding preprocessing steps to improve translation quality. By leveraging Amazon Bedrock with prompt engineering, subtitles can be translated directly, though human-in-the-loop review is still required. Furthermore, video editors can generate short clips by combining the subtitle context with an LLM.