AWS Machine Learning Blog

Optimize your budget and time by submitting Amazon Polly voice synthesis tasks in bulk

Amazon Polly is a service that turns text into natural-sounding speech, using dozens of voices in more than 30 languages. You can use it for all sorts of applications, ranging from talking animated avatars, to lifelike virtual agents that answer customer support requests, to automated newscasters reading stories aloud. You can have Amazon Polly return synthesized speech as a live stream, or download it as a standard audio file for playback later. Like many AWS services, you pay only for what you actually use: with Amazon Polly, you pay for the number of characters in the synthesized phrase. Just playing a saved audio file is free, whether you play it a single time or multiple times.

If you know exactly which phrases you need ahead of time, you can optimize your AWS spend. Just take every phrase you need voiced and send it to Amazon Polly at build time, storing the generated audio file until you’re ready to play it back at runtime. Common use cases for this approach include public address systems at airports or bus stations, video games, and quick-service restaurant automated order-takers. Just pay once to synthesize your text, and then replay the resulting audio files as needed for free.

In this post, we share a fully automated, event-driven, serverless solution that you can use to turn large numbers of text phrases into lifelike speech asynchronously. You can trigger the jobs by manually uploading a file of phrases to a private Amazon Simple Storage Service (Amazon S3) bucket, and then be notified by email or instant message when they’re ready. Or, make the process part of your AWS CodeBuild continuous integration system by automatically triggering the synthesis work whenever your source phrases change.

Overview of the solution

The solution is fully serverless, consisting chiefly of a set of AWS Lambda functions. These functions track the items to be synthesized, submit them to Amazon Polly for synthesis, and process the results as they’re completed. The functions use shared Amazon DynamoDB tables to manage the state of the work over time. An AWS Step Functions workflow tracks each submitted set, and notifies interested parties of its completion via an Amazon Simple Notification Service (Amazon SNS) topic.

The solution employs an event-driven architecture: rather than a single process running from beginning to end, the work is distributed across Lambda invocations, each of which runs only when triggered by an event.

The following diagram illustrates the solution architecture.

Deploy and configure the solution

You deploy the solution into your AWS account using the AWS Serverless Application Model (AWS SAM). You can do this from any computer with command line access to your account, but for the sake of simplicity, we use AWS CloudShell.

  1. Sign in to the CloudShell console.
  2. When your shell has been initialized, make a local copy of the solution source code and prepare the AWS SAM stack by issuing the following commands:
$ git clone https://github.com/aws-samples/amazon-polly-async-batch.git
$ cd amazon-polly-async-batch
$ sam build
  3. Use AWS SAM to deploy the solution with sam deploy --guided. Provide a stack name (like amazon-polly-async-batch), your preferred Region, an email address for notifications, and the name of an S3 bucket that doesn’t exist yet, to hold the generated audio files. Accept the other defaults.
$ sam deploy --guided
        Setting default arguments for 'sam deploy'
        =========================================
        Stack Name [amazon-polly-async-batch]: 
        AWS Region [us-east-1]: 
        Parameter NotificationEmail []: *YOUR EMAIL ADDRESS*
        Parameter WorkBucket []: *YOUR WORK BUCKET NAME*
        #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
        Confirm changes before deploy [y/N]:  
        #SAM needs permission to be able to create roles to connect to the resources in your template
        Allow SAM CLI IAM role creation [Y/n]:  
        Save arguments to configuration file [Y/n]: 
        SAM configuration file [samconfig.toml]: 
        SAM configuration environment [default]: 

Deployment of all the components should take only a few minutes. If installation is successful, you should see a message like the following:

Successfully created/updated stack - amazon-polly-async-batch in us-east-1
  4. Check your email for a message from Amazon SNS and confirm the subscription.

How the solution works

In this section, we describe in detail how to use the solution to synthesize your text, and how each major component works.

The set file: Specifying the text to synthesize

You define the set of text phrases you want Amazon Polly to voice in a file called a set file. This is a YAML file consisting of the set details, a collection of defaults, and a list of items to synthesize:

  • Set details – In the set stanza, you give the set a name to differentiate it from others, and an optional output prefix to tell the solution where in your S3 bucket you want the audio files stored.
  • Defaults – In the optional defaults section, you can set parameter values that apply to every item unless a specific item overrides them. The following attributes are supported, as documented in the Amazon Polly API:
    • engine – Either standard or neural; defaults to neural
    • language-code – Any language code supported by Amazon Polly; defaults to en-US
    • output-format – One of mp3, ogg_vorbis, or pcm; defaults to mp3
    • text-type – Either text or ssml; defaults to text
    • voice-id – Any of the supported voices; defaults to Matthew
  • Items – The items collection is simply a list of text strings to synthesize. Amazon Polly converts each item’s text to speech, using the set defaults plus any overrides given in the item, and places the resulting files in the S3 bucket in the set’s output prefix folder. If you specify an output file, the file is named as specified; otherwise, the solution assigns the file a name based on its contents and its order in the collection.

For example, if you want to synthesize six lines from Act 1 Scene 1 of Romeo and Juliet, you might use a YAML file that looks like the following code:

set:
  name: romeo-juliet
  output-prefix: act-1-scene-1
defaults:
  engine: neural 
  language-code: en-US
  output-format: mp3
  text-type: text
items:
  - text: Do you bite your thumb at us, sir?
    voice-id: Joey
  - text: I do bite my thumb, sir.
    voice-id: Matthew
  - text: <speak>Do you bite your thumb at <break/>us<break/>, sir?</speak>
    voice-id: Joey
    text-type: ssml
  - text: >
      <speak><amazon:effect name="whispered">Is the law of our side
      if I say aye?</amazon:effect></speak>
    voice-id: Matthew
    text-type: ssml
  - text: <speak><amazon:effect name="whispered">No.</amazon:effect></speak>
    voice-id: Brian
    text-type: ssml
  - text: No, sir. I do not bite my thumb at you, sir, but I bite my thumb, sir.
    voice-id: Matthew 

This set specifies that Amazon Polly should synthesize six lines from the play. To represent the characters Abraham, Sampson, and Gregory, we use the voices Joey, Matthew, and Brian. With SSML, you can also control how a line is delivered: Abraham sets off the word “us” with pauses, and Sampson’s and Gregory’s asides are whispered. For SSML effects like these, we simply specify that the text-type is ssml and decorate the utterance appropriately.
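The way set defaults and per-item overrides combine can be sketched in a few lines of Python. This is a minimal illustration rather than the solution's actual code: the names `BUILT_IN_DEFAULTS` and `resolve_item` are hypothetical, and the fallback values are simply the documented defaults listed earlier.

```python
# Fallback values as documented for the set file. The dictionary and
# function names are hypothetical and exist only for this illustration.
BUILT_IN_DEFAULTS = {
    "engine": "neural",
    "language-code": "en-US",
    "output-format": "mp3",
    "text-type": "text",
    "voice-id": "Matthew",
}


def resolve_item(item, set_defaults=None):
    """Return the effective settings for one item.

    Precedence, lowest to highest: built-in defaults, the set's
    defaults section, then the values given on the item itself.
    """
    resolved = dict(BUILT_IN_DEFAULTS)
    resolved.update(set_defaults or {})
    resolved.update(item)
    return resolved


set_defaults = {"engine": "neural", "language-code": "en-US",
                "output-format": "mp3", "text-type": "text"}
aside = resolve_item(
    {"text": '<speak><amazon:effect name="whispered">No.</amazon:effect></speak>',
     "voice-id": "Brian", "text-type": "ssml"},
    set_defaults,
)
print(aside["voice-id"], aside["text-type"], aside["engine"])
```

Here the item keeps its own voice-id and text-type, and inherits engine, language-code, and output-format from the set defaults.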

Because none of the items specify an output file, the file names are generated automatically for you. In this example, the generated MP3 files are act-1-scene-1/item-000000-do-you-bite-your-thumb-at-us-sir.mp3 through act-1-scene-1/item-000005-no-sir-i-do-not-bite-my-thumb-at-you-sir.mp3.
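The two generated names above are consistent with a simple rule: lowercase the text, collapse anything that isn't a letter or digit into a hyphen, and cut the result off at 40 characters. The following Python sketch reproduces those two names under that assumption. The 40-character limit, the tag stripping for SSML items, and the function name `default_output_name` are all inferred or invented for illustration, so the solution's actual naming code may differ.

```python
import re

# Assumption: inferred from the two example names, not taken from the
# solution's source code.
MAX_SLUG_LENGTH = 40


def default_output_name(index, text, prefix="", extension="mp3"):
    """Build a file name from an item's position and its text."""
    # Drop anything that looks like an SSML tag so only spoken words remain.
    plain = re.sub(r"<[^>]+>", " ", text)
    # Lowercase, then turn every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", plain.lower()).strip("-")
    # Truncate long phrases, and avoid ending the name on a hyphen.
    slug = slug[:MAX_SLUG_LENGTH].rstrip("-")
    name = f"item-{index:06d}-{slug}.{extension}"
    return f"{prefix}/{name}" if prefix else name


print(default_output_name(0, "Do you bite your thumb at us, sir?", "act-1-scene-1"))
print(default_output_name(
    5,
    "No, sir. I do not bite my thumb at you, sir, but I bite my thumb, sir.",
    "act-1-scene-1",
))
```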

This set file and others are in the docs/samples directory of the code. In CloudShell, you can send this file to Amazon Polly simply by uploading it to the S3 bucket you specified earlier:

$ aws s3 cp docs/samples/romeo-juliet.yml s3://[BUCKET NAME]

Amazon Polly synthesizes the six lines from the file. When all the lines have been synthesized, you get an email notification:

Your Amazon Polly batch set romeo-juliet completed with 6 successful tasks and 0 failures. The requested files are in s3://[BUCKET NAME]/act-1-scene-1/.

YAML can be created in any editor, is easy for humans to read, and is friendly for checking in to source control systems like AWS CodeCommit. However, the set file must be a pure text file, must have the .yml file extension, and must be valid YAML.

The Set Processor function

When a file with a .yml extension is uploaded to the S3 bucket, the Set Processor Lambda function kicks off the process. It parses the set file and creates a corresponding record for it in DynamoDB. This set record is used to keep track of how many items there are in the set, how many have yet to be completed, and when the set processing began.

Then, for each item in the collection, the Set Processor function posts a message—a work order, of sorts—to the solution’s Amazon Simple Queue Service (Amazon SQS) queue. This work order is a JSON document including everything Amazon Polly needs to synthesize the text per the instructions in the uploaded set file.

Each message is entirely independent of the others, so the work of synthesizing them can be done by Amazon Polly concurrently, and it doesn’t matter in what order they’re completed. The name of the set is also part of the work order, so multiple sets (or even multiple instances of the same set) can be processed by the solution at the same time.
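To make the idea of a work order concrete, the following Python sketch turns a parsed set file into one JSON message per item. It is an illustration only: the function name `build_work_orders` and the field names `set-name` and `index` are hypothetical, and the real solution's message layout may differ.

```python
import json


def build_work_orders(set_name, output_prefix, defaults, items):
    """Return one JSON string per item, each usable as a queue message body."""
    orders = []
    for index, item in enumerate(items):
        order = {
            **defaults,                 # set-wide defaults first
            **item,                     # then the item's own overrides
            "set-name": set_name,       # lets several sets share one queue
            "output-prefix": output_prefix,
            "index": index,             # position in the set, for file naming
        }
        orders.append(json.dumps(order))
    return orders


orders = build_work_orders(
    "romeo-juliet",
    "act-1-scene-1",
    {"engine": "neural", "language-code": "en-US",
     "output-format": "mp3", "text-type": "text"},
    [
        {"text": "Do you bite your thumb at us, sir?", "voice-id": "Joey"},
        {"text": "I do bite my thumb, sir.", "voice-id": "Matthew"},
    ],
)
for body in orders:
    print(body)
```

Because every message carries the set name and all synthesis parameters, a consumer needs nothing but the message itself to do its work, which is what allows the items to be processed concurrently and in any order.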

The Item Processor function

The Item Processor Lambda function consumes messages from the SQS queue and posts work to Amazon Polly.

Each message represents a single audio file for Amazon Polly to create. The function calls the API method StartSpeechSynthesisTask, using the values in the work order as arguments to the method’s parameters. This is an asynchronous API call, so we have no guarantees as to when Amazon Polly actually generates the audio file for us; but when it’s complete, Amazon Polly publishes an SNS message for the next Lambda function, the Response Processor, to handle.
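As a rough sketch of that call, the following Python code maps a work order onto the parameters of StartSpeechSynthesisTask and submits it through a boto3 Polly client. The work order field names and both function names are hypothetical. The keyword arguments are the documented parameters of boto3's `start_speech_synthesis_task`, but how the solution itself fills them in is an assumption.

```python
def synthesis_arguments(order, bucket, sns_topic_arn):
    """Translate a (hypothetical) work order into API parameters."""
    return {
        "Engine": order["engine"],
        "LanguageCode": order["language-code"],
        "OutputFormat": order["output-format"],
        "OutputS3BucketName": bucket,
        # Amazon Polly chooses the object name under this prefix.
        "OutputS3KeyPrefix": order["output-prefix"] + "/",
        # Topic that receives the completion notification.
        "SnsTopicArn": sns_topic_arn,
        "Text": order["text"],
        "TextType": order["text-type"],
        "VoiceId": order["voice-id"],
    }


def submit(polly_client, order, bucket, sns_topic_arn):
    """Start an asynchronous synthesis task and return its task ID.

    polly_client is expected to be the object returned by
    boto3.client("polly").
    """
    response = polly_client.start_speech_synthesis_task(
        **synthesis_arguments(order, bucket, sns_topic_arn)
    )
    return response["SynthesisTask"]["TaskId"]
```

The call returns as soon as the task is accepted. The returned task ID is what later lets a notification be matched back to the item that requested it.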

The Item Processor function also adds a record to the items table in DynamoDB, so the solution can keep track of which items have been successfully completed and which have not yet been.

As with many AWS APIs, there are limits on how many calls you can make to Amazon Polly per second. The Item Processor function is throttled to stay within reasonable limits, and it backs off exponentially and retries as needed, so that the work is still posted without exceeding your account service limits.
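The retry behavior can be illustrated with a generic exponential backoff loop. This Python sketch is not the Item Processor's code: `ThrottledError` is a stand-in for whatever throttling error the service client raises, the delay values are arbitrary, and the random jitter is an addition of ours that is commonly paired with exponential backoff.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from the service."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Run call(), retrying with exponentially growing waits when throttled."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # The ceiling doubles on each retry: 0.5s, 1s, 2s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random amount up to the ceiling ("full jitter") so that
            # concurrent workers don't all retry at the same instant.
            sleep(random.uniform(0, ceiling))
```

Passing `sleep` as a parameter keeps the function easy to exercise in tests without actually waiting.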

The Response Processor function

When Amazon Polly has completed work on a specific request, it posts a notification to the SNS response topic. This is immediately picked up by the final Lambda function in the sequence, the Response Processor. This function is responsible for updating the item and set records in DynamoDB, and for renaming the audio file in Amazon S3 to the requested file name.

If Amazon Polly reported success in synthesizing the audio file, then the Response Processor function simply moves the file to its final location. It updates the item record taskStatus to success and increments the success counter in the set record. If Amazon Polly reports failure, the function updates the item record with the reason for failure and increments the failed counter in the set record.
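The bookkeeping described above can be sketched as a small function that works on plain dictionaries in place of DynamoDB records. The record layout and the function name `apply_result` are hypothetical. We also assume that the notification is a JSON document carrying `taskStatus` and, on failure, `taskStatusReason`, and that a status of `COMPLETED` means success.

```python
def apply_result(set_record, item_record, notification):
    """Update an item record and its set's counters from one notification."""
    if notification["taskStatus"].upper() == "COMPLETED":
        item_record["taskStatus"] = "success"
        set_record["success"] += 1
    else:
        item_record["taskStatus"] = "failed"
        # Keep the reason so the failure can be diagnosed later.
        item_record["taskStatusReason"] = notification.get(
            "taskStatusReason", "unknown"
        )
        set_record["failed"] += 1
    return set_record, item_record


set_record = {"name": "romeo-juliet", "total": 6, "success": 0, "failed": 0}
item_record = {"taskId": "abc-123", "taskStatus": "submitted"}
apply_result(set_record, item_record, {"taskId": "abc-123", "taskStatus": "COMPLETED"})
print(set_record["success"], item_record["taskStatus"])
```

In a real table, counters like these would normally be incremented with an atomic update rather than a read-modify-write, because several invocations may finish at the same time.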

The Set Waiter workflow

To review, each of these Lambda functions runs only when triggered by an event:

  • The Set Processor is triggered when a set file is uploaded to the S3 bucket
  • The Item Processor is triggered when work orders appear in the SQS queue
  • The Response Processor is triggered when Amazon Polly publishes a message to the SNS topic

These functions can run concurrently, processing multiple items from multiple sets at the same time. Without an orchestration process, how do we know when a specific set is complete? How do we know if something went wrong?

The Set Waiter is a Step Functions workflow that’s responsible for watching a specific set to decide when it’s done, or to notify if a technical problem with the solution has left the set abandoned.

In the Step Functions Graph inspector, an in-process Set Waiter workflow looks like the following.

An instance of the Set Waiter is started by the Set Processor function for every submitted set, which passes a unique name identifying that set. The waiter loads the set record from the DynamoDB table in the load phase and checks to see if it’s complete in the check phase. If Amazon Polly still has tasks to process, the workflow waits a few seconds in the wait phase before starting again.

If every task in the set has been processed by Amazon Polly, the Set Waiter moves to the notify phase, which publishes a message to the completion SNS topic. If no changes have recently been made to an in-process set, the Set Waiter assumes that something is wrong and posts an abandoned message to the topic.
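The decision made in the check phase boils down to a three-way choice. The following Python sketch shows one way to express it. The record fields, the function name `check_set`, and the ten-minute abandonment threshold are all assumptions made for illustration, not values taken from the solution.

```python
def check_set(set_record, now, abandoned_after_seconds=600):
    """Decide what the waiter should do next for one set.

    Returns "complete" when every item has succeeded or failed,
    "abandoned" when nothing has changed for too long, else "wait".
    Times are seconds since the epoch.
    """
    processed = set_record["success"] + set_record["failed"]
    if processed >= set_record["total"]:
        return "complete"
    if now - set_record["last_updated"] > abandoned_after_seconds:
        return "abandoned"
    return "wait"


record = {"total": 6, "success": 4, "failed": 0, "last_updated": 1_000}
print(check_set(record, now=1_030))   # still in progress, recently updated
print(check_set(record, now=5_000))   # no progress for a long time
```

A set whose items all failed still counts as complete, because every task has been processed. Only a set that has stopped making progress is reported as abandoned.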

Clean up

You can leave the solution in your account for as long as you like. When it’s not in use, you pay only for the storage of the audio files in Amazon S3 and for the data in the DynamoDB tables. When you have text to synthesize, just upload a set file to the S3 bucket, and the solution takes it from there. You pay for the Lambda function invocations and the characters actually processed by Amazon Polly. Synthesizing all 1.1 million characters in Moby Dick, for example, costs less than $5 for the standard voices, and well under $20 for the higher-quality neural voices.
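The Moby Dick figures follow from simple per-character arithmetic. The following Python sketch assumes list prices of $4.00 per million characters for standard voices and $16.00 per million for neural voices. Treat these rates as assumptions and check the current Amazon Polly pricing page, because prices can change and free tier allowances aren't considered here.

```python
# Assumed list prices in USD per one million characters; verify against
# the current Amazon Polly pricing page before relying on them.
STANDARD_PRICE_PER_MILLION = 4.00
NEURAL_PRICE_PER_MILLION = 16.00


def synthesis_cost(characters, price_per_million):
    """Cost in USD of synthesizing the given number of characters once."""
    return characters / 1_000_000 * price_per_million


moby_dick_characters = 1_100_000
standard = synthesis_cost(moby_dick_characters, STANDARD_PRICE_PER_MILLION)
neural = synthesis_cost(moby_dick_characters, NEURAL_PRICE_PER_MILLION)
print(f"standard: ${standard:.2f}, neural: ${neural:.2f}")
```

Under these assumed rates, the standard voices come to about $4.40 and the neural voices to about $17.60, and replaying the stored audio adds no further synthesis charge.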

If you decide not to use the solution again, you can delete all its resources using AWS CloudFormation:

$ aws cloudformation delete-stack --stack-name amazon-polly-async-batch

Conclusion

In this post, we described a serverless, event-driven solution for submitting large amounts of text phrases for Amazon Polly to process asynchronously. With this approach, you can keep your costs low by paying only once for synthesis, no matter how many times you play the generated audio files.

You can deploy the solution to your account in minutes as an AWS SAM application. You specify the text to be converted in YAML files called set files. When a set file is uploaded to the solution’s S3 bucket (either manually by a human, or automatically by a code pipeline), a series of Lambda functions, namely the Set Processor, Item Processor, and Response Processor, work together to submit the tasks to Amazon Polly and collect the audio files for you. When all the work has been completed, a notification is published to an SNS topic.

The solution is developed as an open source project on GitHub. We welcome your feature requests, bug reports, or contributions. Try this out on your own and let us know what you think in the comments. To learn more about how Amazon Polly can help you, visit our webpage!


About the Authors

Jon Peterson is a Senior Solutions Architect with AWS. He lives outside of Chicago with his wife and two children.

Prateek Jain is a Solutions Architect with AWS, based out of Atlanta, Georgia. He is passionate about Cloud and helping customers build amazing solutions on AWS.