AWS Machine Learning Blog

Transcribe speech to text in real time using Amazon Transcribe with WebSocket

October 2024: This post was reviewed and updated for accuracy.

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to applications. In November 2018, we added streaming transcriptions over HTTP/2 to Amazon Transcribe, enabling users to pass a live audio stream to the service and receive text transcripts in real time. Amazon Transcribe also supports real-time transcription over the WebSocket protocol. WebSocket support makes streaming speech-to-text through Amazon Transcribe accessible to a wider user base, especially those who want to build browser-based or mobile applications.

In this blog post, we assume that you are familiar with our streaming transcription service running over HTTP/2 and focus on showing you how to use the real-time offering over WebSocket. For reference on using HTTP/2, you can read our previous blog post, Amazon Transcribe now supports real-time transcriptions, and the documentation on Transcribing streaming audio.

What is WebSocket?

WebSocket is a full-duplex communication protocol built over TCP, standardized by the IETF as RFC 6455 in 2011. WebSocket is suitable for long-lived connections where both the server and the client can transmit data over the same connection at the same time. It is also practical for cross-domain usage. Voila! No need to worry about cross-origin resource sharing (CORS) as there would be when using HTTP.
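
As a quick illustration of that full-duplex behavior, here is a minimal browser-side sketch (wss://example.com/stream is a placeholder endpoint):

// Open a WebSocket connection; either side can send at any time afterward.
const ws = new WebSocket('wss://example.com/stream'); // placeholder URL

ws.onopen = () => ws.send('hello from the client');        // client -> server
ws.onmessage = (event) => console.log('got:', event.data); // server -> client
// Both directions share the single TCP connection until either side closes it.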

Using Amazon Transcribe streaming with WebSocket

To use Amazon Transcribe’s StartStreamTranscriptionWebSocket API, you first need to authorize your IAM user to use Amazon Transcribe streaming over WebSocket. Go to the AWS Management Console, navigate to Identity and Access Management (IAM), and attach the following inline policy to your user. Refer to “To embed an inline policy for a user or role” for instructions on how to add permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "transcribestreaming",
            "Effect": "Allow",
            "Action": "transcribe:StartStreamTranscriptionWebSocket",
            "Resource": "*"
        }
    ]
}

Your upgrade request must be pre-signed with your AWS credentials using AWS Signature Version 4. The request must contain the required parameters, namely sample-rate, language-code, and media-encoding, and can optionally include vocabulary-name to use a custom vocabulary. The StartStreamTranscriptionWebSocket API supports all of the languages that Amazon Transcribe streaming supports today. After your connection is upgraded to WebSocket, you send your audio chunks as event-stream-encoded AudioEvent messages in binary WebSocket frames. The response you receive is the transcript JSON, which is also event-stream encoded. For more details, refer to our technical documentation on Event stream encoding.
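
For illustration, here is one way to generate such a pre-signed URL in Node.js. This is a minimal sketch, not the sample application’s code: it assumes the SigV4 signer from the AWS SDK’s @smithy packages, credentials supplied through environment variables, and example values for the required parameters.

// presign.js — sketch of building a pre-signed Transcribe streaming WebSocket URL.
// Assumed dependencies: npm install @smithy/signature-v4 @smithy/protocol-http
//   @aws-crypto/sha256-js @aws-sdk/credential-providers
const { SignatureV4 } = require('@smithy/signature-v4');
const { HttpRequest } = require('@smithy/protocol-http');
const { Sha256 } = require('@aws-crypto/sha256-js');
const { fromEnv } = require('@aws-sdk/credential-providers');

async function buildPresignedUrl(region = 'us-west-2') {
    const hostname = `transcribestreaming.${region}.amazonaws.com`;
    const signer = new SignatureV4({
        credentials: fromEnv(), // assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set
        region,
        service: 'transcribe',
        sha256: Sha256,
    });

    const request = new HttpRequest({
        method: 'GET',
        protocol: 'wss:',
        hostname,
        port: 8443,
        path: '/stream-transcription-websocket',
        headers: { host: `${hostname}:8443` },
        query: {
            'language-code': 'en-US', // required
            'media-encoding': 'pcm',  // required
            'sample-rate': '16000',   // required
        },
    });

    // presign() appends the SigV4 signature as query-string parameters
    const signed = await signer.presign(request, { expiresIn: 300 });
    const qs = Object.entries(signed.query)
        .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
        .join('&');
    return `wss://${hostname}:8443${signed.path}?${qs}`;
}

A browser or mobile client can then open a plain WebSocket connection to the returned URL, which avoids shipping long-term credentials to the client.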

Solution Overview

Whether you’re building a live captioning system, a voice-controlled interface, or a meeting transcription tool, the ability to convert speech to text in real time can significantly improve your application’s functionality. We created a sample static website to showcase how to leverage Amazon Transcribe’s WebSocket API to create a real-time transcription service using Node.js. The complete sample code is available on GitHub.

Prerequisites

Before you use this feature from your AWS account, ensure that you have the following resources set up:

  • An AWS account
  • An IAM user with the transcribe:StartStreamTranscriptionWebSocket permission
  • Node.js installed on your local machine

Implementation

Here are the implementation steps:

  1. Set up a Node.js server with Express and Socket.IO.
  2. Create a frontend for the transcription service.
  3. Implement real-time audio streaming from the browser to the server.
  4. Use Amazon Transcribe’s WebSocket API to perform real-time transcription.
  5. Run the application.

Let’s go through each of these steps in detail.

Set up a Node.js server with Express and Socket.IO

To create a robust real-time transcription platform, set up a Node.js server using Express and Socket.IO. This allows us to handle HTTP requests and maintain WebSocket connections for seamless, bidirectional communication between the client and the server.

First, clone the git repository amazon-transcribe-websocket that contains the code using the below command:

git clone https://github.com/aws-samples/amazon-transcribe-websocket.git

Now, navigate into the cloned directory and initialize a Node.js project with default settings:

cd amazon-transcribe-websocket
npm init -y

Next, install the necessary dependencies:

npm install express socket.io @aws-sdk/client-transcribe-streaming

Let’s break down these dependencies:

  • express: A minimal and flexible Node.js web application framework that provides a robust set of features for web and mobile applications.
  • http: Node.js’s built-in module for creating an HTTP server (built into Node.js, so it doesn’t need to be installed).
  • socket.io: A library that enables real-time, bidirectional and event-based communication between the browser and the server.
  • @aws-sdk/client-transcribe-streaming: The official AWS SDK for JavaScript, specifically for interacting with Amazon Transcribe’s streaming API.

Now, create a file named server.js and add the following code:

const express = require('express');
const http = require('http');
const path = require('path');
const { Server } = require('socket.io');
const { TranscribeStreamingClient, StartStreamTranscriptionCommand } = require("@aws-sdk/client-transcribe-streaming");

const app = express();
const server = http.createServer(app);
const io = new Server(server);

app.use(express.static(path.join(__dirname)));

app.get('/', (req, res) => {
    res.sendFile(path.join(__dirname, 'index.html'));
});

const transcribeClient = new TranscribeStreamingClient({
    region: "us-west-2", // Ensure this matches your AWS region
});

io.on('connection', (socket) => {
    console.log('A user connected');

    let audioStream;
    let lastTranscript = '';
    let isTranscribing = false;

    socket.on('startTranscription', async () => {
        console.log('Starting transcription');
        isTranscribing = true;
        let buffer = Buffer.from('');

        audioStream = async function* () {
            while (isTranscribing) {
                const chunk = await new Promise(resolve => socket.once('audioData', resolve));
                if (chunk === null) break;
                buffer = Buffer.concat([buffer, Buffer.from(chunk)]);
                console.log('Received audio chunk, buffer size:', buffer.length);

                while (buffer.length >= 1024) {
                    yield { AudioEvent: { AudioChunk: buffer.slice(0, 1024) } };
                    buffer = buffer.slice(1024);
                }
            }
        };

        const command = new StartStreamTranscriptionCommand({
            LanguageCode: "en-US",
            MediaSampleRateHertz: 44100, // Must match the sample rate of the audio captured in the browser
            MediaEncoding: "pcm",
            AudioStream: audioStream()
        });

        try {
            console.log('Sending command to AWS Transcribe');
            const response = await transcribeClient.send(command);
            console.log('Received response from AWS Transcribe');
            
            for await (const event of response.TranscriptResultStream) {
                if (!isTranscribing) break;
                if (event.TranscriptEvent) {
                    console.log('Received TranscriptEvent:', JSON.stringify(event.TranscriptEvent));
                    const results = event.TranscriptEvent.Transcript.Results;
                    if (results.length > 0 && results[0].Alternatives.length > 0) {
                        const transcript = results[0].Alternatives[0].Transcript;
                        const isFinal = !results[0].IsPartial;

                        if (isFinal) {
                            console.log('Emitting final transcription:', transcript);
                            socket.emit('transcription', { text: transcript, isFinal: true });
                            lastTranscript = transcript;
                        } else {
                            const newPart = transcript.substring(lastTranscript.length);
                            if (newPart.trim() !== '') {
                                console.log('Emitting partial transcription:', newPart);
                                socket.emit('transcription', { text: newPart, isFinal: false });
                            }
                        }
                    }
                }
            }
        } catch (error) {
            console.error("Transcription error:", error);
            socket.emit('error', 'Transcription error occurred: ' + error.message);
        }
    });

    socket.on('audioData', (data) => {
        if (isTranscribing) {
            // The async generator above consumes these events via socket.once;
            // this handler only logs the size of each incoming chunk.
            console.log('Received audioData event, data size:', data.byteLength);
        }
    });

    socket.on('stopTranscription', () => {
        console.log('Stopping transcription');
        isTranscribing = false;
        audioStream = null;
        lastTranscript = '';
    });

    socket.on('disconnect', () => {
        console.log('User disconnected');
        isTranscribing = false;
        audioStream = null;
    });
});

const PORT = process.env.PORT || 3000;
server.listen(PORT, () => {
    console.log(`Server is running on http://localhost:${PORT}`);
});

The server.js sets up a WebSocket connection using Socket.IO and handles the communication between the client and Amazon Transcribe. This architecture allows multiple clients to connect simultaneously, each with their own transcription stream, making it suitable for applications ranging from personal voice assistants to large-scale transcription services.

Create a frontend for the transcription service

Now that the server is set up, create a user-friendly interface for the transcription service. For ease of understanding, we built a simple yet functional HTML page that allows users to start and stop transcription and view the results in real time.

Create files named index.html and style.css in the same directory as your server.js file, and add the following code to index.html (the styles for style.css are available in the sample repository):

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Real-time Transcription</title>
    <script src="/socket.io/socket.io.js"></script>
    <link rel="stylesheet" href="style.css">
</head>
<body>
    <div class="page-container">
        <div class="container">
            <h1>Real-time Audio Transcription</h1>
            <div class="status">
                Status: <span id="statusText">Not recording</span> <span id="statusIndicator">⚪</span>
            </div>
            <div class="button-container">
                <button id="startButton">Start Transcription</button>
                <button id="stopButton">Stop Transcription</button>
                <button id="clearButton">Clear Transcript</button>
            </div>
            <div id="transcript"></div>
            <div class="info-section">
                <h2>How to use:</h2>
                <ul>
                    <li>Click "Start Transcription" to begin recording.</li>
                    <li>Speak clearly into your microphone.</li>
                    <li>Watch as your speech is transcribed in real-time.</li>
                    <li>Click "Stop Transcription" when you're done.</li>
                    <li>Use "Clear Transcript" to remove all transcribed text.</li>
                </ul>
            </div>
            <div class="footer">
                <p>© Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. </p>
            </div>
        </div>
    </div>
    <script>
        const socket = io();
        let audioContext;
        let audioInput;
        let processor;
        const startButton = document.getElementById('startButton');
        const stopButton = document.getElementById('stopButton');
        const clearButton = document.getElementById('clearButton');
        const statusText = document.getElementById('statusText');
        const statusIndicator = document.getElementById('statusIndicator');
        const transcript = document.getElementById('transcript');

        let currentTranscript = '';
        let lastFinalIndex = 0;

        startButton.addEventListener('click', startRecording);
        stopButton.addEventListener('click', stopRecording);
        clearButton.addEventListener('click', clearTranscript);

        function updateStatus(status) {
            console.log('Status updated:', status);
            statusText.textContent = status;
            statusIndicator.textContent = status === 'Recording' ? '🔴' : '⚪';
        }

        async function startRecording() {
            console.log('Start button clicked');
            try {
                const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
                console.log('Microphone access granted');
                audioContext = new AudioContext();
                audioInput = audioContext.createMediaStreamSource(stream);
                // Note: createScriptProcessor is deprecated; AudioWorklet is the modern replacement
                processor = audioContext.createScriptProcessor(1024, 1, 1);
                audioInput.connect(processor);
                processor.connect(audioContext.destination);

                processor.onaudioprocess = (e) => {
                    const float32Array = e.inputBuffer.getChannelData(0);
                    const int16Array = new Int16Array(float32Array.length);
                    for (let i = 0; i < float32Array.length; i++) {
                        int16Array[i] = Math.max(-32768, Math.min(32767, Math.floor(float32Array[i] * 32768)));
                    }
                    console.log('Sending audio chunk to server, size:', int16Array.buffer.byteLength);
                    socket.emit('audioData', int16Array.buffer);
                };

                socket.emit('startTranscription');
                console.log('startTranscription event emitted');
                updateStatus('Recording');
            } catch (error) {
                console.error('Error accessing microphone:', error);
                updateStatus('Error: ' + error.message);
            }
        }

        function stopRecording() {
            console.log('Stop button clicked');
            if (audioContext && audioContext.state !== 'closed') {
                audioInput.disconnect();
                processor.disconnect();
                audioContext.close();
                socket.emit('stopTranscription');
                updateStatus('Not recording');
            }
        }

        function clearTranscript() {
            console.log('Clear button clicked');
            currentTranscript = '';
            lastFinalIndex = 0;
            transcript.textContent = '';
        }

        socket.on('transcription', data => {
            console.log('Received transcription:', data);
            if (data.isFinal) {
                currentTranscript += data.text + ' ';
                lastFinalIndex = currentTranscript.length;
                transcript.textContent = currentTranscript;
            } else {
                // Append the in-progress partial text to the confirmed transcript
                transcript.textContent = currentTranscript + data.text;
            }
        });

        socket.on('error', errorMessage => {
            console.error('Server error:', errorMessage);
            transcript.textContent += '\nError: ' + errorMessage;
        });

        console.log('Client-side script loaded');
    </script>
</body>
</html>

This creates a simple interface with Start Transcription and Stop Transcription buttons, and an area to display the transcription results. The embedded script handles the core functionality:

  • It sets up a WebSocket connection using Socket.IO.
  • It implements the start and stop transcription functions, which are triggered by button clicks.
  • The start function requests access to the user’s microphone, sets up a Web Audio API processing chain to capture audio, and sends audio chunks to the server.
  • The stop function ends the recording and tells the server to stop transcription.
  • It listens for transcription results from the server and displays them in real-time.

When a user clicks Start Transcription, the page begins capturing audio from their microphone and sending it to the server. As the server receives transcription results from Amazon Transcribe, it sends them back to the client, where they are displayed on the page in real time.

Implement real-time audio streaming from the browser to the server

A crucial part of this real-time transcription service is the ability to stream audio from the user’s browser to the server. This is handled by the startRecording function in the HTML file. Let’s break down this process in more detail:

  1. Accessing the user’s microphone: We use the mediaDevices.getUserMedia API to request access to the user’s microphone. This is a modern web API that allows web applications to access media devices:
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true }); 

This line prompts the user for permission to use their microphone and returns a MediaStream object if the access is granted.

  2. Processing and sending audio data: We set up an onaudioprocess event handler on our processor node to capture and send audio data:
    processor.onaudioprocess = (e) => {
        const float32Array = e.inputBuffer.getChannelData(0);
        const int16Array = new Int16Array(float32Array.length);
        for (let i = 0; i < float32Array.length; i++) {
            int16Array[i] = Math.max(-32768, Math.min(32767, Math.floor(float32Array[i] * 32768)));
        }
        socket.emit('audioData', int16Array.buffer);
    };
    • This function is called repeatedly with chunks of audio data.
    • We convert the audio data from 32-bit floating point to 16-bit integer format, which is more commonly used in audio processing and reduces the amount of data we need to send.
    • We then emit this data to the server using our Socket.IO connection.
  3. Starting the audio stream: We connect our audio nodes and emit a startTranscription event to the server:
    audioInput.connect(processor);
    processor.connect(audioContext.destination);
    socket.emit('startTranscription');

This sets up the audio processing pipeline and signals the server to start a new transcription session.

This implementation allows for low-latency, real-time streaming of audio data from the browser to our server. By processing the audio in small chunks and sending it immediately, we enable near-instantaneous transcription of the user’s speech.

The use of the Web Audio API gives us fine-grained control over the audio data, allowing us to optimize it before sending (such as converting to 16-bit integers, or downsampling, as sketched below). This can help reduce bandwidth usage and processing load on both the client and server.
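
As an illustration of that kind of optimization, here is a deliberately naive downsampler that drops samples to reduce a 44,100 Hz Float32 buffer to 16,000 Hz before the Int16 conversion. This is a sketch only: a production implementation should low-pass filter first to avoid aliasing, and if you adopt a lower rate you must also change MediaSampleRateHertz on the server to match.

// Naive nearest-sample downsampler — roughly 2.75x less data, at the cost of
// aliasing artifacts. Illustrative only.
function downsample(float32Array, inputRate = 44100, targetRate = 16000) {
    const ratio = inputRate / targetRate;
    const newLength = Math.floor(float32Array.length / ratio);
    const result = new Float32Array(newLength);
    for (let i = 0; i < newLength; i++) {
        result[i] = float32Array[Math.floor(i * ratio)];
    }
    return result;
}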

Use Amazon Transcribe’s WebSocket API to perform real-time transcription

In the server.js file, we use the @aws-sdk/client-transcribe-streaming library to seamlessly interact with Amazon Transcribe for real-time speech-to-text conversion. This integration allows us to process audio streams and receive transcription results on the fly. Let’s break down the key components:

  1. Setting up the TranscribeStreamingClient: Initialize a client to communicate with Amazon Transcribe:
    const transcribeClient = new TranscribeStreamingClient({
      region: "us-west-2", // Replace with your AWS region
    });

This client is configured with your AWS region and will handle all communication with the Amazon Transcribe service.

  2. Creating an audio stream: Set up an async generator function to handle the incoming audio data when a transcription session is started:
    let audioStream = async function* () {
      while (isTranscribing) {
        const chunk = await new Promise(resolve => socket.once('audioData', resolve));
        if (chunk === null) break;
        buffer = Buffer.concat([buffer, Buffer.from(chunk)]);
        while (buffer.length >= 1024) {
          yield { AudioEvent: { AudioChunk: buffer.slice(0, 1024) } };
          buffer = buffer.slice(1024);
        }
      }
    };

This function:

    • Waits for audio chunks from the client
    • Buffers the audio data
    • Yields audio chunks in the format expected by Amazon Transcribe
  3. Processing transcription results: Use an async iterator to process the transcription results as they arrive:
    for await (const event of response.TranscriptResultStream) {
      if (!isTranscribing) break;
      if (event.TranscriptEvent) {
        const results = event.TranscriptEvent.Transcript.Results;
        if (results.length > 0 && results[0].Alternatives.length > 0) {
          const transcript = results[0].Alternatives[0].Transcript;
          const isFinal = !results[0].IsPartial;
          // Process and emit the transcription result
          // ...
        }
      }
    }

This loop:

    • Checks each event for transcription data
    • Extracts the transcribed text
    • Determines if the transcription is final or partial
    • Processes and emits the result back to the client

This implementation creates a robust, real-time transcription system. It efficiently handles the streaming of audio data from the client to Amazon Transcribe and the return of transcription results to the client. The use of async iterators and generators allows for smooth, non-blocking processing of the audio stream and transcription results.

By differentiating between partial and final results, you provide a responsive user experience where transcriptions appear quickly and are refined in real time. This approach balances immediacy with accuracy, giving users a sense of the transcription as it’s happening while still providing polished, final results.
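
To make the distinction concrete, here is a trimmed-down sketch of a TranscriptEvent payload, showing only the fields the server code reads (the text value is illustrative; see the Amazon Transcribe streaming documentation for the full schema):

{
    "Transcript": {
        "Results": [
            {
                "IsPartial": true,
                "Alternatives": [
                    { "Transcript": "hello wor" }
                ]
            }
        ]
    }
}

When the same utterance later arrives with IsPartial set to false, the text is final, and the code appends it to the running transcript.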

Run the application

Now that you’ve set up both the server and client-side code, it’s time to bring the real-time transcription application to life. Follow these steps to run and use the application:

  1. Set up AWS credentials: Before running the application, ensure your AWS credentials are properly configured. This is crucial for authenticating with Amazon Transcribe.
    • Install the AWS CLI (Command Line Interface), if you haven’t already.
    • Run aws configure in your terminal.
    • Enter your AWS Access Key ID, Secret Access Key, and default region (e.g., us-west-2).
    • Alternatively, you can set environment variables:
      export AWS_ACCESS_KEY_ID=your_access_key
      export AWS_SECRET_ACCESS_KEY=your_secret_key
      export AWS_REGION=your_region
  2. Start the server: Open a terminal, navigate to your project directory, and start the Node.js server:
    node server.js

You should see a message saying “Server is running on http://localhost:3000”. Open your preferred web browser and navigate to http://localhost:3000. Click the Start Transcription button. If it’s your first time, the browser will ask for permission to access your microphone; allow this access. You’ll see your speech transcribed in real time in the transcript area. To stop transcribing, click the Stop Transcription button.

Remember, this application is running locally on your machine. If you plan to use it for a broader use case, consider your hosting options and the security measures needed to protect your AWS credentials. Please refer to this documentation on enforcing access management for AWS resources.

Conclusion

In this post, we demonstrated how to create a real-time speech-to-text application using Amazon Transcribe’s WebSocket API, Node.js, and a simple frontend. We showed how to create a server that handles WebSocket connections, stream audio data from the browser to the cloud, and display live transcription results. This setup allows for seamless integration of real-time transcription capabilities into web applications, opening up a world of possibilities for voice-enabled interfaces, accessibility features, and more.

Remember, this is just a starting point. You can extend this basic implementation to include features like language selection, custom vocabulary, or even multi-speaker diarization, as sketched below. You can learn more about Amazon Transcribe in the AWS documentation and explore Socket.IO for advanced real-time communication features.
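
As a sketch of one such extension: the StartStreamTranscriptionCommand input also accepts optional parameters such as VocabularyName and ShowSpeakerLabel, so the command construction in server.js could be extended along these lines (the vocabulary name is a placeholder you would create in your account first):

// Sketch: enabling a custom vocabulary and speaker partitioning (diarization).
const command = new StartStreamTranscriptionCommand({
    LanguageCode: "en-US",                  // could be driven by a UI selector
    MediaSampleRateHertz: 44100,
    MediaEncoding: "pcm",
    VocabularyName: "my-custom-vocabulary", // placeholder custom vocabulary
    ShowSpeakerLabel: true,                 // label speakers in the results
    AudioStream: audioStream()
});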


About the authors

Bhaskar Bagchi is an engineer in the Amazon Transcribe service team. Outside of work, Bhaskar enjoys photography and singing.

Karan Grover is an engineer in the Amazon Transcribe service team. Outside of work, Karan enjoys hiking and is a photography enthusiast.

Paul Zhao is a Product Manager at AWS Machine Learning. He manages the Amazon Transcribe service. Outside of work, Paul is a motorcycle enthusiast and avid woodworker.

Miriam Lebowitz is a Solutions Architect, and specializes in machine learning at AWS. Outside of work, she enjoys baking, traveling, and spending quality time with friends and family.

Nishant Dhiman is a Senior Solutions Architect at AWS with an extensive background in Serverless, Generative AI, Security and Mobile platform offerings. He is a voracious reader and a passionate technologist. He loves to interact with customers and believes in giving back to community by learning and sharing. Outside of work, he likes to keep himself engaged with podcasts, calligraphy and music.

Achintya Veer Singh is a Solutions Architect at AWS based in Bangalore. He works with AWS customers to address their business challenges by designing secure, performant, and scalable solutions leveraging the latest cloud technologies. He is passionate about technology and enjoys building and experimenting with AI/ML and Gen AI. Outside of work, he enjoys cooking, reading non-fiction books, and spending time with his family.


Audit History

Last reviewed and updated in August 2024 by Miriam Lebowitz | Solutions Architect, Nishant Dhiman | Senior Solutions Architect, and Achintya Veer Singh | Solutions Architect.