基于 Claude 3 和 WhisperX 构建 ASR 方案（一）

1. 前言

人工智能的迅猛发展使语音识别成为日常生活和工作中不可或缺的技术。作为该领域的杰出开源项目，WhisperX 以其高效、准确和稳定的性能赢得广泛关注。本文将深入探讨 WhisperX 的一个关键特性——说话人分离，剖析其实现原理和应用场景，并指导您如何在 AWS 上部署和使用该模型。Whisper 是一种先进的深度学习语音识别技术，能将语音精确转换为文字。其核心优势在于高效的神经网络结构和创新的训练方法，使其能应对各种复杂场景，如嘈杂环境、多样口音和不同语速。WhisperX 作为 Whisper 的增强版，专注于视频字幕的生成和对齐。它不仅能将语音转化为文字，还能实现文字与视频帧的精确同步，生成带时间戳的字幕文件，大大提升了视频内容的可理解性和编辑便利性。说话人分离是 WhisperX 的一项重要功能，能从音频中识别不同说话人的身份。这在语音识别领域是一项具有挑战性的任务，因为音频往往包含多个说话人的声音，且他们的声音特征可能相似。准确区分说话人对提高语音转录的精确度至关重要。

应用场景

WhisperX 的说话人分离技术可以被广泛应用于多个领域，例如：

会议记录

在会议记录中，说话人分离技术可以帮助用户快速找到特定发言者的讲话内容。通过标注每段话的说话人，用户可以更容易地跟踪会议进程，了解不同人员的观点和发言。

司法取证

在司法取证中，该技术可以帮助调查人员识别出音频中的不同声音来源。通过分析音频中每个说话人的声音特征，可以确定嫌疑人或证人的身份，为案件调查提供重要线索。

智能家居

在智能家居中，说话人分离技术可以帮助用户区分不同家庭成员的声音指令，从而提高智能家居设备的个性化服务水平。例如，智能音箱可以根据说话人的身份，播放个性化的音乐或新闻推送。

视频字幕

在视频字幕领域，说话人分离技术可以帮助准确地标注每个人物的对白，使字幕更加清晰易读。这对于电影、电视剧等视频内容的观看体验至关重要。

2. 架构介绍

整体架构图：

2.1 架构介绍

本方案采用了多层架构，结合了前端用户界面、强大的后端 GPU 和 WhisperX 音频处理能力以及业界先进的 Claude 3 大模型。下面是各个组件的说明：

用户界面（UI 前端）：提供直观的音频输入界面，用户可以在这里输入或上传需要处理的音频数据。来自 UI 前端的音频数据被发送到下一个处理阶段。
WhisperX 处理和分析：WhisperX 直接运行在 EC2 实例上实现语音识别和处理服务。它对接收到的音频数据进行分析和处理，能将语音精确转换为文字。
Amazon Bedrock：转换得到的文本数据通过 Amazon Bedrock Claude 3 进行总结，可以对转录的文本识别语言种类，理解情节以及分析对话中每个人的情感和内容。 Claude 3 是一种新型的大型语言模型，由 Anthropic 公司开发。现在 Amazon Bedrock 上已经提供 Claude 3 的 Sonnet、Haiku 和 Opus 3个模型，本文将使用 Haiku 模型。与其他语言模型相比，Claude 3 在以下几个方面表现出色：更强的推理和分析能力、更好的常识理解、更高的鲁棒性和一致性、更注重安全性和可控性。

2.2 WhisperX 介绍

WhisperX 的流程为：VAD 分析 –> fasterWhisper 转写（without timestamp）–> 音素分析模型进行音频分析 –> 音素分析结果与转写结果合并重新对齐时间戳 –> speaker-brain 相关模型声源分析 –> 分析结果与对齐结果组合，为字幕添加说话人信息 –> 输出字幕文件。

WhisperX 的实现原理

为了解决说话人分离的问题，WhisperX 采用了多种先进的技术和方法。

特征提取

WhisperX 利用了深度学习技术来提取音频中的特征，如音调、音色、语速等。这些特征能够有效地描述说话人的声音特征。通过训练一个深度学习模型，WhisperX 能够从音频中提取出这些特征，为后续的说话人分离提供必要的信息。

聚类算法

WhisperX 利用了聚类算法来对提取出的特征进行分类。这些算法能够将相似的特征归为同一类，从而实现说话人的分离。具体来说，WhisperX 使用了无监督学习的聚类算法，如 K-means 或 DBSCAN 等。通过对训练数据的聚类分析，这些算法能够自动地学习到不同说话人的声音特征，从而为后续的说话人分离提供参考。

动态时间规整（DTW）

除了聚类算法外，WhisperX 还利用了动态时间规整（DTW）算法来对不同说话人的声音进行匹配。DTW 算法能够有效地处理不同长度和节奏的音频序列，从而提高说话人分离的准确性。它通过计算两个时间序列之间的最小距离，来判断它们是否属于同一个说话人。

语音活动检测（VAD）

另一个重要的技术是语音活动检测（VAD）。VAD 能够从音频中区分出人声和非人声部分，如背景噪音、静音等。通过去除非人声部分，WhisperX 可以更好地关注有用的语音信号，提高说话人分离的效果。

3. 部署项目

本项目实现了在 AWS 上自动部署和使用 WhisperX 并使用 Claude 对转录结果进行总结，项目包括以下部分：

Streamlit UI

一个基于 Streamlit 的 Python 应用程序，提供简单的 Web 界面使用 WhisperX 模型将音频转换为文本。用户可以通过该界面上传音频文件或输入 YouTube 视频链接，然后 WhisperX 会自动进行语音转录和说话人分离。

AWS CloudFormation

一个 AWS CloudFormation YAML 文件，自动提供 AWS G4 实例，并安装 Nvidia 驱动程序和运行 Streamlit UI 所需的 WhisperX 相关库。CloudFormation 可以一键创建所需的 VPC 网络环境，简化部署流程。本文源代码已经发布在 https://github.com/superyhee/whisper-on-aws-jumpstart，可以在亚马逊云科技的 EC2 上进行一键部署。

安装指南：

#克隆代码仓库

git clone https://github.com/superyhee/whisper-on-aws-jumpstart

注册 huggingface 账号，获取 huggingface token
接受相关模型的用户协议，包括分割、语音活动检测（VAD）和说话人分离模型

通过 AWS 控制台创建活动 EC2 密钥对，用于远程连接实例
从 EC2 控制台获取 ubuntu 系统的 AMI ID（用 ubuntu 22.04 版本）
在 CloudFormation 控制台中创建堆栈
设置项目参数，如实例类型、密钥对等
等待大约 10 分钟，让 EC2 实例初始化环境并安装所需库，包括 NVIDIA 驱动程序和 WhisperX
通过 SSH 连接到 EC2 实例，运行 nvidia-smi 检查 NVIDIA 系统是否正常运行
导航到 whisper 目录，默认情况下 python3 ui.py 服务已在运行。您可以使用 sudo systemctl stop myapp.service 停止服务
通过浏览器访问输出的 IP 地址查看 Streamlit UI，例如 http://{ip_address}:8501

4. 项目使用指南

自动下载并将 YouTube 视频转录为文本。只需在界面上输入 YouTube 视频链接，WhisperX 会自动下载视频，进行语音转录和说话人分离

上传 MP3 或其他音频文件并转录为文本。您可以直接上传本地音频文件，WhisperX 会对其进行处理并输出转录结果
- 在转录结果中，不同说话人的发言会使用不同颜色进行标注，方便区分
- 您可以调整一些参数，如输出格式、语言模型等，以获得更好的转录效果

通过 Amazon Bedrock Claude 3 对转录文本进行总结和分析

下面是 Claude 3 的提示词，可以对转录的文本识别语言种类，理解情节以及分析对话中每个人的情感和内容：

        system_prompt = """
你是一个文案专员，请认真阅读其中的内容<transcription_text>标签中包含的上下文内容，并按照以下要求进行总结
- 识别 <transcription_text> 中的语言种类，用相同语言进行总结和返回
- 理解 <transcription_text> 中的主要情节和场景，用精简的语言总结内容
- 如果 <transcription_text> 中有多个speaker，请分别总结每个人的情感情绪和想要表达的中心思想

以下是上下文:
<transcription_text>
{speak_context}
</transcription_text>
"""

下面例子是一个面试对话的分析和总结：

5. 项目代码说明

whisperx_transcribe.py

import whisperx
import gc
import os
import time
from dotenv import load_dotenv
load_dotenv()

device = "cuda"
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 3. Assign speaker labels
# diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ['HF_TOKEN'], device=device)

diarize_model = whisperx.DiarizationPipeline("pyannote/speaker-diarization-3.1",use_auth_token=os.environ['HF_TOKEN'], device=device)
whisper_models = {}
model_a = {}
metadata = {}

def convert_format(data):
    result = []
    for item in data:
        new_item = {
            "start": item["start"],
            "end": item["end"],
            "text": item["text"],
            "speaker": item["words"][0]["speaker"]
        }
        result.append(new_item)
    return result

# audio_file must be mp3 or wav
def transcribe(audio_file, model_needed, language=None):
    batch_size = 8 # reduce if low on GPU mem

    # asr_options = {
    #     'beam_size': 5, 'patience': None, 'length_penalty': 1.0, 'temperatures': (0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0),
    #     'compression_ratio_threshold': 2.4, 'log_prob_threshold': -1.0, 'no_speech_threshold': 0.6, 'condition_on_previous_text': False,
    #     'initial_prompt': None, 'suppress_tokens': [-1], 'suppress_numerals': False,
    #     "max_new_tokens": None, "clip_timestamps": None, "hallucination_silence_threshold": None,
    #     "repetition_penalty": 1,
    #     "prompt_reset_on_temperature": 0.5,
    #     "no_repeat_ngram_size": 0
    # }
    # vad_options = {'vad_onset': 0.5, 'vad_offset': 0.363}

    if not model_needed in whisper_models:
        whisper_models[model_needed] = whisperx.load_model(
            model_needed, device=device, compute_type=compute_type )

    # 1. Transcribe with original whisper (batched)
    model = whisper_models[model_needed]

    audio = whisperx.load_audio(audio_file)
    transcribe_args = {}
    if language != None:
        transcribe_args["language"] = language
    start_time = time.time()
    result = model.transcribe(audio, batch_size=batch_size)
    transcribe_time = time.time()
    execution_time = transcribe_time - start_time
    print(f"transcribe_time: {execution_time} seconds")
    #print(result["segments"]) # before alignment

    # delete model if low on GPU resources
    # import gc; gc.collect(); torch.cuda.empty_cache(); del model

    # 2. Align whisper output
    try:
        start_time = time.time()
        model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
        load_align_model_time = time.time()
        execution_time = load_align_model_time - start_time
        print(f"load_align_model_time: {execution_time} seconds")
        result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
        align_time = time.time()
        execution_time = align_time - load_align_model_time
        print(f"align_time: {execution_time} seconds")
    except:
        print("Fail to align", result["language"], "lang")

    #print(result["segments"]) # after alignment

    # delete model if low on GPU resources
    # import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

    # add min/max number of speakers if known
    diarize_segments = diarize_model(audio_file)
    # diarize_model(audio_file, min_speakers=min_speakers, max_speakers=max_speakers)

    result = whisperx.assign_word_speakers(diarize_segments, result)
    # print(result)

    #for segment in result["segments"]:
    return convert_format(result["segments"])

在 transcribe 函数中：

首先检查是否已经加载了指定的语音识别模型，如果没有，则加载该模型
使用 whisperx.load_audio 加载音频文件
调用语音识别模型的 transcribe 方法进行语音识别，并记录执行时间
尝试加载对齐模型，并使用 whisperx.align 函数对语音识别结果进行对齐，记录执行时间
使用 diarize_model 进行说话人分离
调用 whisperx.assign_word_speakers 函数将说话人标签分配给每个单词
最后，使用 convert_format 函数将结果转换为指定的格式，并返回

Streamlit UI

from dotenv import load_dotenv
from bedrock_handler.summary_bedrock_handler import SummaryBedrockHandler
load_dotenv()


import streamlit as st
import yt_dlp
import subprocess
import os
import re
import whisperx_transcribe
import tempfile

def extract_video_id(url):
    regex = r"(?<=v=)[^&#]+|(?<=be/)[^&#]+"
    match = re.search(regex, url)
    return match.group(0) if match else None

def extract_audio(video_path, audio_path):
    command = ['ffmpeg', '-i', video_path, '-vn', '-y', audio_path]
    subprocess.run(command, check=True)

def download(video_id: str) -> str:
    video_id = video_id.strip()
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    return f'audio/{video_id}.m4a'

def process(youtube_url, language):
    video_id = extract_video_id(youtube_url)
    if not video_id:
        st.error("Invalid YouTube URL")
        return

    try:
        progress_text = "Downloading video..."
        progress_value = 0
        progress_bar = st.progress(progress_value, text=progress_text)
        audio_file = download(video_id)
        progress_value = 33
        progress_bar.progress(progress_value, text=progress_text)
    except Exception as e:
        st.error(f"Failed to download video: {e}")
        return

    audio_file_mp3 = 'audio/audio.mp3'
    progress_text = "Converting audio format..."
    progress_value = 66
    progress_bar.progress(progress_value, text=progress_text)
    subprocess.run(['ffmpeg', '-i', audio_file, '-y', audio_file_mp3], check=True)

    progress_text = "Transcribing audio..."
    progress_value = 70
    progress_bar.progress(progress_value, text=progress_text)
    transcription = whisperx_transcribe.transcribe(audio_file_mp3, "large", language=language)
    progress_text = "Transcribing completed..."
    progress_value = 100
    progress_bar.progress(progress_value, text=progress_text)
    st.write(transcription)
    # Remove temporary files
    os.remove(audio_file)
    os.remove(audio_file_mp3)
    return transcription

def main():
    st.title("Audio Transcription")
    if 'transcription' not in st.session_state:
        st.session_state.transcription=""
    tabs = st.tabs(["YouTube Video", "MP3 File"])
    with tabs[0]:
        youtube_url = st.text_input("Enter YouTube URL")
        language = None
        transcribe_button = st.button("Transcribe",key="url")
        summary_button = st.button("Summary",key="summary")

        if transcribe_button:
           
            #transcribe_button = st.button("Transcribe", disabled=True)  # Disable the button
            st.session_state.transcription = process(youtube_url, language)
            #transcribe_button = st.button("Transcribe")  # Enable the button after processing
        if summary_button:
            print(st.session_state.transcription)
            llm = SummaryBedrockHandler(region="us-west-2",content=st.session_state.transcription)
            response_body = llm.invoke()
            st.json(st.session_state.transcription)
            st.write(response_body)

    with tabs[1]:
        mp3_file = st.file_uploader("Upload MP3 File", type=["mp3"])
        language = None
        transcribe_mp3_button = st.button("Transcribe",key="mp3")

        summary_button_mp3 = st.button("Summary",key="summary_mp3")

        if transcribe_mp3_button:
            progress_text = "Processing MP3 file..."
            progress_value = 10
            progress_bar = st.progress(progress_value, text=progress_text)
            # Save the uploaded file to a temporary file
            with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as tmp_file:
                tmp_file.write(mp3_file.getvalue())
                tmp_file_path = tmp_file.name

            transcription = whisperx_transcribe.transcribe(tmp_file_path, "large", language=language)
            st.session_state.transcription=transcription
            st.write(transcription)
            progress_value = 100
            progress_text = "Processing completed..."
            progress_bar.progress(progress_value, text=progress_text)

            # Remove the temporary file
            os.unlink(tmp_file_path)
        if summary_button_mp3:
            llm = SummaryBedrockHandler(region="us-west-2",content=st.session_state.transcription)
            response_body = llm.invoke()
            st.json(st.session_state.transcription)
            st.write(response_body)

if __name__ == "__main__":
    main()

使用 Streamlit 框架构建的网页应用程序，允许用户从 YouTube 视频或本地 MP3 文件中转录音频，并使用 WhisperX 语音识别模型。它还提供了使用 Bedrock LLM 服务对转录文本进行总结的功能。代码的主要功能如下：

extract_video_id 函数使用正则表达式从给定的 YouTube URL 中提取视频 ID
extract_audio 函数使用 ffmpeg 从视频文件中提取音频
download 函数使用 yt_dlp 下载 YouTube 视频，并返回下载的音频文件路径
process 函数处理整个过程，包括下载 YouTube 视频、转换音频格式、使用 WhisperX 转录音频，并显示转录结果
main 函数设置 Streamlit UI，包含两个选项卡：一个用于 YouTube 视频，一个用于本地 MP3 文件

6. 总结

本文介绍了一种基于 AWS 云服务、WhisperX 开源语音识别模型和 Claude 3 大型语言模型的自动语音转录（ASR）方案。该方案为语音数据处理提供了完整的端到端解决方案和参考实现。WhisperX 的语音视频与字幕对齐技术为多媒体内容处理带来了革命性变化。它不仅提高了视频内容的可理解性和编辑效率，还为视频制作、教育和娱乐等领域开创了创新可能。随着技术不断进步，未来的语音视频与字幕对齐技术有望变得更加精准、高效和智能。

*前述特定亚马逊云科技生成式人工智能相关的服务仅在亚马逊云科技海外区域可用，亚马逊云科技中国仅为帮助您了解行业前沿技术和发展海外业务选择推介该服务。

亚马逊AWS官方博客