Intelligent Cloud Resource Operations with Amazon Bedrock Agent: EBS Volume Management as an Example

Business Background

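One customer still runs a large number of Amazon EBS volumes of type gp2. After analyzing the situation together with the customer, we found that converting these Amazon EBS volumes to gp3 would save 20% of the cost.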

Technology Selection

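The customer has more than 200 Amazon EBS volumes, and the largest exceeds 1 TB. Although Amazon Web Services guarantees at the underlying level that an EBS volume type change is transparent to the applications running on top, for safety and stability we advised the customer to proceed in batches: tag each EBS volume by its consumer, owning project, and similar attributes, then convert roughly 10 volumes per day. To reduce the manual effort and to make the whole process easier to monitor and operate, we gave the customer two operational options. Besides converting EBS volume types with bulk tags and an AWS Lambda function, we also provided an intelligent operations experience built on Amazon Bedrock Agent, which lets the customer query and modify the state of the EBS volumes they care about simply by chatting with a large language model.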
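The batching idea above can be sketched in a few lines. This is only an illustration, not the customer's actual tooling: the `plan_batches` helper, the `Project` tag key, and the default batch size of 10 are assumptions made for the example. The input records follow the shape returned by the EC2 `describe_volumes` call.

```python
from collections import defaultdict
from typing import Dict, List


def plan_batches(volumes: List[Dict], tag_key: str = "Project",
                 batch_size: int = 10) -> List[List[str]]:
    """Group gp2 volume IDs by a tag value, then cut them into daily batches.

    `tag_key` is a hypothetical tag name; use whatever key marks ownership.
    """
    by_tag = defaultdict(list)
    for vol in volumes:
        if vol.get("VolumeType") != "gp2":
            continue  # only gp2 volumes need converting
        tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
        by_tag[tags.get(tag_key, "untagged")].append(vol["VolumeId"])

    # Keep volumes of the same project adjacent so that one day's batch
    # touches as few projects as possible.
    ordered = [vid for key in sorted(by_tag) for vid in by_tag[key]]
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

Each inner list is one day's worth of volume IDs, which could then be handed to the conversion step.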

This article focuses on how to configure intelligent EBS volume operations built on Amazon Bedrock Agent.

Solution in Action

How Bedrock Agent Works

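Amazon Bedrock is the generative AI service platform from AWS, designed to simplify and accelerate the development of enterprise AI applications. It supports a range of pre-trained large language models (LLMs), such as Titan and Claude, which users can easily select and integrate. Amazon Bedrock Agent is designed to simplify automated tasks and operations. Through natural language interaction, users can carry out complex tasks such as managing EBS volumes. The Agent breaks a user request into multiple steps and automatically calls APIs to complete each operation. An Agent can also integrate a knowledge base to strengthen its responses and give more accurate, detailed answers. Developers can create and configure an Agent without writing much code, while infrastructure and security are managed automatically, which makes integration with IT tools simpler.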

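Bedrock Agent helps the large language model reason about and work out the steps and methods (tools) needed to resolve a user request, using a reasoning technique called ReAct (reasoning combined with acting). With ReAct, you build structured prompts that show the foundation model how to reason through a task and decide on the actions that help reach a solution. The structured prompt contains a series of examples of the ReAct cycle "Question, Thought, Action, Observation". The Question is the user problem to solve. The Thought is a reasoning step that shows the foundation model how to approach the problem and decide which action to take. The Action is an API that the model may call from a set of permitted tools. The Observation is the result returned from executing that API (Action).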
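To make the Question, Thought, Action, Observation cycle concrete, here is a toy loop. It is a sketch under heavy assumptions: `scripted_model` is a hard-coded stand-in for the LLM, `TOOLS` holds a stub that returns canned data, and none of this reflects the actual prompts or orchestration inside Bedrock Agent.

```python
from typing import Callable, Dict, List, Tuple

# Tools the "model" is allowed to call. The stub returns canned data.
TOOLS: Dict[str, Callable[[str], str]] = {
    "list_volumes": lambda region: "vol-001,vol-002",
}


def scripted_model(question: str,
                   history: List[Tuple[str, str, str]]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, action, action_input)."""
    if not history:
        return ("I need the volume list for the region.", "list_volumes", "us-east-1")
    observation = history[-1][2]
    count = len(observation.split(","))
    return ("I now know the answer.", "finish", f"You have {count} EBS volumes.")


def react(question: str, max_steps: int = 5) -> str:
    """Run the Thought -> Action -> Observation loop until the model finishes."""
    history: List[Tuple[str, str, str]] = []  # (thought, action, observation)
    for _ in range(max_steps):
        thought, action, arg = scripted_model(question, history)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)  # execute the chosen tool
        history.append((thought, action, observation))
    return "Step limit reached without an answer."
```

In a real agent the scripted function is replaced by a model call whose prompt includes the history, and the tool table is replaced by the APIs the agent is permitted to use.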

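This whole process is already packaged inside Bedrock Agent. Users of the Agent only need to define and implement the Actions that the model can choose from and use. A single conversation with the Agent proceeds as shown in the figure below: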

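The user asks a question: "How many EBS volumes do I have in the us-east-1 region?"
The question is sent to the Bedrock Agent, which is powered by Claude 3.
The Bedrock Agent uses the model to analyze the question and consults the OpenAPI specification in the Action Group to find the appropriate API path and parameters.
The Bedrock Agent passes the selected API path and parameters, as defined by the OpenAPI specification, to the Lambda function.
The Lambda function handles the specified API by calling the EBS interfaces through boto3, retrieves the real list of EBS volume IDs in the given region, and returns the result to the Bedrock Agent.
The Bedrock Agent uses the model to rewrite the result as text that is easier for the user to read.
The user receives the final reply: "You have the following EBS volumes: [xxxx,xxxx,xxxx]".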
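From the caller's side, the first and last steps of this flow come down to one request and one streamed reply. The sketch below assumes the Agent is called programmatically through the boto3 `bedrock-agent-runtime` client rather than the console; the helper names `collect_completion` and `ask_agent`, and all identifiers passed in, are placeholders for illustration.

```python
from typing import Iterable


def collect_completion(event_stream: Iterable[dict]) -> str:
    """Join the text chunks of an invoke_agent response stream.

    Events without a "chunk" key (for example trace events) are skipped.
    """
    parts = []
    for event in event_stream:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


def ask_agent(agent_id: str, agent_alias_id: str,
              session_id: str, question: str) -> str:
    """Send one question to a Bedrock Agent and return its full answer."""
    import boto3  # imported lazily so collect_completion works without AWS access

    client = boto3.client("bedrock-agent-runtime")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,   # reuse the same ID to continue a conversation
        inputText=question,
    )
    return collect_completion(response["completion"])
```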

Technical Implementation

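To implement basic EBS operations, we need at least three Actions: list EBS volumes, show the details of a given volume, and change a volume's type. Each of the three must be described in OpenAPI format, including its API name, input parameters, and output format. A first draft of such an OpenAPI schema can be generated with the Claude 3 model and then adjusted as needed. The final content is as follows: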

openapi: 3.0.3
info:
  title: AWS EBS Service API
  description: API for managing AWS EBS volumes
  version: 1.0.0
servers:
  - url: https://api.example.com/v1
paths:
  /volumes:
    get:
      summary: List all EBS volumes in a region
      description: This endpoint retrieves a list of all EBS volume IDs in a specified region.
      parameters:
        - name: region
          in: query
          required: true
          schema:
            type: string
          description: The name of the region to list EBS volumes
      responses:
        '200':
          description: A list of EBS volume IDs
          content:
            application/json:
              schema:
                type: array
                items:
                  type: string
        '400':
          description: Invalid region name

  /volume_id:
    get:
      summary: Get the current status of an EBS volume
      description: This endpoint retrieves the current status information of a specific EBS volume, including volume ID, type, and size.
      parameters:
        - name: region
          in: query
          required: true
          schema:
            type: string
          description: The name of the region
        - name: volumeId
          in: query
          required: true
          schema:
            type: string
          description: The ID of the EBS volume
      responses:
        '200':
          description: The current status of the EBS volume
          content:
            application/json:
              schema:
                type: object
                properties:
                  volumeId:
                    type: string
                  volumeType:
                    type: string
                  volumeSize:
                    type: integer
                  volumeState:
                    type: string
        '400':
          description: Invalid volume ID or region name

  /volume_change_type:
    post:
      summary: Change the type of an EBS volume
      description: This endpoint changes the type of a specified EBS volume. The operation is asynchronous, and the response indicates whether the command was successfully sent.
      parameters:
        - name: region
          in: query
          required: true
          schema:
            type: string
          description: The name of the region
        - name: volumeId
          in: query
          required: true
          schema:
            type: string
          description: The ID of the EBS volume
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                originalType:
                  type: string
                  description: The original type of the EBS volume
                targetType:
                  type: string
                  description: The target type to which the EBS volume should be changed
      responses:
        '202':
          description: Command accepted and being processed asynchronously
        '400':
          description: Invalid volume ID or type information

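When you create an Amazon Bedrock Agent, you need to configure a property called an Action Group. As the name suggests, an Action Group is a group of actions that the model can invoke. Each Action Group must declare the list of APIs it supports in the OpenAPI format shown above (the Action Group Schema) and must be bound to a Lambda function that implements the Actions. You can have the Agent create a Lambda function for you automatically and then build out the functionality you need on top of it, as shown in the figure below.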
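The article configures the Action Group in the console. For readers who prefer scripting, the same binding of schema and Lambda function could be sketched with the boto3 `bedrock-agent` client as below. The helper names, the default group name `ebs-operations`, and the choice to pass the schema inline are assumptions for illustration, not the setup used by the authors.

```python
def action_group_request(agent_id: str, lambda_arn: str, schema_yaml: str,
                         name: str = "ebs-operations") -> dict:
    """Build the request for create_agent_action_group (hypothetical helper)."""
    return {
        "agentId": agent_id,
        "agentVersion": "DRAFT",  # action groups are edited on the working draft
        "actionGroupName": name,
        "actionGroupExecutor": {"lambda": lambda_arn},  # Lambda implementing the Actions
        "apiSchema": {"payload": schema_yaml},          # the OpenAPI schema as inline text
    }


def create_action_group(agent_id: str, lambda_arn: str, schema_yaml: str) -> dict:
    """Create the Action Group on the Agent's draft version."""
    import boto3  # imported lazily so the request builder can be used offline

    client = boto3.client("bedrock-agent")
    return client.create_agent_action_group(
        **action_group_request(agent_id, lambda_arn, schema_yaml)
    )
```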

The Lambda function defined here, ebs-operations-gffka, contains the implementation logic for each API. The code is as follows:

import json
import boto3

def list_volumes(region):
    ec2_client = boto3.client('ec2', region_name=region)
    volumes = ec2_client.describe_volumes()
    volume_ids = [volume['VolumeId'] for volume in volumes['Volumes']]
    return volume_ids

def get_volume_details(region, volume_id):
    ec2_client = boto3.client('ec2', region_name=region)
    response = ec2_client.describe_volumes(VolumeIds=[volume_id])
    volume_details = response['Volumes'][0]
    return volume_details
    
def modify_volume_type(region, volume_id, original_type, target_type):
    ec2_client = boto3.client('ec2', region_name=region)
    
    response = ec2_client.modify_volume(VolumeId=volume_id, VolumeType=target_type)
    task_status = {
        'message': 'Volume modification initiated successfully.',
        'modificationState': response['VolumeModification']['ModificationState'],
        'targetType': target_type,
        'originalType': original_type
    }
    # Return the summary so the handler can pass it back to the Agent
    return task_status


def lambda_handler(event, context):
    agent = event['agent']
    actionGroup = event['actionGroup']
    apiPath = event['apiPath']
    httpMethod =  event['httpMethod']
    parameters = event.get('parameters', [])
    requestBody = event.get('requestBody', {})
    print(event)

    if apiPath == "/volumes" and httpMethod == "GET":
        region = next((param['value'] for param in parameters if param['name'] == 'region'), None)
        if region:
            volumes = list_volumes(region)
            responseBody = {
                "application/json": {
                    "body": volumes
                }
            }
            httpStatusCode = 200
        else:
            responseBody = {
                "application/json": {
                    "body": "Invalid region name"
                }
            }
            httpStatusCode = 400
            
    elif apiPath == "/volume_id" and httpMethod == "GET":
        region = next((param['value'] for param in parameters if param['name'] == 'region'), None)
        volume_id = next((param['value'] for param in parameters if param['name'] == 'volumeId'), None)
        if region is not None and volume_id is not None:
            volume_details = get_volume_details(region, volume_id)
            print(volume_details)

            volume_info = {
                "volumeId": volume_details["VolumeId"],
                "volumeType": volume_details["VolumeType"],
                "volumeSize": volume_details["Size"],
                "volumeState": volume_details["State"]
            }
            responseBody = {
                "application/json": {
                    "body": volume_info
                }
            }
            httpStatusCode = 200
        else:
            responseBody = {
                "application/json": {
                    "body": "Invalid region name or volume id"
                }
            }
            httpStatusCode = 400
    
    elif apiPath == "/volume_change_type" and httpMethod == "POST":
        region = next((param['value'] for param in parameters if param['name'] == 'region'), None)
        volume_id = next((param['value'] for param in parameters if param['name'] == 'volumeId'), None)
        properties = requestBody['content']['application/json']['properties']
        # Initialize variables to hold the values
        original_type = None
        target_type = None
        
        # Loop through the properties to find originalType and targetType
        for prop in properties:
            if prop['name'] == 'originalType':
                original_type = prop['value']
            elif prop['name'] == 'targetType':
                target_type = prop['value']

        if region is not None and volume_id is not None:
            task_details = modify_volume_type(region, volume_id, original_type, target_type)
            print(task_details)
            responseBody = {
                "application/json": {
                    "body": task_details
                }
            }
            httpStatusCode = 200
        else:
            responseBody = {
                "application/json": {
                    "body": "Invalid region name or volume id"
                }
            }
            httpStatusCode = 400
    
    else:
        responseBody = {
            "application/json": {
                "body": "Invalid API path or HTTP method"
            }
        }
        httpStatusCode = 400
   

    action_response = {
        'actionGroup': actionGroup,
        'apiPath': apiPath,
        'httpMethod': httpMethod,
        'httpStatusCode': httpStatusCode,
        'responseBody': responseBody

    }

    dummy_api_response = {'response': action_response, 'messageVersion': event['messageVersion']}
    print("Response: {}".format(dummy_api_response))

    return dummy_api_response

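As the code shows, for each API defined in the Action Group Schema, the Lambda function reads apiPath and httpMethod from the incoming event to decide which branch to run. The parameters and requestBody fields of the event carry the arguments and request body of the API call, and with this information the Lambda function can implement the actual behavior.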
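The repeated lookups in the handler can be factored into two small helpers. This is an optional sketch: `get_parameter` and `get_body_properties` are names introduced here, and the sample event in the usage lines contains made-up values that only mirror the fields the handler above already reads.

```python
from typing import Any, Dict, Optional


def get_parameter(event: Dict[str, Any], name: str,
                  default: Optional[str] = None) -> Optional[str]:
    """Return the value of a named entry in event['parameters'], if present."""
    for param in event.get("parameters", []):
        if param["name"] == name:
            return param["value"]
    return default


def get_body_properties(event: Dict[str, Any],
                        content_type: str = "application/json") -> Dict[str, str]:
    """Flatten requestBody properties into a {name: value} dictionary."""
    content = event.get("requestBody", {}).get("content", {})
    props = content.get(content_type, {}).get("properties", [])
    return {prop["name"]: prop["value"] for prop in props}


# Usage with an illustrative event (all values are invented for the example)
sample_event = {
    "apiPath": "/volume_change_type",
    "httpMethod": "POST",
    "parameters": [
        {"name": "region", "type": "string", "value": "us-east-1"},
        {"name": "volumeId", "type": "string", "value": "vol-0123456789abcdef0"},
    ],
    "requestBody": {"content": {"application/json": {"properties": [
        {"name": "originalType", "type": "string", "value": "gp2"},
        {"name": "targetType", "type": "string", "value": "gp3"},
    ]}}},
}
region = get_parameter(sample_event, "region")
body = get_body_properties(sample_event)
```

Using `.get` with defaults also means a request without a body yields an empty dictionary instead of raising a `KeyError`.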

To let the Lambda function operate on EBS, the execution role used by the Lambda function must also be granted the corresponding permissions.
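The exact policy the authors attached is not reproduced here, so the following is an assumed minimal policy derived from the two EC2 calls the Lambda code makes, `describe_volumes` and `modify_volume`. The function name, the statement IDs, and the wildcard resource are illustrative choices; tighten the resource scope where your environment allows it.

```python
import json


def ebs_operations_policy() -> dict:
    """Build an IAM policy document covering the EC2 calls used by the Lambda code."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadVolumes",  # used by list_volumes and get_volume_details
                "Effect": "Allow",
                "Action": ["ec2:DescribeVolumes"],
                "Resource": "*",
            },
            {
                "Sid": "ChangeVolumeType",  # used by modify_volume_type
                "Effect": "Allow",
                "Action": ["ec2:ModifyVolume"],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Print the JSON document so it can be pasted into the role's inline policy
    print(json.dumps(ebs_operations_policy(), indent=2))
```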

Also note that the default execution timeout of a Lambda function is 3 seconds. Raise it to a larger value that suits your situation, for example 3 minutes.
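The timeout can be changed in the console, or scripted as in this sketch. The `set_timeout` helper and its optional `client` argument are assumptions added for illustration; the underlying boto3 call is `update_function_configuration`, and Lambda accepts timeouts from 1 to 900 seconds.

```python
def set_timeout(function_name: str, seconds: int = 180, client=None) -> dict:
    """Set a Lambda function's timeout; 180 seconds matches the 3-minute example."""
    if not 1 <= seconds <= 900:
        raise ValueError("Lambda timeout must be between 1 and 900 seconds")
    if client is None:
        import boto3  # imported lazily so the function can be exercised with a stub

        client = boto3.client("lambda")
    return client.update_function_configuration(
        FunctionName=function_name,
        Timeout=seconds,
    )
```

For the function in this article, the call would be `set_timeout("ebs-operations-gffka")`.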

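Once the Action Group OpenAPI schema and the corresponding Lambda implementation are defined, you still need to choose the model the Agent uses and give it a prompt so that it better understands its role. Here we chose the Claude 3 Sonnet model with the following prompt: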

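"You are an AWS operations expert. When AWS users ask questions about the resources in their own account, you offer your own insights. However, if an AWS user asks about the number, status, or similar details of resources in their account, you call the action group to retrieve the actual information. If an AWS user wants you to create, delete, or modify resources, you first ask the user to confirm, and only after receiving confirmation do you call the relevant action group to carry it out. Whenever your output includes information about AWS resources, you organize that content in JSON format before outputting it."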

With the Agent fully configured, you can start testing. Click the Test button shown in the figure below to begin.

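By inspecting the Trace information shown during the conversation with the Agent, you can see how the model "thinks", how it assembles the parameters for the calls, and how it uses the Lambda function's return value to compose the message returned to the end user.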

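While designing this Agent, we also compared it with the capabilities of Amazon Q. Currently, in the Amazon Q console interface, if you ask for the list of EBS volumes in a region, you also get real information.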

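However, if you want Q to go further and look up a volume's details or perform an operational action, Q currently shows suggested operations and commands but does not actually execute them.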

Future Extensions

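As the technology advances, the outlook for Amazon Bedrock Agent is exciting. Agent development will become simpler as automation and no-code/low-code tools spread, letting enterprises and developers create and configure complex operational workflows with ease. In addition, Amazon Bedrock Agent supports function calling with advanced models such as Claude 3, so it can dynamically integrate and use a wide range of APIs and external services. This flexibility greatly widens the range of use cases, from cloud operations to customer service, data analysis, and process automation, and many industries stand to benefit. Agent performance will also keep improving, with efficient task decomposition and parallel processing enabling fast responses to user requests. As more knowledge bases and tools are integrated, Amazon Bedrock is well placed to become a core driver of intelligent automation.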

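*The specific Amazon Web Services generative AI services mentioned above are available only in Amazon Web Services regions outside China. Amazon Web Services China promotes these services solely to help you understand cutting-edge industry technology and to support the development of your overseas business.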