LLM Inference on Amazon EKS

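1. Background

A large language model (LLM) is an AI model trained with deep learning that can understand and generate natural language. As compute and training data have grown in recent years, LLMs have made striking progress in natural language processing and now have broad application prospects.

In the enterprise, LLMs can be applied to intelligent Q&A, text summarization, content creation, code generation, and more, opening new ways to improve productivity and customer experience. More and more enterprises are exploring how to bring LLMs into their business environments. However, deploying and running LLMs in their own environments poses several challenges:

Deployment complexity – LLMs are usually very large and need substantial compute, which makes deployment and operations difficult.
Limited scalability – It is hard to scale the compute behind an LLM inference service flexibly as the business grows.
Lack of observability – Without monitoring and operational tooling for the LLM service, service quality is hard to guarantee.
High storage management cost – Model files are often very large, and storing and managing them in a distributed fashion is expensive.

How to deploy LLMs smoothly and run them efficiently in an enterprise's own environment, while meeting business needs, is a problem enterprises urgently need to solve.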

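2. Overall Architecture

To address these challenges, we propose a solution built on AWS cloud-native services. It aims to give enterprises a production-grade LLM inference environment with good scalability, observability, and storage management. The design follows cloud-native principles and combines AWS managed services with open-source tools to form a reliable, scalable, manageable, and observable platform for deploying and running LLMs. The architecture diagram is shown below:

The architecture can be viewed in four layers: infrastructure, service mesh, application, and observability. The infrastructure layer provides cloud-native resource management; the service mesh layer handles traffic control and deployment strategies; the application layer contains the core LLM inference functions; and the observability layer keeps the whole platform visible and maintainable. The components and roles of each layer are as follows: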

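Infrastructure layer

AWS EKS hosts the Kubernetes cluster
AWS Elastic Load Balancer provides application-layer load balancing
AWS EFS provides unified persistent storage for LLM model data
AWS Karpenter scales compute resources elastically

Service mesh layer

Kong acts as the API gateway and provides traffic control and basic authentication
Istio Service Mesh supports gray/canary releases and similar strategies

Application layer

A self-developed application gateway layer handles request forwarding and adaptation
Open-source projects such as Text Generation WebUI, vLLM, and Text Generation Inference serve as the LLM inference engines (compute units)

Observability layer

Prometheus/Grafana/Loki provide metrics and log monitoring
KubeSphere and KubeCost provide cluster management and cost control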

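With this solution, users only need to focus on choosing and applying LLM models. The underlying infrastructure and operations are automated by the solution, which greatly reduces the operational complexity of an LLM service. The solution also scales horizontally, so LLM inference resources can be scaled out or in as the business requires while maintaining performance and availability. Its observability features help guarantee service quality, and unified LLM model storage makes storage management simpler and more efficient.

The deployment code and configuration are published in this GitHub repository: https://github.com/GlockGao/llm-on-eks.git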

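3. Innovations

As an early practice of bringing large language models into enterprise scenarios, our solution innovates at several technical levels to deliver high performance, high reliability, and a good operational experience for LLM services:

1. Compute units support multiple open-source frameworks, including Text Generation WebUI, vLLM, and Text Generation Inference, and the open-source LLM framework Text Generation WebUI has been modified to run on Kubernetes

The original Text Generation WebUI project is designed for a single machine and does not support deployment in a Kubernetes cluster. We customized it extensively so that it deploys and runs cleanly on Kubernetes. The main changes are:

Packaged it as a Docker container for stateless deployment
Persisted model data to external storage (AWS EFS), decoupling it from the host
Optimized configuration loading and resource requests to fit the Kubernetes scheduler
Changed log output to integrate with Kubernetes log collection components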
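
As an illustration of the last item, the sketch below shows one common way a Python service can make its logs easy for a Kubernetes log collector (such as the Promtail/Loki stack installed later) to pick up: write one JSON object per line to stdout. The source does not describe the actual log format used, so the `JsonFormatter` class, the `build_logger` helper, and the field names are assumptions made for this example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or to `stream`)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the root logger from printing a second copy
    return logger


if __name__ == "__main__":
    build_logger("text-generation-webui").info("model loaded")
```

Writing to stdout rather than to files inside the container lets the node-level collector read the container log stream without any extra volume or sidecar.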

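2. Support for accelerating LLM inference with AWS Neuron chips

Large language models are computationally expensive and demand a great deal of compute. AWS Neuron chips (Inferentia & Trainium) can substantially reduce inference latency and cost. However, the original Text Generation WebUI cannot use these accelerators directly. We made the following key changes:

Modified the model loading module and added a new loader that supports the Neuron SDK
Refactored the inference logic to run inference through the Neuron Python API
Optimized the device memory usage strategy to make full use of the Neuron chips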

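3. A unified LLM application gateway layer

To make the LLM service highly available and scalable, we developed an application gateway layer that adapts client requests to the LLM inference engines. The gateway is written in Go on the Fiber framework. It acts as a reverse proxy: it receives client requests, converts protocols, and distributes load. Its main innovations are:

Built-in service discovery that automatically finds available LLM inference instances
Request-level load balancing and failover
Common gateway features such as rate limiting and authentication
Unified metrics and log output, integrated with the monitoring system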
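
The request-level load balancing and failover idea can be sketched as follows. The real gateway is written in Go with Fiber and its internals are not shown in this article, so this Python version is only an illustration: the `RoundRobinBalancer` class, the `send` callback, and the choice of round-robin with "try each backend once" failover are all assumptions, not the gateway's actual algorithm.

```python
from typing import Callable, Iterable, Optional


class NoHealthyBackend(RuntimeError):
    """Raised when every backend failed for a single request."""


class RoundRobinBalancer:
    """Pick backends in round-robin order and fail over on connection errors."""

    def __init__(self, backends: Iterable[str]):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._next = 0  # index of the backend that gets the next request first

    def dispatch(self, send: Callable[[str], str]) -> str:
        """Call `send(backend)`, trying each backend at most once per request."""
        count = len(self._backends)
        start = self._next
        self._next = (self._next + 1) % count
        last_error: Optional[Exception] = None
        for offset in range(count):
            backend = self._backends[(start + offset) % count]
            try:
                return send(backend)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and try the next one
        raise NoHealthyBackend("all backends failed") from last_error
```

In a real deployment the backend list would come from service discovery (for example, the Pod endpoints behind a Kubernetes Service) instead of a fixed list, and unhealthy backends would typically be skipped for a cool-down period rather than retried on every request.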

4. Implementation Steps

4.0 Prerequisites

Install eksctl

ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
eksctl version

Install kubectl

# Download version 1.28
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.28.5/2024-01-04/bin/linux/amd64/kubectl

# Download the checksum file
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.28.5/2024-01-04/bin/linux/amd64/kubectl.sha256

# Verify the download
sha256sum -c kubectl.sha256

# Make the binary executable
chmod +x ./kubectl

mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH

# Persist the PATH change
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc

# Check the version
kubectl version --client

Install helm

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version

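Install the AWS CLI and configure AWS credentials
Create the EKS cluster

(1) Cluster configuration file eks-cluster.yaml. The configuration below uses us-west-2 and Kubernetes 1.28 as an example; adjust it as needed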

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: eks-cluster-prod
  region: us-west-2
  version: "1.28"

kubernetesNetworkConfig:
  ipFamily: IPv4

managedNodeGroups:
- name: managed-node
  labels:
    role: co-worker
  instanceType: c6i.large
  minSize: 1
  desiredCapacity: 1
  maxSize: 3

(2) Command to create the cluster

eksctl create cluster -f eks-cluster.yaml

4.1 Prepare the Base Environment

4.1.1 Install the AWS Load Balancer Controller

Create the OIDC provider (set CLUSTER_NAME first)

export cluster_name=${CLUSTER_NAME}
eksctl utils associate-iam-oidc-provider --cluster $cluster_name --approve

Download and create the IAM policy

# Download the IAM policy
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.0/docs/install/iam_policy.json

# Create the IAM policy
aws iam create-policy \
    --policy-name AWSLoadBalancerControllerIAMPolicy \
    --policy-document file://iam-policy.json

Create the service account (set CLUSTER_NAME, REGION, and ACCOUNT_ID first)

export cluster_name=${CLUSTER_NAME}
export region=${REGION}
export AWS_ACCOUNT_ID=${ACCOUNT_ID}

eksctl create iamserviceaccount \
  --cluster=$cluster_name \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::$AWS_ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
  --override-existing-serviceaccounts \
  --region $region \
  --approve

Add and update the helm repo

helm repo add eks https://aws.github.io/eks-charts
helm repo update eks

Install the AWS Load Balancer Controller

helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=$cluster_name \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

4.1.2 Install the EFS Driver

Download and create the IAM policy

curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/master/docs/iam-policy-example.json
aws iam create-policy \
    --policy-name EKS_EFS_CSI_Driver_Policy \
    --policy-document file://iam-policy-example.json

Create the service account (set CLUSTER_NAME first)

export cluster_name=${CLUSTER_NAME}
export role_name=AmazonEKS_EFS_CSI_DriverRole
eksctl create iamserviceaccount \
    --name efs-csi-controller-sa \
    --namespace kube-system \
    --cluster $cluster_name \
    --role-name $role_name \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy \
    --approve
TRUST_POLICY=$(aws iam get-role --role-name $role_name --query 'Role.AssumeRolePolicyDocument' | \
    sed -e 's/efs-csi-controller-sa/efs-csi-*/' -e 's/StringEquals/StringLike/')
aws iam update-assume-role-policy --role-name $role_name --policy-document "$TRUST_POLICY"
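
The `sed` step above rewrites the role's trust policy so that any service account whose name starts with `efs-csi-` can assume the role, instead of only `efs-csi-controller-sa`. The same transformation is sketched in Python below to make its effect visible. The function name `widen_trust_policy` is hypothetical, and the policy used in the usage note is a made-up minimal document, not the output of a real `aws iam get-role` call.

```python
import json


def widen_trust_policy(
    policy: dict,
    old_sa: str = "efs-csi-controller-sa",
    new_sa: str = "efs-csi-*",
) -> dict:
    """Return a copy of `policy` with the service-account match widened.

    Mirrors the two sed substitutions: the exact service account name becomes
    a wildcard pattern, and StringEquals becomes StringLike so that the
    wildcard is actually evaluated as a pattern.
    """
    text = json.dumps(policy)
    text = text.replace(old_sa, new_sa).replace("StringEquals", "StringLike")
    return json.loads(text)  # a new object; the input is left untouched
```

For example, a condition of `{"StringEquals": {"<oidc-provider>:sub": "system:serviceaccount:kube-system:efs-csi-controller-sa"}}` becomes `{"StringLike": {"<oidc-provider>:sub": "system:serviceaccount:kube-system:efs-csi-*"}}`. Unlike `sed`, which replaces the first match on each line, `str.replace` replaces every occurrence; for a pretty-printed trust policy with one match per line the result is the same.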

Add and update the helm repo

# 1. Add the helm repo
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/

# 2. Update the helm repo
helm repo update aws-efs-csi-driver

Install the EFS Driver

helm upgrade --install aws-efs-csi-driver --namespace kube-system aws-efs-csi-driver/aws-efs-csi-driver \
  --set controller.serviceAccount.create=false \
  --set controller.serviceAccount.name=efs-csi-controller-sa

Install the EFS StorageClass

(1) Configuration file efs-sc.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com

(2) Deploy command
```
kubectl apply -f efs-sc.yaml
```

4.1.3 Install the EBS Driver

Create the service account (set CLUSTER_NAME and ACCOUNT_ID first)

export cluster_name=${CLUSTER_NAME}
export AWS_ACCOUNT_ID=${ACCOUNT_ID}

eksctl create iamserviceaccount \
    --name ebs-csi-controller-sa \
    --namespace kube-system \
    --cluster $cluster_name \
    --role-name AmazonEKS_EBS_CSI_DriverRole \
    --role-only \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
    --approve

Install the EBS Driver

eksctl create addon --name aws-ebs-csi-driver --cluster $cluster_name --service-account-role-arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole  --force

Install the EBS StorageClass

(1) Configuration file ebs-sc.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

(2) Install command
```
kubectl apply -f ebs-sc.yaml
```

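4.1.4 Install Karpenter

Karpenter handles cluster autoscaling (CA), that is, node scaling, for the EKS cluster
There are two ways to install Karpenter: a direct install (for a brand-new cluster) and a migration (for an existing cluster)
This article uses the second approach; for the detailed steps, see the official guide: https://karpenter.sh/docs/getting-started/migrating-from-cas/

After Karpenter is installed, create a NodePool to tell it how to scale the cluster. An example follows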

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: "KarpenterNodeRole-eks-cluster-prod" # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-cluster-prod # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-cluster-prod # replace with your cluster name
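
To make the effect of the NodePool requirements more concrete, the sketch below checks instance type names against two of them: `instance-category In ["r"]` and `instance-generation Gt "5"`. Karpenter itself evaluates these requirements against labels it derives for each instance type; the name parsing here (`parse_instance_type`, `matches_nodepool`) is a simplified, hypothetical stand-in for illustration, and it ignores the architecture, OS, and capacity-type requirements.

```python
import re
from typing import Tuple


def parse_instance_type(instance_type: str) -> Tuple[str, int]:
    """Split a name such as 'r6i.large' into (category, generation).

    Simplified assumption: the leading letters are the category and the
    first run of digits is the generation.
    """
    match = re.match(r"([a-z]+)(\d+)", instance_type)
    if match is None:
        raise ValueError(f"unrecognized instance type: {instance_type!r}")
    return match.group(1), int(match.group(2))


def matches_nodepool(
    instance_type: str,
    categories: Tuple[str, ...] = ("r",),
    generation_greater_than: int = 5,
) -> bool:
    """Apply the 'In' (category) and 'Gt' (generation) requirements."""
    category, generation = parse_instance_type(instance_type)
    return category in categories and generation > generation_greater_than


if __name__ == "__main__":
    for name in ("r6i.large", "r5.xlarge", "c6i.large"):
        print(name, matches_nodepool(name))
```

Under these two rules, `r6i.large` qualifies, while `r5.xlarge` (generation not greater than 5) and `c6i.large` (wrong category) do not.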

4.2 Prepare the Control Plane Environment

4.2.1 Install KubeSphere

Download the installation files

wget https://github.com/kubesphere/ks-installer/releases/download/v3.4.1/kubesphere-installer.yaml
wget https://github.com/kubesphere/ks-installer/releases/download/v3.4.1/cluster-configuration.yaml

Install commands

kubectl apply -f kubesphere-installer.yaml
kubectl apply -f cluster-configuration.yaml

Monitor the installation progress

kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l 'app in (ks-install, ks-installer)' -o jsonpath='{.items[0].metadata.name}') -f

Check the installation result
```
kubectl get svc -n kubesphere-system
```

Open the UI: it is served by the ks-console Service

4.2.2 Install KubeCost

Install command

helm upgrade -i kubecost oci://public.ecr.aws/kubecost/cost-analyzer --version 2.0.2 \
    --namespace kubecost --create-namespace \
    -f https://raw.githubusercontent.com/kubecost/cost-analyzer-helm-chart/develop/cost-analyzer/values-eks-cost-monitoring.yaml

Check the installation result
```
kubectl get svc -n kubecost
```

Open the UI: it is served by the kubecost-cost-analyzer Service

4.2.3 Install Prometheus and Grafana

Add and update the helm repo

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Install the Prometheus stack

# 1. Create the namespace: monitoring
kubectl create ns monitoring

# 2. Install the Prometheus stack
helm install eks-kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring

Check the installation
```
kubectl get svc -n monitoring
```

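Open the UI pages

- (1) Prometheus
- (2) Grafana

4.2.4 Install Loki

The Loki architecture is shown below. Loki can be thought of as Prometheus for logs: it stores logs rather than metrics. It is lightweight, fits Kubernetes well, and can keep its data in AWS S3, which separates storage from compute

Add and update the helm repo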

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Create the configuration file loki-custom-values.yaml. The example below can be adjusted to your environment

loki:
  auth_enabled: false
  storage:
    type: "s3"
    s3:
      region: "us-west-2"
      accessKeyId: "xxx"
      secretAccessKey: "xxx"
    bucketNames:
      chunks: "loki-chunks"
      ruler: "loki-ruler"
      admin: "loki-admin"
  commonConfig:
    replication_factor: 1

read:
  persistence:
    storageClass: ebs-sc
  replicas: 1

write:
  persistence:
    storageClass: ebs-sc
  replicas: 1

backend:
  persistence:
    storageClass: ebs-sc
  replicas: 1

gateway:
  enabled: true
  basicAuth:
    enabled: false

Install Loki; version 5.42.2 is used here

helm upgrade --install loki --values loki-custom-values.yaml --namespace loki --create-namespace grafana/loki --version 5.42.2

Install Promtail

helm install promtail --namespace loki grafana/promtail

4.3 Prepare the Data Plane Environment

4.3.1 Install Kong

Install the Gateway and GatewayClass

(1) Configuration file – gateway.yaml

---
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: kong
  annotations:
    konghq.com/gatewayclass-unmanaged: 'true'

spec:
  controllerName: konghq.com/kic-gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kong
spec:
  gatewayClassName: kong
  listeners:
  - name: proxy
    port: 80
    protocol: HTTP

(2) Install commands

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
kubectl apply -f gateway.yaml

Add and update the helm repo

helm repo add kong https://charts.konghq.com
helm repo update

Install Kong

helm install kong kong/ingress -n kong --create-namespace

4.3.2 Deploy the Service Mesh

Add and update the helm repo

helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

Create the namespace
```
kubectl create namespace istio-system
```

Install istio-base

helm install istio-base istio/base -n istio-system --set defaultRevision=default

Install Istio discovery (istiod)

helm install istiod istio/istiod -n istio-system --wait

Install the ingress gateway

(1) Configuration file ingressgateway.yaml

service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-attributes: "load_balancing.cross_zone.enabled=true"

(2) Install command

helm install istio-ingressgateway istio/gateway -n istio-system -f ingressgateway.yaml

Install the Istio add-ons

for ADDON in kiali jaeger prometheus grafana
do
    ADDON_URL="https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/$ADDON.yaml"
    kubectl apply -f $ADDON_URL
done

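Use Kiali to view the deployed applications; the Services deployed here are for demonstration only

4.3.3 Deploy the Application Gateway

The application gateway is written in Go on the Fiber framework. Project GitHub repository (to be integrated into aws-examples and open-sourced later): https://github.com/GlockGao/go-fiber-gateway
The gateway responds to client requests, adapts to the LLM inference stack formed by the backend compute units (for example, Text Generation WebUI + Nvidia GPU), and records API call logs and metrics, which in turn drive automatic scaling of the compute units
After downloading the code, build the image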
```
./scripts/build_and_push.sh
```
Deploy the Service; change the image to match your environment

(1) Configuration file service-text-generation-webui-proxy.yaml

---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-webui-proxy # Service name
  labels:
    app: text-generation-webui-proxy    # Labels on the Service itself
spec:
  ports:
  - port: 3000  # Port used to reach the Service from inside the cluster
    protocol: TCP
    targetPort: 3000  # Port the target Pod listens on
    name: http
  selector:
    app: text-generation-webui-proxy
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: text-generation-webui-proxy
  name: text-generation-webui-proxy
  namespace: default
spec:
  replicas: 1
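  revisionHistoryLimit: 10  # Number of old revisions kept after rolling updates
  selector: # Selects the Pods (and ReplicaSets) managed by this Deployment
    matchLabels:
      app: text-generation-webui-proxy
  strategy: # Update strategy
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate # Update type: rolling update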
  template:
    metadata:
      labels:
        app: text-generation-webui-proxy
    spec:
      containers:
      - image: xxxx
        imagePullPolicy: IfNotPresent
        name: text-generation-webui-proxy
      restartPolicy: Always
      terminationGracePeriodSeconds: 30

(2) Deploy command

kubectl apply -f service-text-generation-webui-proxy.yaml

View the deployed Service

kubectl get all -l app=text-generation-webui-proxy

Configure a Kong route so that requests received by Kong are forwarded to the application gateway, which exposes the service

(1) Configuration file kong-route-ingress.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: text-generation-webui-proxy
  annotations:
    konghq.com/strip-path: 'true'
spec:
  ingressClassName: kong
  rules:
  - http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            name: text-generation-webui-proxy
            port:
              number: 3000

(2) Command

kubectl apply -f kong-route-ingress.yaml

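4.3.4 Deploy Text Generation WebUI

Text Generation WebUI is a popular open-source project that aims to be the SD-WebUI of the LLM world; some customers already use it as their LLM inference framework
The original Text Generation WebUI supports neither Kubernetes deployment nor AWS Neuron chips. This solution addresses both, so that Text Generation WebUI can be deployed on Kubernetes and, for a subset of supported models, can use Neuron chips as the inference engine
Neuron chip support; the core changes are:

(1) modules/models.py: add neuron_loader() to load models on Neuron chips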

def neuron_loader(model_name):
    from transformers_neuronx.llama.model import LlamaForSampling

    path_to_model = Path(f'{shared.args.model_dir}/{model_name}/model')
    path_to_neuron = Path(f'{shared.args.model_dir}/{model_name}/neuron_artifacts')
    path_to_tokenizer = Path(f'{shared.args.model_dir}/{model_name}/tokenizer')

    model = LlamaForSampling.from_pretrained(path_to_model, batch_size=1, tp_degree=12, amp='f16')
    model.load(path_to_neuron) # Load the compiled Neuron artifacts
    model.to_neuron() # will skip compile

    tokenizer = AutoTokenizer.from_pretrained(path_to_tokenizer)

    return model, tokenizer
  

def load_model(model_name, loader=None):
    logger.info(f"Loading {model_name}")
    t0 = time.time()

    shared.is_seq2seq = False
    shared.model_name = model_name
    load_func_map = {
        'Transformers': huggingface_loader,
        'AutoGPTQ': AutoGPTQ_loader,
        'GPTQ-for-LLaMa': GPTQ_loader,
        'llama.cpp': llamacpp_loader,
        'llamacpp_HF': llamacpp_HF_loader,
        'RWKV': RWKV_loader,
        'ExLlama': ExLlama_loader,
        'ExLlama_HF': ExLlama_HF_loader,
        'ExLlamav2': ExLlamav2_loader,
        'ExLlamav2_HF': ExLlamav2_HF_loader,
        'ctransformers': ctransformers_loader,
        'AutoAWQ': AutoAWQ_loader,
        'QuIP#': QuipSharp_loader,
        'HQQ': HQQ_loader,
        'Neuron': neuron_loader
    }

(2) modules/models_settings.py: change the model settings so that neuron_loader is selected for model loading

def infer_loader(model_name, model_settings):
    path_to_model = Path(f'{shared.args.model_dir}/{model_name}')
    if not path_to_model.exists():
        loader = None
    elif (path_to_model / 'quantize_config.json').exists() or ('wbits' in model_settings and type(model_settings['wbits']) is int and model_settings['wbits'] > 0):
        loader = 'ExLlama_HF'
    elif (path_to_model / 'quant_config.json').exists() or re.match(r'.*-awq', model_name.lower()):
        loader = 'AutoAWQ'
    elif len(list(path_to_model.glob('*.gguf'))) > 0:
        loader = 'llama.cpp'
    elif re.match(r'.*\.gguf', model_name.lower()):
        loader = 'llama.cpp'
    elif re.match(r'.*rwkv.*\.pth', model_name.lower()):
        loader = 'RWKV'
    elif re.match(r'.*exl2', model_name.lower()):
        loader = 'ExLlamav2_HF'
    elif re.match(r'.*-hqq', model_name.lower()):
        loader = 'HQQ'
    elif re.match(r'.*-neuron', model_name.lower()):
        loader = 'Neuron'
    else:
        loader = 'Transformers'

    return loader
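
The name-based part of this selection can be exercised on its own, without a model directory on disk. The sketch below reproduces only the regular-expression rules from infer_loader() in the same order; the filesystem checks (quantize_config.json, quant_config.json, *.gguf files) are deliberately left out, and the `NAME_RULES` table and `infer_loader_from_name` function are names invented for this example.

```python
import re

# (pattern, loader) pairs, checked in order; the first match wins.
NAME_RULES = [
    (r".*-awq", "AutoAWQ"),
    (r".*\.gguf", "llama.cpp"),
    (r".*rwkv.*\.pth", "RWKV"),
    (r".*exl2", "ExLlamav2_HF"),
    (r".*-hqq", "HQQ"),
    (r".*-neuron", "Neuron"),
]


def infer_loader_from_name(model_name: str) -> str:
    """Choose a loader from the model name alone (no filesystem access)."""
    lowered = model_name.lower()
    for pattern, loader in NAME_RULES:
        # re.match anchors at the start only, so '-neuron' may appear
        # anywhere in the name, not just at the end.
        if re.match(pattern, lowered):
            return loader
    return "Transformers"


if __name__ == "__main__":
    print(infer_loader_from_name("Llama-2-7b-chat-neuron"))
    print(infer_loader_from_name("TheBloke_Llama-2-7B-Chat-AWQ"))
```

With these rules, a model directory whose name contains `-neuron` is routed to the Neuron loader, which is the naming convention the modification relies on.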

(3) modules/text_generation.py: modify encode() so that it does not move the inputs to a CUDA device

def encode(prompt, add_special_tokens=True, add_bos_token=True, truncation_length=None):
    ...
    return input_ids

(4) modules/text_generation.py: modify generate_reply_HF() to run inference through the Neuron SDK

def generate_with_callback(callback=None, *args, **kwargs):
    # Map the Hugging Face style generation arguments to Neuron sample() arguments
    neuron_kwargs = dict()
    neuron_kwargs['input_ids'] = kwargs['inputs']
    neuron_kwargs['top_k'] = kwargs['top_k']
    neuron_kwargs['top_p'] = kwargs['top_p']
    neuron_kwargs['temperature'] = kwargs['temperature']
    neuron_kwargs['eos_token_override'] = kwargs['eos_token_id']
    neuron_kwargs['sequence_length'] = kwargs['max_new_tokens']

    kwargs['stopping_criteria'].append(Stream(callback_func=callback))

    clear_torch_cache()
    with torch.inference_mode():
        shared.model.sample(**neuron_kwargs)
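
The argument mapping in that function can be isolated as a small pure function, which makes it easy to unit test without Neuron hardware. This is a sketch only: `to_neuron_sample_kwargs` is a hypothetical helper name, and the key names are taken from the listing in this article rather than verified against a specific Neuron SDK release.

```python
from typing import Any, Dict

# Hugging Face style key -> key expected by the Neuron sample() call,
# as used in the modified generate_with_callback().
KEY_MAP = {
    "inputs": "input_ids",
    "top_k": "top_k",
    "top_p": "top_p",
    "temperature": "temperature",
    "eos_token_id": "eos_token_override",
    "max_new_tokens": "sequence_length",
}


def to_neuron_sample_kwargs(hf_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Translate generation kwargs, dropping anything Neuron does not take.

    Raises KeyError if a required Hugging Face key is missing.
    """
    return {neuron_key: hf_kwargs[hf_key] for hf_key, neuron_key in KEY_MAP.items()}


if __name__ == "__main__":
    example = {
        "inputs": [[1, 2, 3]],
        "top_k": 50,
        "top_p": 0.9,
        "temperature": 0.7,
        "eos_token_id": 2,
        "max_new_tokens": 128,
        "stopping_criteria": [],  # not forwarded to the Neuron call
    }
    print(to_neuron_sample_kwargs(example))
```

Note that the listing passes max_new_tokens straight through as sequence_length; whether the prompt length needs to be added depends on how the Neuron SDK version in use interprets that parameter.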

K8S support

(1) Create the Dockerfile

# BUILDER
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 as builder
WORKDIR /builder
ARG TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST:-3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX}"
ARG BUILD_EXTENSIONS="${BUILD_EXTENSIONS:-}"
ARG APP_UID="${APP_UID:-6972}"
ARG APP_GID="${APP_GID:-6972}"

RUN apt update && \
    apt install --no-install-recommends -y git vim build-essential python3-dev pip bash curl net-tools && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /home/app/
# RUN git clone https://github.com/oobabooga/text-generation-webui.git
COPY . /home/app/text-generation-webui
WORKDIR /home/app/text-generation-webui
RUN GPU_CHOICE=A USE_CUDA118=FALSE LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=TRUE ./start_linux.sh --verbose
# COPY CMD_FLAGS.txt /home/app/text-generation-webui/
EXPOSE ${CONTAINER_PORT:-7860} ${CONTAINER_API_PORT:-5000} ${CONTAINER_API_STREAM_PORT:-5005}
WORKDIR /home/app/text-generation-webui
# set umask to ensure group read / write at runtime
CMD umask 0002 && export HOME=/home/app/text-generation-webui && ./start_linux.sh

(2) Build and push the image – bash script

#!/usr/bin/env bash

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
# The name of our algorithm
algorithm_name=text-generation-webui

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Deploy the service

(1) Configuration file (service-text-generation-webui.yaml): change the image and the EFS claim to match your environment

---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-webui # Service name
  namespace: default
  labels:
    app: text-generation-webui    # Labels on the Service itself
spec:
  ports:
  - port: 5000  # Port used to reach the Service from inside the cluster
    protocol: TCP
    targetPort: 5000  # Port the target Pod listens on
    name: api
  - port: 5005
    protocol: TCP
    targetPort: 5005
    name: api-stream
  - port: 7860
    protocol: TCP
    targetPort: 7860
    name: web
  selector:
    app: text-generation-webui
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: text-generation-webui
  name: text-generation-webui
  namespace: default
spec:
  replicas: 1
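  revisionHistoryLimit: 10  # Number of old revisions kept after rolling updates
  selector: # Selects the Pods (and ReplicaSets) managed by this Deployment
    matchLabels:
      app: text-generation-webui
  strategy: # Update strategy
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate # Update type: rolling update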
  template:
    metadata:
      labels:
        app: text-generation-webui
    spec:
      tolerations:
        - key: "gpu-load"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
      - image: xxx
        imagePullPolicy: IfNotPresent
        name: text-generation-webui
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
          - name: persistent-storage-for-models
            mountPath: /home/app/text-generation-webui/models
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      volumes:
      - name: persistent-storage-for-models
        persistentVolumeClaim:
          claimName: efs-claim-text-generation-webui
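
The Deployment above mounts a PersistentVolumeClaim named efs-claim-text-generation-webui, which this article does not define. Since the efs-sc StorageClass created earlier carries no dynamic provisioning parameters, one plausible way to supply the claim is static provisioning: a PersistentVolume that points at an existing EFS file system plus a matching claim. The Python sketch below prints both objects as a JSON `List` that kubectl can apply. The `efs_volume_manifests` function, the `fs-REPLACE_ME` file system ID, the PV name, and the 5Gi size are all placeholders and assumptions, not values from the original setup.

```python
import json


def efs_volume_manifests(
    file_system_id: str,
    claim_name: str = "efs-claim-text-generation-webui",
    storage_class: str = "efs-sc",
    size: str = "5Gi",  # required by the API; EFS itself is elastic
    namespace: str = "default",
) -> dict:
    """Build a statically provisioned EFS PersistentVolume and its claim."""
    volume = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"{claim_name}-pv"},
        "spec": {
            "capacity": {"storage": size},
            "volumeMode": "Filesystem",
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "csi": {"driver": "efs.csi.aws.com", "volumeHandle": file_system_id},
        },
    }
    claim = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": claim_name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [volume, claim]}


if __name__ == "__main__":
    # Replace the placeholder with the ID of your own EFS file system.
    print(json.dumps(efs_volume_manifests("fs-REPLACE_ME"), indent=2))
```

If the script is saved as, say, make_efs_claim.py (a hypothetical file name), its output can be piped to `kubectl apply -f -`. ReadWriteMany is used so that several inference Pods can mount the same model directory.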

(2) Download the model files and copy them to the corresponding EFS path, for example the "TheBloke_Llama-2-7B-Chat-AWQ" model shown in the screenshot

(3) Deploy command

kubectl apply -f service-text-generation-webui.yaml

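4.4 Cluster Autoscaling

Autoscaling LLM inference is a fairly complex problem. Many factors come into play, including but not limited to the underlying instance type, the model size, whether the model is quantized, the number of input and output tokens, the inference parameters, and framework-specific settings, so designing one scheme that fits every scenario is very hard. Reasonable scaling metrics include "requests per unit time" and "request response time"

This solution uses "requests per unit time" as the basis for scaling Kubernetes Deployments. The following uses Text Generation Inference as the example; the configuration file is below. Because this is only a demonstration, the scaling threshold is very sensitive – an average of one request every 2 seconds per Pod is enough to trigger scale-out. In production, configure it according to actual load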

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2
metadata:
  name: llama2-13b-chat-awq
  namespace: tgi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-engine-llama2-13b-chat-awq
  minReplicas: 1
  maxReplicas: 3
  metrics:
  # use a "Pods" metric, which takes the average of the
  # given metric across all pods controlled by the autoscaling target
  - type: Pods
    pods:
      metric:
        name: tgi_request_per_second
      # target 500 milli-requests per second,
      # which is 1 request every two seconds
      target:
        type: AverageValue
        averageValue: 500m
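  behavior: # The key part
    scaleDown:
      stabilizationWindowSeconds: 300 # Before scaling in, observe for 5 minutes and act only if scale-in is still needed throughout
      policies:
      - type: Percent
        value: 100 # Allow removing all Pods (down to minReplicas)
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0 # Scale out immediately when needed
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15 # Every 15 s, add at most 100% of the current Pod count
      - type: Pods
        value: 4
        periodSeconds: 15 # Every 15 s, add at most 4 Pods
      selectPolicy: Max # Use whichever of the two policies allows more Pods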
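
To see what these numbers imply, the sketch below works through the replica calculation described in the Kubernetes HorizontalPodAutoscaler documentation: desired = ceil(current × average metric ÷ target), skipped when the ratio is within the default 10% tolerance, then clamped to minReplicas and maxReplicas. It is a simplified model, not the controller's implementation; the function names and the per-Pod readings in the usage note are made up for illustration.

```python
import math
from typing import Sequence


def desired_replicas(
    per_pod_rps: Sequence[float],
    target_average: float = 0.5,  # averageValue: 500m
    min_replicas: int = 1,
    max_replicas: int = 3,
    tolerance: float = 0.1,  # Kubernetes default tolerance
) -> int:
    """Replica count the HPA formula suggests for one metric reading per Pod."""
    current = len(per_pod_rps)
    ratio = (sum(per_pod_rps) / current) / target_average
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to the target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


def scale_up_limit(current: int, percent: int = 100, pods: int = 4) -> int:
    """Upper bound after one 15 s period with selectPolicy: Max."""
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    return max(by_percent, by_pods)


if __name__ == "__main__":
    print(desired_replicas([1.0]))  # one Pod at 1 req/s against a 0.5 target
    print(scale_up_limit(1))
```

With a single Pod serving 1 request per second, the formula asks for 2 replicas; at 2 requests per second it asks for 4, which maxReplicas caps at 3. The scale-up policies would allow going from 1 to 5 Pods in one 15-second period, so in this configuration maxReplicas, not the policy, is the effective limit.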

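Check the initial state of the Deployment: with no requests yet, the metric starts at 0
As load increases, the Deployment's Pod count scales out automatically, and Karpenter in turn triggers the addition of cluster nodes
After load has stayed low for 5 minutes, the Pods scale in automatically, and Karpenter in turn triggers the removal of cluster nodes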
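
The 5-minute delay before scale-in comes from stabilizationWindowSeconds: 300. As the Kubernetes documentation describes it, the autoscaler looks at the replica counts it computed during the window and uses the highest one, so a short dip in traffic does not remove Pods. The sketch below models that rule in a simplified form; the `ScaleDownStabilizer` class and the timestamps in the usage note are illustrative assumptions, and boundary handling in the real controller may differ.

```python
from collections import deque
from typing import Deque, Tuple


class ScaleDownStabilizer:
    """Keep recent recommendations and return the highest one in the window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._history: Deque[Tuple[float, int]] = deque()  # (timestamp, replicas)

    def recommend(self, now: float, desired: int) -> int:
        """Record `desired` at time `now` and return the stabilized value."""
        self._history.append((now, desired))
        # Drop recommendations that are older than the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)
```

For example, if 3 replicas were recommended at t=0 and only 1 replica is recommended from t=60 onward, the stabilized value stays at 3 until the t=0 entry leaves the 300-second window, and drops to 1 only after that.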

4.5 Validation

The solution currently exposes an OpenAI-compatible HTTP interface for calling the LLM inference service

Call the v1/completions endpoint

curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10
  }'
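
The same request can be issued from Python using only the standard library. This sketch assumes the endpoint returns the usual OpenAI completions shape, with the generated text at choices[0].text; the helper names `build_completion_request` and `complete` are hypothetical, and the default URL simply mirrors the curl command above.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://127.0.0.1:5000",
    max_tokens: int = 200,
    temperature: float = 1,
    top_p: float = 0.9,
    seed: int = 10,
) -> urllib.request.Request:
    """Build the POST request for the /v1/completions endpoint."""
    body = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "seed": seed,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 120.0, **options) -> str:
    """Send the request and return the generated text (assumed response shape)."""
    request = build_completion_request(prompt, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("This is a cake recipe:\n\n1."))
```

When calling through Kong or the Istio ingress gateway instead of a local port-forward, pass the externally reachable address as base_url.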

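Screenshot of the call

5. Summary

With this solution built on AWS cloud-native services, we offer enterprises a new way to deploy large language models smoothly and run them efficiently in their own environments. The solution follows cloud-native principles and combines multiple AWS services with open-source tools into a full-featured, flexible, scalable, and easy-to-operate platform for deploying and running LLMs.

On the technical side, we deploy the open-source LLM frameworks Text Generation WebUI, vLLM, and Text Generation Inference on an Amazon EKS cluster for LLM inference. We also modified Text Generation WebUI in several ways so that it runs stably on Amazon EKS and can use AWS Neuron accelerator chips to improve inference performance. In addition, our self-developed unified LLM application gateway layer adds production-grade capabilities such as high availability and load balancing. Together these form an end-to-end system for deploying and operating LLMs that reduces both the complexity and the total cost of ownership of adopting LLM capabilities in the enterprise.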

AWS Official Blog
