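Fine-Tuning the WizardCoder Model with Amazon SageMaker

This post walks through an example of fine-tuning the WizardCoder model with Amazon SageMaker.

The example covers:

WizardCoder overview
WizardCoder fine-tuning overview
WizardCoder environment setup
WizardCoder fine-tuning training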

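Introduction

Since ChatGPT burst onto the scene, foundation large language models have been released in quick succession in China and abroad, and a wide variety of applications have been built on top of them. A well-trained foundation model generalizes well, but it may never have seen the specialized terminology, domain-specific phrasing, or context of a particular field. Fine-tuning on domain-specific data lets the model adapt to the language patterns and structures of the target domain; combining the generality of the foundation model with domain specificity makes the model more useful in practice.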

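WizardCoder Overview

AI code assistants can help programmers work more efficiently and make fewer errors, and can offer intelligent code suggestions and optimizations. Many teams now use large language models as AI code assistants. However, most existing models are only pre-trained on large amounts of raw code and are not instruction fine-tuned. The WizardLM team therefore proposed WizardCoder, which adapts the Evol-Instruct method to the code domain to give Code LLMs complex instruction fine-tuning, and it has achieved strong performance on code-related tasks.

According to the results reported by the WizardLM team, WizardCoder outperforms all other open-source Code LLMs by a wide margin on four code generation benchmarks: HumanEval, HumanEval+, MBPP, and DS-1000. On HumanEval and HumanEval+, it even surpasses the largest closed-source LLMs, such as Anthropic's Claude and Google's Bard.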

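WizardCoder Fine-Tuning Overview

Model fine-tuning falls into two broad categories: full fine-tuning and PEFT (Parameter-Efficient Fine-Tuning). Full fine-tuning updates all model parameters, so it takes longer and needs more training resources. PEFT freezes most parameters and trains only a small part of the network; common approaches are LoRA and P-Tuning v2.

Because PEFT updates relatively few parameters, the model may not learn all of the domain knowledge, and inference can be unstable for specific tasks or domains. For this reason, most production systems fine-tune with the full-parameter approach, and this post likewise uses full-parameter fine-tuning to show how to fine-tune WizardCoder on SageMaker.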
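To make the difference in update size concrete, the following minimal sketch compares the number of trainable parameters for a single weight matrix under full fine-tuning and under LoRA. The matrix shape and the rank `r` are illustrative assumptions, not values taken from WizardCoder.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Full fine-tuning updates every entry of the d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA freezes the matrix and trains two low-rank factors:
    B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


# Illustrative shape and rank (assumptions for this sketch only)
d_out, d_in, r = 4096, 4096, 8

full = full_trainable_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, r)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={r}):      {lora:,} trainable parameters")
print(f"LoRA / full:      {lora / full:.4%}")
```

The low-rank factors hold well under one percent of the parameters of the full matrix in this example, which is the trade-off the paragraph above describes: far less to train, but also far less capacity to absorb new domain knowledge.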

WizardCoder Environment Setup

Note: the sample code for this project is available in the following repository: https://github.com/aws-samples/llm-workshop-on-amazon-sagemaker

Upgrade the Python SDK
```
pip install -U sagemaker
```

Get the runtime resources, including the region, role, account, and S3 bucket

import boto3
import sagemaker
from sagemaker import get_execution_role


sess                     = sagemaker.Session()
role                     = get_execution_role()
sagemaker_default_bucket = sess.default_bucket()

account                  = sess.boto_session.client("sts").get_caller_identity()["Account"]
region                   = sess.boto_session.region_name

WizardCoder Fine-Tuning Training

Fine-tuning preparation

Clone the code

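WizardCoder is a Code LLM instruction fine-tuned with the Evol-Instruct method (the 15B version used here builds on StarCoder rather than LLaMA). This example uses the FastChat platform released by the lm-sys team for the fine-tuning. FastChat was also used to train the well-known Vicuna model and has good code conventions and performance optimizations.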

git clone https://github.com/nlpxucan/WizardLM.git
cd WizardLM
git reset --hard 46d1ce7dbbb1f987ae5e5915c75f33b89a6a17ab

cd ../
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
git reset --hard 974537efbd82093b45e64d07904efe7728193a52

Download the original WizardCoder model

from huggingface_hub import snapshot_download
from pathlib import Path


local_cache_path = Path("./model")
local_cache_path.mkdir(exist_ok=True)

model_name = "WizardLM/WizardCoder-15B-V1.0"

# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.model", "*.py"]

model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_cache_path,
    allow_patterns=allow_patterns,
    revision='926ca1b215c4631bc5f8c3e47173381452c23e5c'
)

# Locate the directory that contains the downloaded model files
import os

local_model_path = None

for root, dirs, files in os.walk("./model"):
    if "config.json" in files:
        print(os.path.join(root, "config.json"))
        # Keep the trailing separator so the path can be used as a sync source
        local_model_path = os.path.join(root, "")
        print(local_model_path)
if local_model_path is None:
    print("Model download may have failed, please check the prior step!")

Copy the model and data to S3

chmod +x ./s5cmd
./s5cmd sync ${local_model_path} s3://${sagemaker_default_bucket}/llm/models/wizardcoder/WizardLM/WizardLM-15B/

rm -rf model

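Model fine-tuning

Fine-tuning updates all model parameters, so that the fine-tuned model behaves stably
Fine-tuning is accelerated with the open-source DeepSpeed framework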
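The training command later in this post passes a `ds.json` file to DeepSpeed. The repository's actual file is not reproduced here; the sketch below writes a minimal ZeRO Stage 3 configuration with bf16, using the `"auto"` placeholders that the Hugging Face Trainer integration fills in from its own arguments. The specific option values are assumptions for illustration.

```python
import json

# Hypothetical minimal DeepSpeed config; not the file shipped in the repository.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,  # shard parameters, gradients, and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

config_path = "ds.json"
with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=2)

print(f"wrote {config_path}")
```

Leaving the batch-size fields as `"auto"` keeps a single source of truth: the values come from the command-line options rather than being duplicated in the JSON file.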

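Prepare the base image

Use the SageMaker deep learning training image as the base image, then install the dependencies that WizardCoder training requires. The Dockerfile is as follows: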

%%writefile Dockerfile
## Change the region code below to the region you are using; this sample uses us-west-2
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04


ENV LANG=C.UTF-8
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

RUN pip3 uninstall -y deepspeed \
    && pip3 install deepspeed==0.10.0 \
    && pip3 install transformers==4.30.1 \
    && pip3 install accelerate==0.21.0

## Make all local GPUs visible
ENV NVIDIA_VISIBLE_DEVICES="all"

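Fine-tuning code

The fine-tuning source code is extensive; see the git repository above for details.

Fine-tuning parameters

DeepSpeed ZeRO Stage 3 is used to save GPU memory
bf16 is enabled during training to balance numeric range and precision
The dataset is the officially provided alpaca_data.json, in the typical {"instruction", "input", "output"} format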
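For reference, one record in this format and a prompt built from it might look like the sketch below. The record content is made up, and the templates follow the widely used Stanford Alpaca style; the exact prompt text used by the training script may differ, so treat them as an assumption.

```python
# A made-up record in the {"instruction", "input", "output"} format
record = {
    "instruction": "Write a Python function that returns the sum of a list.",
    "input": "",
    "output": "def total(xs):\n    return sum(xs)",
}

# Alpaca-style templates (assumed; check the training script for the real ones)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(rec: dict) -> str:
    """Pick the template based on whether the record has a non-empty input."""
    template = PROMPT_WITH_INPUT if rec.get("input") else PROMPT_NO_INPUT
    return template.format(**rec)


prompt = build_prompt(record)
print(prompt)
```

During supervised fine-tuning the `output` field is the target text that follows the prompt, so the model learns to produce the response given the instruction and optional input.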

DEEPSPEED_OPTS="""
    WizardLM/WizardCoder/src/train_wizardcoder.py 
    --deepspeed ds.json 
    --model_name_or_path "/tmp/wizardcoder_pretrain/" 
    --data_path WizardLM/WizardCoder/data/alpaca_data.json 
    --output_dir "/tmp/wizardcoder_out" 
    --num_train_epochs 1 
    --per_device_train_batch_size 1 
    --per_device_eval_batch_size  1 
    --gradient_accumulation_steps 4 
    --evaluation_strategy "no" 
    --save_strategy "no" 
    --save_steps 2000 
    --save_total_limit 1 
    --learning_rate 2e-5 
    --weight_decay 0. 
    --warmup_ratio 0.03 
    --lr_scheduler_type "cosine" 
    --logging_steps 1 
    --cache_dir "/tmp" 
    --model_max_length 512 
    --gradient_checkpointing True 
    --bf16 True 
    --tf32 True 
    --report_to "none"
"""
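With these settings, the effective (global) batch size follows from the per-device batch size, the gradient accumulation steps, and the number of GPUs. The sketch below assumes a single node with 8 GPUs; the dataset size is a hypothetical figure used only to show roughly how the warmup ratio turns into steps.

```python
import math

per_device_train_batch_size = 1   # from the options above
gradient_accumulation_steps = 4   # from the options above
warmup_ratio = 0.03               # from the options above
num_train_epochs = 1              # from the options above
num_gpus = 8                      # assumption: one node with 8 GPUs
num_examples = 20_000             # hypothetical dataset size

# Samples consumed per optimizer step across all GPUs
global_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
total_steps = math.ceil(num_examples / global_batch_size) * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"global batch size: {global_batch_size}")
print(f"optimizer steps:   {total_steps}")
print(f"warmup steps:      {warmup_steps}")
```

Raising `gradient_accumulation_steps` is the usual way to grow the global batch size without increasing per-GPU memory, at the cost of fewer optimizer steps per epoch.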

Fine-tuning script

Fine-tuning uses torchrun + DeepSpeed for distributed training

%%writefile ./src/ds-train-dist.sh
#!/bin/bash
CURRENT_HOST="${SM_CURRENT_HOST}"


IFS=',' read -ra hosts_array <<< "${SM_HOSTS}"
NNODES=${#hosts_array[@]}
NODE_RANK=0

for i in "${!hosts_array[@]}"; do
    if [[ "${hosts_array[$i]}" == *${CURRENT_HOST}* ]]; then
        echo "host index: $i"
        NODE_RANK="$i" 
    fi
done
   
    
MASTER_PORT="13579"
export NCCL_SOCKET_IFNAME="eth0"

# Configure the distributed arguments for torchrun.
GPUS_PER_NODE="$SM_NUM_GPUS"
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

chmod +x ./s5cmd
./s5cmd sync s3://$MODEL_S3_BUCKET/llm/models/wizardcoder/WizardLM/WizardLM-15B/* /tmp/wizardcoder_pretrain/

CMD="torchrun ${DISTRIBUTED_ARGS} ${DEEPSPEED_OPTS}"
echo ${CMD}
${CMD} 2>&1 

if [[ "${CURRENT_HOST}" == "${MASTER_ADDR}" ]]; then  
    ./s5cmd sync /tmp/wizardcoder_out s3://$MODEL_S3_BUCKET/llm/models/wizardcoder/output/WizardLM/WizardLM-15B/$(date +%Y-%m-%d-%H-%M-%S)/
fi
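The node-rank lookup at the top of the script can also be expressed in Python. This sketch assumes `SM_HOSTS` holds a JSON-encoded list of host names and `SM_CURRENT_HOST` holds one of them, as in the SageMaker training environment; the fallback values and the helper name `resolve_node_rank` are only for illustration.

```python
import json
import os
from typing import Tuple


def resolve_node_rank(hosts_json: str, current_host: str) -> Tuple[int, int]:
    """Return (number_of_nodes, rank_of_current_host)."""
    hosts = json.loads(hosts_json)
    return len(hosts), hosts.index(current_host)


# Fall back to a single-node layout when run outside a training job
hosts_json = os.environ.get("SM_HOSTS", '["algo-1"]')
current_host = os.environ.get("SM_CURRENT_HOST", "algo-1")

nnodes, node_rank = resolve_node_rank(hosts_json, current_host)
print(f"nnodes={nnodes} node_rank={node_rank}")
```

Parsing the list as JSON avoids the substring matching used in the shell loop, which matters once host names share a prefix (for example `algo-1` and `algo-10`).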

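Start fine-tuning

Full-parameter fine-tuning requires at least one ml.p4d.24xlarge instance (8 × A100 40 GB GPUs) as the training machine

When fine-tuning completes, the trained model is automatically stored in the specified S3 bucket and can be used for subsequent deployment and inference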
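A rough back-of-the-envelope estimate shows why a multi-GPU instance is needed. The sketch assumes about 15 billion parameters, mixed-precision training with Adam at roughly 16 bytes per parameter (2 for weights, 2 for gradients, 12 for the fp32 master copy and the two Adam states), and even sharding across 8 GPUs with ZeRO Stage 3. Activations, buffers, and fragmentation are ignored, so real usage is higher.

```python
num_params = 15e9               # assumption: roughly 15B parameters
bytes_per_param = 2 + 2 + 12    # weights + gradients + fp32 master copy and Adam states
num_gpus = 8                    # GPUs on one training instance

total_gb = num_params * bytes_per_param / 1e9
per_gpu_gb = total_gb / num_gpus

print(f"model states in total:      {total_gb:.0f} GB")
print(f"per GPU with ZeRO Stage 3:  {per_gpu_gb:.0f} GB")
```

Under these assumptions the model states alone exceed the memory of any single GPU, and only fit once they are sharded across all eight devices.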

import time
from sagemaker.estimator import Estimator

environment = {
              'MODEL_S3_BUCKET': sagemaker_default_bucket # The bucket to store pretrained model and fine-tune model
}

base_job_name = 'wizardcoder-15b-finetune'

instance_type = 'ml.p4d.24xlarge'

estimator = Estimator(role=role,
                      entry_point='ds-train-dist.sh',
                      source_dir='./src',
                      base_job_name=base_job_name,
                      instance_count=1,
                      instance_type=instance_type,
                      image_uri=image_uri,  # ECR URI of the custom training image built from the Dockerfile above
                      environment=environment,
                      disable_profiler=True,
                      debugger_hook_config=False)


estimator.fit()

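Summary

Large language models are on the rise and are changing and influencing the world in many ways. As customers embrace large language models, the Amazon Web Services team is likewise investing deeply in customer needs and large language model technology, so that it can better help customers meet their requirements and increase business value.

Amazon AWS Official Blog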