Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips

The use of large language models (LLMs) and generative AI has exploded over the last year. With the release of powerful publicly available foundation models, tools for training, fine-tuning, and hosting your own LLM have also become democratized. Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high-performance inference and scalability.

In this post, we will walk you through how you can quickly deploy Meta’s latest Llama models, using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance. For this example, we will use the 1B version, but other sizes can be deployed using these steps, along with other popular LLMs.

Deploy vLLM on AWS Trainium and Inferentia EC2 instances

In these sections, you will be guided through using vLLM on an AWS Inferentia EC2 instance to deploy Meta’s newest Llama 3.2 model. You will learn how to request access to the model, build a Docker container that uses vLLM to deploy the model, and run online and offline inference on the model. We will also cover performance tuning of the inference graph.

Prerequisite: Hugging Face account and model access

To use the meta-llama/Llama-3.2-1B model, you’ll need a Hugging Face account and access to the model. Please go to the model card, sign up, and agree to the model license. You will then need a Hugging Face token, which you can get by following these steps. When you get to the Save your Access Token screen, as shown in the following figure, make sure you copy the token because it will not be shown again.

Create an EC2 instance

You can create an EC2 Instance by following the guide. A few things to note:

If this is your first time using Inf/Trn instances, you will need to request a quota increase.
You will use inf2.xlarge as your instance type. inf2.xlarge instances are only available in these AWS Regions.
Increase the gp3 volume to 100 GiB.
You will use Deep Learning AMI Neuron (Ubuntu 22.04) as your AMI, as shown in the following figure.
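
If you prefer to script the launch rather than use the console, the settings listed above can be expressed as parameters for the `run_instances` call of the boto3 EC2 client. The following is a minimal sketch, not part of the original walkthrough: the AMI ID, key pair name, and Region are placeholders you must supply, and the root device name `/dev/sda1` is an assumption that you should verify against the AMI you select.

```python
from typing import Any, Dict


def build_run_instances_params(ami_id: str, key_name: str) -> Dict[str, Any]:
    """Return keyword arguments for the boto3 EC2 client's run_instances call.

    ami_id and key_name are placeholders supplied by the caller; look up the
    ID of the Deep Learning AMI Neuron (Ubuntu 22.04) image in your Region.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": "inf2.xlarge",
        "KeyName": key_name,
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {
                # Assumption: root device name of the AMI; verify before use.
                "DeviceName": "/dev/sda1",
                "Ebs": {"VolumeSize": 100, "VolumeType": "gp3"},
            }
        ],
    }


def launch_instance(ami_id: str, key_name: str, region: str) -> str:
    """Launch the instance and return its ID (requires boto3 and credentials)."""
    import boto3  # imported lazily so the parameter builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**build_run_instances_params(ami_id, key_name))
    return response["Instances"][0]["InstanceId"]
```

Calling `launch_instance` with your own AMI ID, key pair, and Region starts a single inf2.xlarge instance with a 100 GiB gp3 root volume. The quota increase mentioned above still applies.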

After the instance is launched, you can connect to it to access the command line. In the next step, you’ll use Docker (preinstalled on this AMI) to build and run a vLLM container image for Neuron.

Start vLLM server

You will use Docker to create a container with all the tools needed to run vLLM. Create a Dockerfile using the following command:

cat > Dockerfile <<\EOF
# default base image
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
FROM $BASE_IMAGE
RUN echo "Base image is $BASE_IMAGE"
# Install some basic utilities
RUN apt-get update && \
    apt-get install -y \
        git \
        python3 \
        python3-pip \
        ffmpeg libsm6 libxext6 libgl1
### Mount Point ###
# When launching the container, mount the code directory to /app
ARG APP_MOUNT=/app
VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
ENV VLLM_TARGET_DEVICE=neuron
RUN git clone https://github.com/vllm-project/vllm.git && \
    cd vllm && \
    git checkout v0.6.2 && \
    # Quote version specifiers so the shell does not treat ">" as a redirect
    python3 -m pip install -U \
        "cmake>=3.26" ninja packaging "setuptools-scm>=8" wheel jinja2 \
        -r requirements-neuron.txt && \
    pip install --no-build-isolation -v -e . && \
    pip install --upgrade triton==3.0.0
CMD ["/bin/bash"]
EOF

Then run:

docker build . -t vllm-neuron

Building the image will take about 10 minutes. After it’s done, use the new Docker image (replace YOUR_TOKEN_HERE with the token from Hugging Face):

export HF_TOKEN="YOUR_TOKEN_HERE"
docker run \
        -it \
        -p 8000:8000 \
        --device /dev/neuron0 \
        -e HF_TOKEN=$HF_TOKEN \
        -e NEURON_CC_FLAGS=-O1 \
        vllm-neuron

You can now start the vLLM server with the following command:

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

This command runs vLLM with the following parameters:

serve meta-llama/Llama-3.2-1B: The Hugging Face model ID of the model that is being deployed for inference.
--device neuron: Configures vLLM to run on the Neuron device.
--tensor-parallel-size 2: Sets the number of partitions for tensor parallelism. inf2.xlarge has 1 Neuron device, and each Neuron device has 2 Neuron cores.
--max-model-len 4096: This is set to the maximum sequence length (input tokens plus output tokens) for which to compile the model.
--block-size 8: For Neuron devices, this is internally set to the value of max-model-len.
--max-num-seqs 32: This is set to the hardware batch size or a desired level of concurrency that the model server needs to handle.
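
To make the relationship between these parameters explicit, the following sketch assembles the same command from its inputs. It is an illustration only: the helper name `build_serve_command` is hypothetical, and deriving the tensor parallel size as the number of Neuron devices times 2 cores per device is an assumption based on the description above.

```python
import shlex
from typing import List

# Per the description above: each Neuron device has 2 Neuron cores.
NEURON_CORES_PER_DEVICE = 2


def build_serve_command(
    model_id: str,
    neuron_devices: int = 1,
    max_model_len: int = 4096,
    block_size: int = 8,
    max_num_seqs: int = 32,
) -> List[str]:
    """Assemble the vllm serve argument list (hypothetical helper)."""
    # Assumption: use every available Neuron core for tensor parallelism.
    tensor_parallel_size = neuron_devices * NEURON_CORES_PER_DEVICE
    return [
        "vllm", "serve", model_id,
        "--device", "neuron",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--block-size", str(block_size),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
    ]


# Print the command for an inf2.xlarge (1 Neuron device).
print(shlex.join(build_serve_command("meta-llama/Llama-3.2-1B")))
```

With the defaults, this prints the same `vllm serve` command shown earlier. Passing a different `model_id` or `max_num_seqs` gives you the matching command for another deployment.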

The first time you load a model, if there isn’t a previously compiled model, it will need to be compiled. This compiled model can optionally be saved so the compilation step is not necessary if the container is recreated. After everything is done and the model server is running, you should see the following logs:

Avg prompt throughput: 0.0 tokens/s ...

This means that the model server is running, but it isn’t yet processing requests because none have been received. You can now detach from the container by pressing Ctrl+P followed by Ctrl+Q.

Inference

When you started the Docker container, you passed the option -p 8000:8000. This told Docker to forward port 8000 on the host (your EC2 instance) to port 8000 in the container. When you run the following command on the instance, you should see that the model server is serving meta-llama/Llama-3.2-1B.

curl localhost:8000/v1/models

This should return something like:

{"object":"list","data":[{"id":"meta-llama/Llama-3.2-1B","object":"model","created":1732552038,"owned_by":"vllm","root":"meta-llama/Llama-3.2-1B","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-6d44a6f6e52447eb9074b13ae1e9e285","object":"model_permission","created":1732552038,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
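
If you want to check the server from a script instead of reading the raw JSON, a response of this shape can be parsed with the Python standard library. This is a minimal sketch: the helper names are hypothetical, and it relies only on the `data` and `id` fields visible in the response above.

```python
import json
import urllib.request
from typing import List


def parse_model_ids(response_body: str) -> List[str]:
    """Extract the model IDs from a /v1/models response body."""
    payload = json.loads(response_body)
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000") -> List[str]:
    """Query a running server and return the IDs of the models it serves."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as response:
        return parse_model_ids(response.read().decode("utf-8"))
```

Calling `fetch_model_ids()` on the instance while the server is running should return a list containing meta-llama/Llama-3.2-1B.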

Now, send it a prompt:

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

You should get back a response similar to the following from vLLM:

ubuntu@ip-172-31-13-178:~$ curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
  % Total    % Received % Xferd  Average Speed   Time    Time    Time  Current
                                 Dload  Upload   Total   Spent  Left  Speed
100  1067  100   966  100   101    108     11  0:00:09  0:00:08 0:00:01   258
" How does it work?\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system 
that can learn and adapt to new situations and environments. Gen AI is designed to be able to learn and adapt to new situations and environments in a way that is similar to how the human brain does.\nGen AI is 
a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new 
situations and environments."
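
The same request can be sent from Python without extra dependencies. The following sketch mirrors the curl command above using only the standard library. The function names are hypothetical, and the request fields are limited to the ones used in this post.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "meta-llama/Llama-3.2-1B",
    max_tokens: int = 128,
    temperature: float = 0.0,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    body = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the first generated text (like the jq filter)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["choices"][0]["text"]
```

For example, `complete("What is Gen AI?")` returns the same text that the jq filter extracts in the curl example.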

Offline inference with vLLM

Another way to use vLLM on Inferentia is to send a batch of requests together from a script. This is useful for automation or when you have a set of prompts that you want to process in a single run.

You can reattach to your Docker container and stop the online inference server with the following:

docker attach $(docker ps --format "{{.ID}}")

At this point, you should see a blank cursor. Press Ctrl+C to stop the server, and you should be back at the bash prompt in the container. Create a file for using the offline inference engine:

cat > offline_inference.py <<'EOF'
from vllm.entrypoints.llm import LLM
from vllm.sampling_params import SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="meta-llama/Llama-3.2-1B",
        max_num_seqs=32,
        max_model_len=4096,
        block_size=8,
        device="neuron",
        tensor_parallel_size=2)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

EOF

Now, run the script with python offline_inference.py, and you should get back responses for the four prompts. This may take a minute because the model needs to be loaded again.

Processed prompts: 100%|██████████| 4/4 [00:01<00:00,  2.53it/s, est. speed input: 16.46 toks/s, output: 40.51 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Anna and I am the 4th year student of the Bachelor of Engineering at'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. A'
Prompt: 'The capital of France is', Generated text: ' also the most expensive city to live in. The average cost of living in Paris'
Prompt: 'The future of AI is', Generated text: ' now\nThe 10 most influential AI professionals to watch in 2019\n'

You can now type exit, press Enter, and then press Ctrl+C to shut down the Docker container and return to your Inf2 instance.

Clean up

Now that you’re done testing the Llama 3.2 1B LLM, you should terminate your EC2 instance to avoid additional charges.

Performance tuning for variable sequence lengths

You will probably have to process variable-length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of the buckets. To fine-tune performance based on the length of input and output tokens in the inference requests, you can set two kinds of buckets, corresponding to the two phases of LLM inference, through the following environment variables, each given as a list of integers:

NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated length of prompts during inference.
NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of two within your generation length.
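
As an illustration of how such bucket lists can be produced and what they mean for a request, the sketch below generates evenly spaced context buckets and power-of-two token generation buckets, formats them as environment variable values, and picks the bucket for a given length. The helper names are hypothetical. The selection rule (the smallest bucket that fits the sequence) is a simplified assumption about bucketing in general, not the Neuron SDK's implementation.

```python
import bisect
from typing import List, Sequence


def linear_buckets(start: int, stop: int, step: int) -> List[int]:
    """Evenly spaced buckets from start to stop (inclusive)."""
    return list(range(start, stop + 1, step))


def power_of_two_buckets(smallest: int, largest: int) -> List[int]:
    """All powers of two between smallest and largest (inclusive)."""
    buckets = []
    size = 1
    while size <= largest:
        if size >= smallest:
            buckets.append(size)
        size *= 2
    return buckets


def to_env_value(buckets: Sequence[int]) -> str:
    """Format buckets as the comma-separated list used in the environment variables."""
    return ",".join(str(bucket) for bucket in buckets)


def select_bucket(buckets: Sequence[int], length: int) -> int:
    """Assumed rule: use the smallest bucket that can hold `length` tokens."""
    ordered = sorted(buckets)
    index = bisect.bisect_left(ordered, length)
    if index == len(ordered):
        raise ValueError(f"length {length} exceeds the largest bucket {ordered[-1]}")
    return ordered[index]


context_buckets = linear_buckets(1024, 2048, 256)
token_gen_buckets = power_of_two_buckets(256, 1024)
print("NEURON_CONTEXT_LENGTH_BUCKETS=" + to_env_value(context_buckets))
print("NEURON_TOKEN_GEN_BUCKETS=" + to_env_value(token_gen_buckets))
```

Under this assumed rule, a 1,300-token prompt would map to the 1536 bucket. This is why choosing context buckets close to your typical prompt lengths limits the amount of padding.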

You can use the docker run command to set the environment variables when starting the container for the vLLM server (remember to replace YOUR_TOKEN_HERE with your Hugging Face token):

export HF_TOKEN="YOUR_TOKEN_HERE"
docker run \
        -it \
        -p 8000:8000 \
        --device /dev/neuron0 \
        -e HF_TOKEN=$HF_TOKEN \
        -e NEURON_CC_FLAGS=-O1 \
        -e NEURON_CONTEXT_LENGTH_BUCKETS="1024,1280,1536,1792,2048" \
        -e NEURON_TOKEN_GEN_BUCKETS="256,512,1024" \
        vllm-neuron

You can then start the server using the same command:

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

Because the model graph has changed, the model will need to be recompiled. If the container was terminated, the model will also be downloaded again. After the server is running, detach from the container by pressing Ctrl+P followed by Ctrl+Q, and then send a request using the same command as before:

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

For more information about how to configure the buckets, see the developer guide on bucketing. Note that NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate in the documentation, and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions.

Conclusion

You’ve just seen how to deploy meta-llama/Llama-3.2-1B using vLLM on an Amazon EC2 Inf2 instance. If you’re interested in deploying other popular LLMs from Hugging Face, you can replace the model ID in the vllm serve command. More details on the integration between the Neuron SDK and vLLM can be found in the Neuron user guide for continuous batching and the vLLM guide for Neuron.

After you’ve identified a model that you want to use in production, you will want to deploy it with autoscaling, observability, and fault tolerance. You can also refer to this blog post to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post of this series, we’ll go into using Amazon EKS with Ray Serve to deploy vLLM into production with autoscaling and observability.

About the authors

Omri Shiv is an Open Source Machine Learning Engineer focusing on helping customers through their AI/ML journey. In his free time, he likes cooking, tinkering with open source and open hardware, and listening to and playing music.

Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.

AWS Machine Learning Blog