AWS Compute Blog

Learn how to deploy Falcon 2 11B on Amazon EC2 c7i instances for model inference

by Anusha Jampala, Asif Mujawar, and Abdullatif Rashdan

This post is written by Paul Tran, Senior Specialist SA; Asif Mujawar, Specialist SA Leader; Abdullatif AlRashdan, Specialist SA; and Shivagami Gugan, Enterprise Technologist.

Technology Innovation Institute (TII) has developed the Falcon 2 11B foundation model (FM), a next-generation AI model that can now be deployed on Amazon Elastic Compute Cloud (Amazon EC2) c7i instances, which support Intel Advanced Matrix Extensions (Intel AMX). Intel AMX is designed to accelerate matrix operations on a CPU, which is fundamental for deep learning and AI workloads.

This post walks through the concept of model quantization and why quantization is vital for real-time use cases. By the end of this post, you will be able to reproduce the process and run the model on a CPU.

Open source developers and customers can now use the EC2 C7i instance family to run their AI-powered applications with the Falcon 2 11B model on a CPU, offering a cost-effective alternative to the traditional GPU route. It also unlocks deployment of the large language model (LLM) on widely available hardware, enabling efficient and scalable AI applications.

Falcon 2 11B model

Falcon 2 11B is the first FM from TII’s newly released Falcon 2 series. It is a more efficient and accessible LLM that was trained on a massive dataset of 5.5 trillion tokens consisting of web data from RefinedWeb. The model is built on a causal decoder-only architecture with 11 billion trainable parameters, making it powerful for autoregressive tasks. It’s equipped with multilingual capabilities and can seamlessly tackle tasks in English, French, Spanish, German, Portuguese, and other languages for diverse scenarios. The new Falcon 2 series, trained on Amazon SageMaker, widens the capabilities of Falcon by making it more efficient and multilingual. More detailed information on the previous generation of Falcon models can be found in RefinedWeb, Falcon 40B foundation model from TII available on SageMaker JumpStart, and Falcon 180B foundation model from TII is now available via Amazon SageMaker JumpStart.

Falcon 2 11B is supported by the Amazon SageMaker Text Generation Inference (TGI) Deep Learning Container (DLC), an open source, purpose-built solution for deploying and serving LLMs that enables high-performance text generation using tensor parallelism and dynamic batching. The model is available under the TII Falcon License 2.0, a permissive Apache 2.0-based software license that includes an acceptable use policy promoting the responsible use of AI. In addition, Falcon 2 11B is now available on Amazon SageMaker JumpStart.

INT8 and INT4 quantization

For real-time applications such as Retrieval Augmented Generation (RAG) or code generation, reducing the latency for generating the first token is crucial, and minimizing the time between generating subsequent tokens promotes a seamless experience. These benefits are essential for applications that require real-time interaction, where even minor delays could disrupt the user experience and hinder productivity.

The OpenVINO framework enables INT8 and INT4 quantization and weight compression. This optimizes performance by reducing the model size and computational demands, making generative AI more cost-effective. These optimizations lower inference cost while maintaining high quality outputs.
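
To make the idea concrete, the following minimal sketch (plain NumPy, not OpenVINO itself) quantizes a small FP32 weight tensor to INT8 with a single symmetric scale and measures the round-trip error. OpenVINO's actual weight compression is more sophisticated (per-channel scales and group-wise INT4, for example), but the principle is the same: store weights in fewer bits and dequantize on the fly.

import numpy as np

# Toy FP32 "weights" standing in for one layer of the model
w_fp32 = np.random.randn(4, 8).astype(np.float32)

# Symmetric INT8 quantization: one scale for the whole tensor
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to estimate the accuracy cost of the compression
w_restored = w_int8.astype(np.float32) * scale
print("max absolute error:", np.abs(w_fp32 - w_restored).max())
print("storage: 4 bytes/weight (FP32) -> 1 byte/weight (INT8)")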

Intel AMX is a new built-in accelerator that improves the performance of deep learning training and inference on the CPU. It provides significant performance advantages for INT8 and INT4 inferencing by accelerating the matrix multiplications that dominate deep learning workloads. By using AMX, INT8 and INT4 precision models can run faster and more efficiently without substantial loss in accuracy, enabling quicker inference and lower power consumption.
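
To confirm that the instance you are running on exposes AMX, you can inspect the CPU flags. The following small sketch reads /proc/cpuinfo on Linux; on c7i instances you should see flags such as amx_tile, amx_int8, and amx_bf16.

# Check for Intel AMX support on a Linux host (for example, an EC2 c7i instance)
with open("/proc/cpuinfo") as f:
    flags = {flag for line in f if line.startswith("flags") for flag in line.split()}

amx_flags = sorted(flag for flag in flags if flag.startswith("amx"))
print("AMX flags found:", amx_flags if amx_flags else "none")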

Benchmark results summary

The c7i.24xlarge instance was used to benchmark the inference performance of the Falcon 2 11B model. The results showcase the effectiveness of INT8 and INT4 quantization with the OpenVINO toolkit, demonstrating significant improvements in latency and throughput.

The code to reproduce the benchmarks on the c7i instance can be found in the openvino.genai repository. Note that benchmark performance varies by use, configuration, and other factors.
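
The llm_bench scripts in that repository are the reference way to reproduce the numbers below. As a rough illustration of what they measure, the following sketch times the first generated token separately from the remaining tokens using the Hugging Face generate() API; it assumes the model has already been quantized to the INT8 output directory created later in this post.

import time
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

# Assumed path: the INT8 weights produced by the conversion step later in this post
model_path = "model_weights/int8/pytorch/dldt/compressed_weights/OV_FP32-INT8/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = OVModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")

# First-token latency: time the generation of exactly one new token
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)
first_token_ms = (time.perf_counter() - start) * 1000

# Approximate second-token latency: time a longer completion and exclude the first token
n_tokens = 128
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=n_tokens)
total_ms = (time.perf_counter() - start) * 1000
second_token_ms = (total_ms - first_token_ms) / (n_tokens - 1)

print(f"first token: {first_token_ms:.1f} ms, subsequent tokens: {second_token_ms:.1f} ms/token")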

Case 1

Quantization technique using INT8 Falcon 2 11B with OpenVINO on c7i.24xlarge.

Quantization | Batch | Input prompt tokens | Output tokens | First token latency (ms/token) | Second token latency (ms/token) | Throughput (tokens/s)
INT8 | 1 | 32 | 128 | 148.07 | 82.51 | 12.12
INT8 | 1 | 64 | 128 | 189.98 | 79.74 | 12.54
INT8 | 1 | 128 | 128 | 283.50 | 80.26 | 12.46
INT8 | 1 | 512 | 128 | 1037.25 | 82.03 | 12.19
INT8 | 1 | 1024 | 128 | 1961.91 | 84.76 | 11.80
INT8 | 1 | 2048 | 128 | 4068.90 | 90.40 | 11.06

Table 1: Quantization technique using INT8 Falcon 2 11B with OpenVINO on c7i.24xlarge

Case 2

Quantization technique using INT4 Falcon 2 11B with OpenVINO on c7i.24xlarge.

Quantization | Batch | Input prompt tokens | Output tokens | First token latency (ms/token) | Second token latency (ms/token) | Throughput (tokens/s)
INT4 | 1 | 32 | 128 | 142.00 | 59.06 | 16.93
INT4 | 1 | 64 | 128 | 195.73 | 82.58 | 12.11
INT4 | 1 | 128 | 128 | 274.67 | 80.40 | 12.44
INT4 | 1 | 512 | 128 | 991.34 | 82.22 | 12.16
INT4 | 1 | 1024 | 128 | 1922.87 | 85.18 | 11.74
INT4 | 1 | 2048 | 128 | 4079.58 | 90.11 | 11.10

Table 2: Quantization technique using INT4 Falcon 2 11B with OpenVINO on c7i.24xlarge

As these results show, INT8 quantization provides excellent latency and throughput numbers. This demonstrates how suitable inference on a CPU is for real-time use cases such as RAG or code generation. Increased performance can also be obtained with more aggressive quantization, such as INT4.

AWS customers can now explore and deploy Falcon 2 11B on c7i instances across AWS Regions, taking advantage of the performance improvements delivered by the OpenVINO toolkit.

Quantize Falcon 2 11B using OpenVINO and run inference

Developers can use the following approach to quantize Falcon 2 11B and optimize it to run on CPU instances. The model is pulled from the Hugging Face Hub through the OpenVINO framework. A full list of compatible models can be found at AI Models verified for OpenVINO.

Model quantization requires a large amount of RAM, peaking at over 116 GB for 11 billion parameters at full precision, hence the choice of running the experiments on a c7i.24xlarge instance, which is equipped with 192 GB of RAM.

The following figure shows the memory usage during quantization.

Figure 1: Model quantization

Once quantization is complete, the converted model weights are stored on the instance's disk, ready to be consumed for inference.

The memory requirement for inference is lower than for quantization: the memory footprint of the quantized model at inference time is approximately 12 GB, as illustrated in the following figure. Running inference on the c7i.24xlarge instance allows you to benefit from the higher core count, which directly correlates with inference speed.

Figure 2: Memory footprint of quantized model for inference
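
If you want to verify the memory footprint on your own instance, one quick way is to sample the resident set size of the Python process before and after loading the quantized model. The following sketch uses psutil (installed separately with pip install psutil) and assumes the INT8 output directory produced by the conversion step in the next section.

import psutil
from optimum.intel.openvino import OVModelForCausalLM

def rss_gb():
    # Resident set size of the current process, in GB
    return psutil.Process().memory_info().rss / 1024**3

print(f"RSS before loading the model: {rss_gb():.1f} GB")

# Assumed path: the INT8 weights produced by convert.py in the next section
model = OVModelForCausalLM.from_pretrained(
    "model_weights/int8/pytorch/dldt/compressed_weights/OV_FP32-INT8/"
)
print(f"RSS after loading the model: {rss_gb():.1f} GB")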

Quantize Falcon 2 11B

To quantize Falcon 2 11B using OpenVINO, complete the following steps:

1. Create a c7i.24xlarge EC2 instance in your AWS account.
2. Size the Amazon Elastic Block Store (Amazon EBS) storage depending on the number of models you need to experiment with. For example, the Falcon 2 11B model weights at full precision require approximately 24 GB of storage. We recommend at least 150 GB of storage so you can experiment with different precisions and models.
3. Connect to your EC2 instance. See Connect to your EC2 instance for the different options you can use to access your instance.
4. Create a virtual environment using the following code example.

python3 -m venv ov-llm-bench-env
source ov-llm-bench-env/bin/activate
pip install --upgrade pip

5. Clone the repository and navigate to the source directory:

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/

6. Install dependencies:

pip install -r requirements.txt

7. Run model conversion:

python convert.py --model_id tiiuae/falcon-11B --output_dir model_weights/int8/ --compress_weights INT8 4BIT_DEFAULT

This command performs quantization on the Falcon 2 11B model using OpenVINO. It downloads the model from the Hugging Face Hub and writes the converted model to the path specified in --output_dir.
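
As an alternative to the convert.py script, recent versions of the optimum-intel integration can download and compress the weights directly from Python. The following is a sketch only; the exact arguments supported depend on your optimum-intel version, and the output directory name is chosen here for illustration. The convert.py path above is the one used for the benchmarks in this post.

from optimum.intel.openvino import OVModelForCausalLM

# Download Falcon 2 11B from the Hugging Face Hub, export it to OpenVINO IR,
# and compress the weights to INT8 (support depends on your optimum-intel version)
model = OVModelForCausalLM.from_pretrained(
    "tiiuae/falcon-11B",
    export=True,
    load_in_8bit=True,
)

# Hypothetical output directory for this example
model.save_pretrained("model_weights/int8-optimum/")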

Test Falcon 2 11B for inference

OpenVINO provides a convenient interface that allows you to use the model with the Hugging Face Transformers library. Run the following Python script on your EC2 instance to test inference:

from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

if __name__ == '__main__':
    # Path to the INT8 weights produced by the conversion step
    model_path = "model_weights/int8/pytorch/dldt/compressed_weights/OV_FP32-INT8/"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = OVModelForCausalLM.from_pretrained(model_path)

    # Tokenize the prompt and generate a completion of up to 200 tokens
    inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
    outputs = model.generate(**inputs, max_length=200)
    text = tokenizer.batch_decode(outputs)[0]
    print(text)  # Falcon 2 11B inference completion

The following screenshot shows a sample output generated by the previous code. The output is limited to a maximum of 200 tokens.

Figure 3: Inference results after running Python script
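
For the real-time use cases mentioned earlier, such as RAG or code generation, you typically want to stream tokens to the user as they are produced instead of waiting for the full completion. The following minimal sketch uses the Transformers TextStreamer with the same quantized model; the prompt is only an example.

from transformers import AutoTokenizer, TextStreamer
from optimum.intel.openvino import OVModelForCausalLM

model_path = "model_weights/int8/pytorch/dldt/compressed_weights/OV_FP32-INT8/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = OVModelForCausalLM.from_pretrained(model_path)

# Print tokens to stdout as soon as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
inputs = tokenizer("Explain Retrieval Augmented Generation in one paragraph.", return_tensors="pt")
model.generate(**inputs, max_new_tokens=100, streamer=streamer)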

Conclusion

In this post, we showcased quantizing Falcon 2 11B and optimizing its performance using OpenVINO on AMX-capable EC2 c7i instances. This enables deployment of LLM-based applications on widely available CPUs as a cost-effective alternative to GPUs without sacrificing speed. These optimizations significantly lower inference costs while maintaining high-quality outputs.

For more information, refer to Amazon EC2 C7i and C7i-flex instances, SageMaker JumpStart pre-trained models, Amazon SageMaker JumpStart Foundation Models, OpenVINO, and the reference code for quantization.

Leave your feedback in the comments so we can continue to improve upon this benchmark.