AWS Machine Learning Blog
Use Llama 3.1 405B for synthetic data generation and distillation to fine-tune smaller models
Llama 3.3 70B is now available on SageMaker Inference through the SageMaker Jumpstart Model Hub.
For Details see Llama 3.3 70B now available in Amazon SageMaker JumpStart.
Today, we are excited to announce the availability of the Llama 3.1 405B model on Amazon SageMaker JumpStart, and Amazon Bedrock in preview. The Llama 3.1 models are a collection of state-of-the-art pre-trained and instruct fine-tuned generative artificial intelligence (AI) models in 8B, 70B, and 405B sizes. Amazon SageMaker JumpStart is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. Amazon Bedrock offers a straightforward way to build and scale generative AI applications with Meta Llama models, using a single API.
In this post, we show how to use Llama 3.1 405B to generate data (labels for a sample dataset), and how to use the generated data for distillation to fine-tune a smaller model like Llama 3 8B to generate better responses compared to the non-fine-tuned model. We also provide the code notebook that you can use to run and test the solution. The same approach can be used for fine tuning any smaller model including Llama 3.1 8b.
Overview of Llama 3.1 405B
The Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text out). All models support long context length (128,000) and are optimized for inference with support for grouped query attention (GQA). The Llama 3.1 instruction tuned text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the publicly available chat models on common industry benchmarks.
Llama 3.1 405B is the first publicly available model that rivals the top models in AI when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation. There are some unique ways to use it—in addition to direct inference, you can use the Llama 3.1 405B model to generate synthetic data to improve smaller models, and it can be a powerful domain-specific model by acting as the base model for domain-specific fine-tuning.
Llama 3.1 models are available today for inferencing on SageMaker JumpStart and Amazon Bedrock. On SageMaker JumpStart, they are rolling out to all AWS Regions where SageMaker JumpStart is available and support the required instance types. Llama 3.1 405B will require P5 instances on Amazon SageMaker. The Llama 3.1 models are also available today in the us-west-2
Region on Amazon Bedrock, with planned future expanded Regional availability.
Prerequisites
The following prerequisites are needed to implement the steps outlined in this post:
- An AWS account that will contain all your AWS resources.
- An AWS Identity and Access Management (IAM) role to access SageMaker and Amazon Bedrock. For more information, refer to Identity and Access Management for Amazon SageMaker and Identity and access management for Amazon Bedrock.
- Access to Amazon SageMaker Studio or a SageMaker notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code.
Responses from the Llama 3 8B Instruct model
Firstly, we perform inference with the Llama 3 8B model either directly through Amazon Bedrock or a deployed endpoint using SageMaker JumpStart. With Llama 3 Instruct models, which are optimized for dialogue use cases, the input to the model endpoints is the previous history between the chat assistant and the user. We can ask context-aware questions to conversations that have happened so far, using specific formatting for the input text (described in our earlier Llama 3B release posts, Meta Llama 3 models are now available in Amazon Bedrock and Meta Llama 3 models are now available in Amazon SageMaker JumpStart).
In the following example, the user has a conversation with the assistant about tourist sites in Paris. The assistant generated four different recommendation options, and then the user inquires about the first option:
The Llama 3 8B model is able to generate answers for the questions without issues.
Next, let’s test the ability of Llama 3 8B to answer logical and arithmetic questions (derived from Hugging Face’s AQUA-RAT dataset—instead of multiple choice options, we ask for full answers) as follows:
This answer looks almost correct but not quite. The correct answer is 31 inches long. Similar logical questions are not answered correctly by the Llama 3 8B model.
In order for the Llama 3 8B model to improve its logical question answering capability, we want to fine-tune the model with data from the AQUA-RAT dataset. As we already mentioned, the AQUA-RAT dataset contains multiple choice options for the LLM to choose from. Because we don’t have the full answers for this dataset, we use the Llama 3.1 405B model to generate the verbal answer to the questions, and use that dataset to fine-tune the Llama 3 8B model.
Generate label data using Llama 3.1 405B
Because Llama 3.1 405B is the most capable of the Llama 3.1 collection of models, and because of its state-of-the-art math and general knowledge capabilities, we run direct inference of the questions in the AQUA-RAT dataset on Llama 3.1 405B using either SageMaker JumpStart or Amazon Bedrock. This helps us generate the answers we want to use to fine-tune the smaller Llama 3 8B models. In essence, we’re using Llama 3.1 405B as an alternative to human annotation to generate labels for the dataset. The following are example inference outputs from the 405B model:
We can clearly see that the 405B answer is logically and mathematically correct, and we can use this answer in the place of human annotation for fine-tuning purposes. We concatenate the answers from the 405B model for approximately 2,000 questions from the AQUA-RAT dataset, which becomes our training dataset. Our test dataset comes from extracted answers for approximately 4,000 more questions.
The training data is constructed as a JSON object as follows:
The instruction and response dataset are then used for distillation by fine-tuning the Llama 3 8B model in SageMaker JumpStart.
You can deploy Llama 3.1 models using SageMaker Jumpstart Model Cards. In addition to deploying Llama 3.1 Models, you can fine tune these models using Train tab located next to deploy button.
Fine-tune the Llama 3 8B model with the generated labels
For instructions on fine-tuning this model, refer to Fine-tune Code Llama on Amazon SageMaker JumpStart. We used a g5.12xlarge instance using the instruction fine-tuning option for five epochs, and observed that the loss function showed improvement both on the training and test sets.
When the fine-tuning was complete, we deployed the fine-tuned model for inference using SageMaker JumpStart on a g5.12xlarge instance (you can deploy on any other supported instance based on your throughput and latency requirements). Please refer to Meta’s End User License Agreement (EULA) for requirements when using synthetic-fine-tuned models for any use case.
Then we performed inference on another 1,000 unseen queries from the AQUA-RAT dataset. The following table shows a few of the queries for comparison.
Query | Answer | Llama 3 8B Responses Before Fine-Tuning | Llama 3 8B Responses After Fine-Tuning |
Tim has 350 pounds of cement in 100, 50, and 25 pound bags. He has an equal number of each size bag. How many bags of cement does Tim have? | 6 | We know there are 3 types of bags. 100, 50, and 25 pounds. He has an equal number of each. So if x is the number of each type of bag, then you know 100x + 50x + 25x = 350. This is because each type of bag has x number of bags, and there is 100 pounds of cement per bag in the first case, 50 in the second, and 25 in the third. 175x = 350. So if we divide both sides by 175 we get x = 2. The answer is 2 bags of each. | Let’s assume Tim has X 100 pound bags, X 50 pound bags, and X 25 pound bags. He has in total 100X + 50X + 25X = 175X pounds of cement. Since he has 350 pounds total, we know 175X = 350, so X = 2. That means he has 2 bags of each type, so in total he has 2 + 2 + 2 = 6 bags. The answer is 6. |
The purchase price of an article is $48. In order to include 30% of cost for overhead and to provide $12 of net profit, what should the markup be? | $55 | To solve this problem, we can start by identifying the cost of the article, which is the purchase price of $48. We are also given that the markup should be enough to include 30% of the cost for overhead and provide a net profit of $12. Let’s break this down step by step: 1. Calculate the cost of the article: $48 (purchase price) Therefore, the markup should be $74.40 – $48 = $26.40. This means that the selling price should be $26.40 more than the purchase price to include 30% of the cost for overhead and provide a net profit of $12. |
Step 1: Calculate the cost price of the article\nThe cost price of the article is $48.\n\n## Step 2: Calculate the overhead cost as a percentage of the cost price\nThe overhead cost is 30% of the cost price, which is 0.3 * $48 = $14.40.\n\n## Step 3: Calculate the selling price of the article\nThe selling price of the article is the cost price plus the overhead cost plus the net profit. The net profit is $12.\n\n## Step 4: Calculate the selling price\nThe selling price is $48 + $14.40 + $12 = $74.40.\n\n## Step 5: Calculate the markup as a percentage of the cost price\nThe markup is the difference between the selling price and the cost price, divided by the cost price. The markup is ($74.40 – $48) / $48 = $26.40 / $48 = 55%.\n\nThe final answer is: $55‘} |
The slant height of a cone is 35 cm and radius of the base is 14cm, find the curved surface of the cone. | 1540 | A nice problem! The slant height of a cone is the distance from the center of the base to the vertex of the cone. Let’s call this distance “l”. We are given that the slant height is 35 cm, so: l = 35 cm The radius of the base is given as 14 cm, so the diameter of the base is 28 cm. The formula to find the curved surface area of a cone is: Curved Surface Area = πr(l + r) where r is the radius of the base and l is the slant height. Plugging in the values, we get: Curved Surface Area = π(14)(35 + 14) = π(14)(49) = 686π Now, we can simplify this expression by multiplying the numbers: Curved Surface Area = 686 × 3.14 (approximately) = 2153.24 cm² So, the curved surface area of the cone is approximately 2153.24 cm². |
The formula for the curved surface area of a cone is: Curved Surface Area = πrl Where r is the radius of the base and l is the slant height. Given: r = 14 cm l = 35 cm Plugging in the values, we get: Curved Surface Area = π(14)(35) = 14π(35) = 490π = 490 × 3.14 = 1537.4 cm² Therefore, the curved surface area of the cone is approximately 1537.4 cm². |
The speed at which a man can row a boat in still water is 25 kmph. If he rows downstream, where the speed of current is 11 kmph, what time will he take to cover 80 metres? | 8 seconds | The speed of the man in still water is 25 kmph. The speed of the current is 11 kmph. So the speed downstream is 25 + 11 = 36 kmph. 80 meters is 0.02 km. So the time taken is 0.02 / 36 = 0.0056 hours. The answer is 5.6 minutes. | Speed of the boat in still water = 25 kmph Speed of the current = 11 kmph Speed downstream = 25 + 11 = 36 kmph Speed downstream in m/s = (36 * 5/18) = 10 m/s Distance = 80 m Time = Distance / Speed = 80 / 10 = 8 seconds. |
We can qualitatively see that the answers for the math problems and logical questions are improved after fine-tuning the model. We will perform additional quantitative metric evaluation in future posts. This exercise demonstrates how you can use the Llama 3.1 405B model to efficiently generate datasets in an accelerated fashion and then use those datasets to significantly improve the task-specific capabilities of smaller models.
Conclusion
In this post, we showed how you can use the new Llama 3.1 405B model to synthesize and generate data labels to improve the performance of a much smaller model through distillation (Llama 3 8B in this case). We also showed that the responses generated by the fine-tuned model are much improved compared to the model without fine-tuning. We also provided the code notebook that you can use to run and test the solution.
As a next step, we encourage you to use this idea along with the Llama-3.1 405B model in your use case to generate labels or even unlabeled data that can then be used by a smaller model downstream to help solve your use case.
About the Authors
Sebastian Bustillo is an Enterprise Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through cloud technologies and AI/ML. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and riding his MTB.
Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.
Dr. Natarajan Chennimalai Kumar is a Principal Solutions Architect in the 3rd Party Model Provider team at AWS, working closely with the Llama partner engineering team at Meta to enable AWS customers use Meta’s Llama models. He holds a PhD from University of Illinois at Urbana-Champaign. He is based in the Bay Area in California. Outside of work, he enjoys watching shows with his kids, playing tennis, and traveling with his family.
Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, writing blogs, hiking, spending time with his twin, and playing the guitar.
Dr. Nikita Ivkin is a Senior Applied Scientist for Amazon SageMaker. He focuses on inference acceleration for foundation models and scalable ML algorithms in general. His research interests are in the area of inference acceleration, streaming algorithms, and federated learning, with publishing in a variety of machine learning and computer science venues such as NeurIPS, ICML, ICLR, STOC, PODS, and others.
Supriya Puragundla is a Senior Solutions Architect at AWS. She has over 15 years of IT experience in software development, design, and architecture. She helps key customer accounts on their data, generative AI, and AI/ML journeys. She is passionate about data-driven AI and the area of depth in ML and generative AI.
Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, and KDD conferences, and Royal Statistical Society: Series A.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
Karl Albertsen leads the product management and partnership teams for Amazon SageMaker. He is focused on making AI accessible, cost-effective, and high-performing for business applications.
Christopher Whitten is an SDE with the SageMaker JumpStart team leading model onboarding and deeper integration with SageMaker services. Chris is passionate about accelerating the ubiquity of AI in practical business applications. His technical interests include agentic workflows and MLOps.
Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his master’s from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.
Evan Kravitz is a software engineer at Amazon Web Services, working on SageMaker JumpStart. He is interested in the confluence of machine learning with cloud computing. Evan received his undergraduate degree from Cornell University and master’s degree from the University of California, Berkeley. In 2021, he presented a paper on adversarial neural networks at the ICLR conference. In his free time, Evan enjoys cooking, traveling, and going on runs in New York City.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.