AWS Machine Learning Blog

Category: Amazon SageMaker HyperPod

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

Unleash AI innovation with Amazon SageMaker HyperPod

In this post, we show how SageMaker HyperPod, and its new features introduced at AWS re:Invent 2024, is designed to meet the demands of modern AI workloads, offering a persistent and optimized cluster tailored for distributed training and accelerated inference at cloud scale and attractive price-performance.

Customize DeepSeek-R1 distilled models using Amazon SageMaker HyperPod recipes – Part 1

In this two-part series, we discuss how you can reduce the DeepSeek model customization complexity by using the pre-built fine-tuning workflows (also called “recipes”) for both DeepSeek-R1 model and its distilled variations, released as part of Amazon SageMaker HyperPod recipes. In this first post, we will build a solution architecture for fine-tuning DeepSeek-R1 distilled models and demonstrate the approach by providing a step-by-step example on customizing the DeepSeek-R1 Distill Qwen 7b model using recipes, achieving an average of 25% on all the Rouge scores, with a maximum of 49% on Rouge 2 score with both SageMaker HyperPod and SageMaker training jobs. The second part of the series will focus on fine-tuning the DeepSeek-R1 671b model itself.

Best practices for Amazon SageMaker HyperPod task governance

In this post, we provide best practices to maximize the value of SageMaker HyperPod task governance and make the administration and data science experiences seamless. We also discuss common governance scenarios when administering and running generative AI development tasks.

PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

In this blog post, we showcase how you can perform efficient supervised fine tuning for a Meta Llama 3 model using PEFT on AWS Trainium with SageMaker HyperPod. We use HuggingFace’s Optimum-Neuron software development kit (SDK) to apply LoRA to fine-tuning jobs, and use SageMaker HyperPod as the primary compute cluster to perform distributed training on Trainium. Using LoRA supervised fine-tuning for Meta Llama 3 models, you can further reduce your cost to fine tune models by up to 50% and reduce the training time by 70%.

How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model

Fastweb, one of Italy’s leading telecommunications operators, recognized the immense potential of AI technologies early on and began investing in this area in 2019. In this post, we explore how Fastweb used cutting-edge AI and ML services to embark on their LLM journey, overcoming challenges and unlocking new opportunities along the way.

Solution overview

Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience

In this post, we explore a solution for implementing load balancing across login nodes in Slurm-based HyperPod clusters. By distributing user activity evenly across all available nodes, this approach provides more consistent performance, better resource utilization, and a smoother experience for all users. We guide you through the setup process, providing practical steps to achieve effective load balancing in your HyperPod clusters.

Speed up your cluster procurement time with Amazon SageMaker HyperPod training plans

In this post, we explore how Amazon SageMaker HyperPod training plans accelerate compute resource procurement for machine learning workloads. We guide you through a step-by-step implementation on how you can use the AWS CLI or the AWS Management Console to find, review, and create optimal training plans for your specific compute and timeline needs. We further guide you through using the training plan to submit SageMaker training jobs or create SageMaker HyperPod clusters.

Generative AI foundation model training on Amazon SageMaker

Generative AI foundation model training on Amazon SageMaker

In this post, we explore how organizations can cost-effectively customize and adapt FMs using AWS managed services such as Amazon SageMaker training jobs and Amazon SageMaker HyperPod. We discuss how these powerful tools enable organizations to optimize compute resources and reduce the complexity of model training and fine-tuning. We explore how you can make an informed decision about which Amazon SageMaker service is most applicable to your business needs and requirements.

Scalable training platform with Amazon SageMaker HyperPod for innovation: a video generation case study

In this post, we share an ML infrastructure architecture that uses SageMaker HyperPod to support research team innovation in video generation. We will discuss the advantages and pain points addressed by SageMaker HyperPod, provide a step-by-step setup guide, and demonstrate how to run a video generation algorithm on the cluster.