AWS Machine Learning Blog

Category: Amazon SageMaker HyperPod

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

In this post, we present to you an in-depth guide to starting a continual pre-training job using PyTorch Fully Sharded Data Parallel (FSDP) for Mistral AI’s Mathstral model with SageMaker HyperPod.

Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security

Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security

As generative AI moves from proofs of concept (POCs) to production, we’re seeing a massive shift in how businesses and consumers interact with data, information—and each other. In what we consider “Act 1” of the generative AI story, we saw previously unimaginable amounts of data and compute create models that showcase the power of generative […]

Scaling Thomson Reuters’ language model research with Amazon SageMaker HyperPod

In this post, we explore the journey that Thomson Reuters took to enable cutting-edge research in training domain-adapted large language models (LLMs) using Amazon SageMaker HyperPod, an Amazon Web Services (AWS) feature focused on providing purpose-built infrastructure for distributed training at scale.

Introducing Amazon EKS support in Amazon SageMaker HyperPod

Introducing Amazon EKS support in Amazon SageMaker HyperPod

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

Integrate HyperPod clusters with Active Directory for seamless multi-user login

Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption. Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, […]