This Guidance helps customers who have on-premises restrictions or existing Kubernetes investments use either Amazon Elastic Kubernetes Service (Amazon EKS) with Kubeflow or Amazon SageMaker to implement a hybrid, distributed machine learning (ML) training architecture. Kubernetes is a widely adopted system for automating infrastructure deployment, resource scaling, and management of containerized applications. The open-source community developed a layer on top of Kubernetes called Kubeflow, which aims to make the deployment of end-to-end ML workflows on Kubernetes simple, portable, and scalable. Because this architecture lets you choose between the two approaches at runtime, customers retain maximum control over their ML deployments. They can continue using open-source libraries in their deep learning training script while keeping it compatible with both Kubernetes and SageMaker.
Architecture Diagram
Step 1
Deploy Kubeflow to Amazon Elastic Kubernetes Service (Amazon EKS) and access Jupyter notebooks from the Kubeflow Central Dashboard. Kubernetes provides a command line tool, kubectl, for communicating with a Kubernetes cluster's control plane through the Kubernetes API.
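For example, once Kubeflow is deployed you can reach the same Kubernetes API that kubectl uses from Python with the official Kubernetes client. The following sketch is illustrative only (it is not part of the Guidance sample code) and assumes the default kubeflow namespace and a kubeconfig already pointing at your Amazon EKS cluster.

# Verify that the Kubeflow components are running by listing the pods in the
# "kubeflow" namespace through the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()   # reads the same kubeconfig that kubectl uses
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="kubeflow").items:
    print(pod.metadata.name, pod.status.phase)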
Step 2
Use the Kubeflow Pipelines software development kit (SDK) to compile Python functions into workflow resources and to create Kubeflow pipelines.
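The sketch below illustrates this step with the v1 Kubeflow Pipelines SDK (the v2 SDK uses the @dsl.component decorator instead of create_component_from_func). The pipeline, component, and Amazon S3 names are placeholders; the real pipeline steps live in the Guidance sample code.

import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def preprocess(dataset_path: str) -> str:
    # Placeholder step; the actual preprocessing logic lives in the sample code.
    return dataset_path

# Turn the Python function into a reusable pipeline component.
preprocess_op = create_component_from_func(preprocess, base_image="python:3.9")

@dsl.pipeline(name="hybrid-distributed-training",
              description="Example pipeline skeleton")
def training_pipeline(dataset_path: str = "s3://my-example-bucket/data"):
    preprocess_task = preprocess_op(dataset_path=dataset_path)

# Compile the Python pipeline definition into a workflow package that the
# Kubeflow Pipelines service can run.
kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")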
Step 3
Use the Kubeflow Pipelines SDK client to call the pipeline service endpoint and run the pipeline.
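A minimal sketch of this step, assuming the pipeline package compiled above; the endpoint is a placeholder, and how you authenticate to the pipeline service depends on your Kubeflow installation.

import kfp

# Placeholder endpoint; you might port-forward the Kubeflow Pipelines UI
# service or use its in-cluster address instead.
client = kfp.Client(host="http://localhost:8080")

run = client.create_run_from_pipeline_package(
    pipeline_file="training_pipeline.yaml",
    arguments={"dataset_path": "s3://my-example-bucket/data"},
    run_name="hybrid-training-run",
)
print(run.run_id)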
Step 4
The pipeline evaluates the conditional runtime variables and decides between Amazon SageMaker or Kubernetes as the target run environment.
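The structural sketch below shows one way to express this decision with dsl.Condition in the v1 Kubeflow Pipelines SDK. The training_target parameter name is a placeholder, and the component invocations inside each branch are omitted (see Step 5).

from kfp import dsl

@dsl.pipeline(name="hybrid-distributed-training")
def training_pipeline(training_target: str = "kubernetes"):
    with dsl.Condition(training_target == "sagemaker"):
        # Submit the training job through the SageMaker component.
        pass
    with dsl.Condition(training_target == "kubernetes"):
        # Launch a distributed PyTorchJob on the Amazon EKS cluster.
        pass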
Step 5
Use the Kubeflow PyTorch Operator to run distributed training on the Kubernetes cluster, or use the SageMaker component to submit the training on the SageMaker managed platform.
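As a hedged illustration of what the SageMaker branch amounts to, the sketch below submits a training script as a managed PyTorch training job with the SageMaker Python SDK. The role ARN, bucket name, and instance settings are placeholders, and the Guidance itself invokes SageMaker through the SageMaker Components for Kubeflow Pipelines rather than calling the SDK directly.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",       # the same script can serve the PyTorchJob path
    source_dir="src",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    framework_version="1.12",
    py_version="py38",
    instance_count=2,             # two instances; the script sets up distributed training
    instance_type="ml.p3.2xlarge",
)

estimator.fit({"training": "s3://my-example-bucket/train"})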
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
To support scalable simulation and key performance indicator (KPI) calculation models, use Amazon EKS and Amazon QuickSight.
Security
Resources are stored in a virtual private cloud (VPC), which provides a logically isolated network. You can grant access to these resources using AWS Identity and Access Management (IAM) roles that grant least privilege, or the minimum number of permissions required to complete a task.
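As an illustration of least privilege (not a policy taken from the Guidance), the sketch below uses boto3 to create a policy that scopes training-data access to a single, hypothetical S3 bucket instead of granting broad S3 permissions.

import json
import boto3

# Allow read/write access only to the example training bucket.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-example-training-bucket",
                "arn:aws:s3:::my-example-training-bucket/*",
            ],
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="training-data-least-privilege",   # placeholder name
    PolicyDocument=json.dumps(policy_document),
)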
Reliability
Kubeflow on AWS supports data pipeline orchestration through Kubeflow Pipelines.
Performance Efficiency
If you have on-premises restrictions or existing Kubernetes investments, you can use Amazon EKS and Kubeflow on AWS to implement an ML pipeline for distributed training or use a fully managed SageMaker solution for production-scale training infrastructure. These two options help you scale to meet workload requirements of the training environment.
Cost Optimization
We selected resource sizes and types based on resource characteristics and past workloads, so you pay only for the resources your workload needs.
Sustainability
SageMaker is designed to handle training clusters that scale up as needed and shut down automatically when jobs are complete. SageMaker also reduces the amount of infrastructure and operational overhead typically required to train deep learning models on hundreds of GPUs. Amazon Elastic File System (Amazon EFS) integration with the training clusters and the development environment allows you to share your code and processed training dataset, so you don’t have to rebuild the container image and reload large datasets after every code change.
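The sketch below shows one way this integration can look, assuming a SageMaker training job that mounts an existing Amazon EFS file system from within your VPC. The file system ID, subnet, security group, and role ARN are placeholders.

from sagemaker.inputs import FileSystemInput
from sagemaker.pytorch import PyTorch

# Read the processed dataset directly from EFS instead of copying it to the
# training instances or baking it into the container image on every run.
train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder EFS file system ID
    file_system_type="EFS",
    directory_path="/processed-dataset",
    file_system_access_mode="ro",
)

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    framework_version="1.12",
    py_version="py38",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    subnets=["subnet-0123456789abcdef0"],          # placeholders; EFS access
    security_group_ids=["sg-0123456789abcdef0"],   # requires VPC networking
)

estimator.fit({"training": train_input})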
Implementation Resources
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Build flexible and scalable distributed training architectures using Kubeflow on AWS and Amazon SageMaker
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.