This Guidance helps customers who have on-premises restrictions or existing Kubernetes investments use either Amazon Elastic Kubernetes Service (Amazon EKS) with Kubeflow or Amazon SageMaker to implement a hybrid, distributed machine learning (ML) training architecture. Kubernetes is a widely adopted system for automating infrastructure deployment, resource scaling, and management of containerized applications. The open-source community developed a layer on top of Kubernetes called Kubeflow, which aims to make the deployment of end-to-end ML workflows on Kubernetes simple, portable, and scalable. Because this architecture lets you choose between the two approaches at runtime, customers retain maximum control over their ML deployments. They can continue using open-source libraries in their deep learning training script while keeping it compatible with both Kubernetes and SageMaker.
Architecture Diagram
Step 1
Deploy Kubeflow to Amazon Elastic Kubernetes Service (Amazon EKS) and access Jupyter notebooks from the Kubeflow Central Dashboard. Kubernetes provides a command line tool, kubectl, for communicating with a Kubernetes cluster's control plane through the Kubernetes API.
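For example, once Kubeflow is deployed you can reach the same Kubernetes API that kubectl uses from Python with the official Kubernetes client. The following sketch is illustrative only (it is not part of the Guidance sample code) and assumes the default kubeflow namespace and a kubeconfig already pointing at your Amazon EKS cluster.

# Verify that the Kubeflow components are running by listing the pods in the
# "kubeflow" namespace through the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()   # reads the same kubeconfig that kubectl uses
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="kubeflow").items:
    print(pod.metadata.name, pod.status.phase)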
Step 2
Use the Kubeflow Pipelines software development kit (SDK) to compile Python functions into workflow resources and to create Kubeflow pipelines.
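The sketch below illustrates this step with the v1 Kubeflow Pipelines SDK (the v2 SDK uses the @dsl.component decorator instead of create_component_from_func). The pipeline, component, and Amazon S3 names are placeholders; the real pipeline steps live in the Guidance sample code.

import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def preprocess(dataset_path: str) -> str:
    # Placeholder step; the actual preprocessing logic lives in the sample code.
    return dataset_path

# Turn the Python function into a reusable pipeline component.
preprocess_op = create_component_from_func(preprocess, base_image="python:3.9")

@dsl.pipeline(name="hybrid-distributed-training",
              description="Example pipeline skeleton")
def training_pipeline(dataset_path: str = "s3://my-example-bucket/data"):
    preprocess_task = preprocess_op(dataset_path=dataset_path)

# Compile the Python pipeline definition into a workflow package that the
# Kubeflow Pipelines service can run.
kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")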
Step 3
Use the Kubeflow Pipelines SDK client to call the pipeline service endpoint and run the pipeline.
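A minimal sketch of this step, assuming the pipeline package compiled above; the endpoint is a placeholder, and how you authenticate to the pipeline service depends on your Kubeflow installation.

import kfp

# Placeholder endpoint; you might port-forward the Kubeflow Pipelines UI
# service or use its in-cluster address instead.
client = kfp.Client(host="http://localhost:8080")

run = client.create_run_from_pipeline_package(
    pipeline_file="training_pipeline.yaml",
    arguments={"dataset_path": "s3://my-example-bucket/data"},
    run_name="hybrid-training-run",
)
print(run.run_id)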
Step 4
The pipeline evaluates the conditional runtime variables and decides between Amazon SageMaker or Kubernetes as the target run environment.
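The structural sketch below shows one way to express this decision with dsl.Condition in the v1 Kubeflow Pipelines SDK. The training_target parameter name is a placeholder, and the component invocations inside each branch are omitted (see Step 5).

from kfp import dsl

@dsl.pipeline(name="hybrid-distributed-training")
def training_pipeline(training_target: str = "kubernetes"):
    with dsl.Condition(training_target == "sagemaker"):
        # Submit the training job through the SageMaker component.
        pass
    with dsl.Condition(training_target == "kubernetes"):
        # Launch a distributed PyTorchJob on the Amazon EKS cluster.
        pass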
Step 5
Use the Kubeflow PyTorch Operator to run distributed training on the Kubernetes cluster, or use the SageMaker component to submit the training on the SageMaker managed platform.
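As a hedged illustration of what the SageMaker branch amounts to, the sketch below submits a training script as a managed PyTorch training job with the SageMaker Python SDK. The role ARN, bucket name, and instance settings are placeholders, and the Guidance itself invokes SageMaker through the SageMaker Components for Kubeflow Pipelines rather than calling the SDK directly.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",       # the same script can serve the PyTorchJob path
    source_dir="src",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    framework_version="1.12",
    py_version="py38",
    instance_count=2,             # two instances; the script sets up distributed training
    instance_type="ml.p3.2xlarge",
)

estimator.fit({"training": "s3://my-example-bucket/train"})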
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
To support scalable simulation and key performance indicator (KPI) calculation models, use Amazon EKS and Amazon QuickSight.
Security
Resources are stored in a virtual private cloud (VPC), which provides a logically isolated network. You can grant access to these resources using AWS Identity and Access Management (IAM) roles that grant least privilege, or the minimum number of permissions required to complete a task.
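As an illustration of least privilege (not a policy taken from the Guidance), the sketch below uses boto3 to create a policy that scopes training-data access to a single, hypothetical S3 bucket instead of granting broad S3 permissions.

import json
import boto3

# Allow read/write access only to the example training bucket.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-example-training-bucket",
                "arn:aws:s3:::my-example-training-bucket/*",
            ],
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="training-data-least-privilege",   # placeholder name
    PolicyDocument=json.dumps(policy_document),
)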
Reliability
Kubeflow on AWS supports data pipeline orchestration through Kubeflow Pipelines.
Performance Efficiency
If you have on-premises restrictions or existing Kubernetes investments, you can use Amazon EKS and Kubeflow on AWS to implement an ML pipeline for distributed training or use a fully managed SageMaker solution for production-scale training infrastructure. These two options help you scale to meet workload requirements of the training environment.
Cost Optimization
We selected resource sizes and types based on resource characteristics and past workloads, so you pay only for the resources your workload needs.
Sustainability
SageMaker is designed to handle training clusters that scale up as needed and shut down automatically when jobs are complete. SageMaker also reduces the amount of infrastructure and operational overhead typically required to train deep learning models on hundreds of GPUs. Amazon Elastic File System (Amazon EFS) integration with the training clusters and the development environment allows you to share your code and processed training dataset, so you don’t have to rebuild the container image and reload large datasets after every code change.
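The sketch below shows one way this integration can look, assuming a SageMaker training job that mounts an existing Amazon EFS file system from within your VPC. The file system ID, subnet, security group, and role ARN are placeholders.

from sagemaker.inputs import FileSystemInput
from sagemaker.pytorch import PyTorch

# Read the processed dataset directly from EFS instead of copying it to the
# training instances or baking it into the container image on every run.
train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder EFS file system ID
    file_system_type="EFS",
    directory_path="/processed-dataset",
    file_system_access_mode="ro",
)

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    framework_version="1.12",
    py_version="py38",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    subnets=["subnet-0123456789abcdef0"],          # placeholders; EFS access
    security_group_ids=["sg-0123456789abcdef0"],   # requires VPC networking
)

estimator.fit({"training": train_input})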
Implementation Resources
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Build flexible and scalable distributed training architectures using Kubeflow on AWS and Amazon SageMaker
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.