AWS Cloud Operations Blog

Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights

As machine learning models grow more advanced, they require extensive computing power to train efficiently. Many organizations are turning to GPU-accelerated Kubernetes clusters for both model training and online inference. However, properly monitoring GPU usage is critical for machine learning engineers and cluster administrators to understand model performance and optimize infrastructure utilization. Without visibility into how models utilize GPU resources over time, it is difficult to optimize cluster utilization, troubleshoot anomalies, and ensure models are training as quickly as possible. Machine learning experts need an easy-to-use observability solution to monitor GPUs and correlate metric patterns with model behavior and infrastructure utilization.

For workloads that require distributed training, Elastic Fabric Adapter (EFA) metrics become as important as individual node performance, because understanding inter-node communication during distributed model training is another aspect of validating model performance and infrastructure health.

Historically, customers needed to manually install multiple agents, such as the NVIDIA DCGM exporter for GPU metrics, and depend on custom-built Prometheus Node Exporters for EFA metrics. Additionally, they had to build custom dashboards and alarms to visualize and monitor these metrics. To address the challenges of monitoring GPUs on Amazon Elastic Kubernetes Service (Amazon EKS), along with the performance of inter-node communication over EFAs, Amazon CloudWatch has extended Container Insights for Amazon EKS with accelerated compute observability, including support for NVIDIA GPUs and EFAs.

Container Insights for Amazon EKS deploys and manages the lifecycle of the NVIDIA DCGM exporter, which collects GPU metrics from NVIDIA’s drivers and exposes them to CloudWatch. Once onboarded to Container Insights, CloudWatch automatically detects NVIDIA GPUs in your environment, collects critical health and performance metrics for them as CloudWatch metrics, and makes them available on curated out-of-the-box dashboards. You can also set up CloudWatch alarms and create additional CloudWatch dashboards for the metrics available under the “ContainerInsights” namespace. It gathers performance metrics such as GPU temperature, GPU utilization, GPU memory utilization, and more. A complete list of metrics can be found in the user guide.
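For example, because the metrics land in the standard “ContainerInsights” namespace, you can alarm on them with the usual CloudWatch tooling. The following sketch creates a hypothetical alarm on a cluster-level GPU temperature metric; the metric name (node_gpu_temperature), dimension, and threshold are assumptions that you should confirm against the metrics list in the user guide.

# Hypothetical alarm: fire when average GPU temperature in the cluster stays above 80C.
# Confirm the metric and dimension names against the Container Insights metrics reference.
aws cloudwatch put-metric-alarm \
  --alarm-name demo-gpu-temperature-high \
  --namespace ContainerInsights \
  --metric-name node_gpu_temperature \
  --dimensions Name=ClusterName,Value=<YOUR CLUSTER NAME> \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching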

Container Insights for Amazon EKS leverages file system counter metrics to gather and publish Elastic Fabric Adapter (EFA) metrics to CloudWatch. Using EFA metrics, you can understand the traffic impact on tasks running on your EKS clusters and monitor your latency-sensitive training jobs. It gathers performance metrics such as received bytes, transmitted bytes, Remote Direct Memory Access (RDMA) throughput, number of dropped packets, and more. A complete list of metrics can be found in the user guide.
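Once these metrics are being published, a quick way to confirm what is available is to list them from the CLI. The filter below is only an illustrative sketch; it assumes the EFA metric names contain “efa” (for example, node_efa_rx_bytes), which you should verify against the user guide.

# List EFA-related metrics published for your cluster under the ContainerInsights namespace.
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=<YOUR CLUSTER NAME> \
  --query "Metrics[?contains(MetricName, 'efa')].MetricName" \
  --output text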

In this post, we’ll explore how to use Container Insights with enhanced observability for Amazon EKS to quickly gain insights into GPUs and EFAs on your EKS clusters.

Solution Overview

You can enable Container Insights for Amazon EKS either through manual installation using the quick start setup for an Amazon EKS cluster or by installing the Amazon CloudWatch Observability EKS add-on, which is the recommended method.

In this example, you will see how to set up an Amazon EKS demo cluster with the CloudWatch Observability EKS add-on on a supported NVIDIA GPU-backed, EFA-capable instance type. Furthermore, we will see how Container Insights with enhanced observability for EKS provides a unified view of cluster health, infrastructure metrics, and the GPU/EFA metrics required to optimize machine learning workloads.

Following are the components we are going to deploy in this solution:

Figure 1: Container Insights with enhanced observability for EKS gathering GPU and EFA metrics for NVIDIA instances.

Prerequisites

You will need the following to complete the steps in this post: an AWS account with permissions to create the resources described, plus the AWS CLI, eksctl, kubectl, jq, and Docker installed and configured.

Environment setup

  1. Provide the AWS Region (aa-example-1) along with two AWS Availability Zones (aa-example-1a, aa-example-1b) available in the AWS Region where you would like to deploy your EKS cluster. Run the following commands in your terminal.
export AWS_REGION=<YOUR AWS REGION> 
#export AWS_REGION=aa-example-1
export AWS_ZONE1=<YOUR AWS AVAILABILITY ZONE1> 
#export AWS_ZONE1=aa-example-1a
export AWS_ZONE2=<YOUR AWS AVAILABILITY ZONE2> 
#export AWS_ZONE2=aa-example-1b

2. Provide the name of the EKS Cluster in place of “YOUR CLUSTER NAME” below.

export CLUSTER_NAME=<YOUR CLUSTER NAME> 
#export CLUSTER_NAME=DEMO_CLUSTER

To set up a GPU-based Amazon EKS cluster, select EC2 nodes that support GPUs. You can find the list of instance types that support GPUs at GPU-based Amazon EC2 instances and supported EFA instance types.
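If you prefer to check programmatically, the following sketch queries the EC2 DescribeInstanceTypes API for instance types that report both NVIDIA GPUs and EFA support; treat the AWS documentation pages above as the authoritative source.

# List instance types in the current Region that advertise NVIDIA GPUs and EFA support.
aws ec2 describe-instance-types \
  --filters "Name=network-info.efa-supported,Values=true" \
  --query "InstanceTypes[?GpuInfo.Gpus[?Manufacturer=='NVIDIA']].InstanceType" \
  --output text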

3. For the demonstration, we have selected g4dn.8xlarge, which is an NVIDIA GPU-backed instance type with EFA support.

export NODE_TYPE=<YOUR NODE TYPE> 
#export NODE_TYPE=g4dn.8xlarge

4. Create a configuration file for the Amazon EKS cluster by executing the command below.

cat << EOF > ./efa-gpu-cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: "$CLUSTER_NAME"
  region: "$AWS_REGION"
  version: "1.29"

iam:
  withOIDC: true

availabilityZones: ["$AWS_ZONE1","$AWS_ZONE2"]  

managedNodeGroups:
  - name: my-efa-ng
    instanceType: "$NODE_TYPE"
    minSize: 1
    desiredCapacity: 2
    maxSize: 3
    availabilityZones: ["$AWS_ZONE1"]
    volumeSize: 300
    privateNetworking: true
    efaEnabled: true
EOF

5. Now create the Amazon EKS cluster using the configuration file you just created.

eksctl create cluster -f efa-gpu-cluster.yaml

6. Verify that you are connected to the cluster. You should see two nodes of type g4dn.8xlarge listed when you execute the following command:

kubectl get nodes -L node.kubernetes.io/instance-type

If not, execute the following to connect to the cluster:

aws eks update-kubeconfig --name $CLUSTER_NAME --region $AWS_REGION

7. Install the EFA device plugin so that pods can access EFA devices.

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml
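Optionally, you can confirm that the plugin has registered EFA devices as an allocatable resource on your nodes. This check assumes the plugin advertises the vpc.amazonaws.com/efa resource name, which is the same resource requested later in the EFA test DaemonSet.

# Show the number of allocatable EFA devices per node (an empty value means none registered yet).
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"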

8. Store the name of the CloudFormation stack that was created for the EKS cluster’s node group in a variable.

STACK_NAME=$(eksctl get nodegroup --cluster $CLUSTER_NAME -o json | jq -r '.[].StackName')

9. Retrieve into the ROLE_NAME variable the IAM role that CloudFormation created automatically for the nodes; you will add permissions to it so that metrics and logs can be stored in CloudWatch.

ROLE_NAME=$(aws cloudformation describe-stack-resources --stack-name $STACK_NAME | jq -r '.StackResources[] | select(.ResourceType=="AWS::IAM::Role") | .PhysicalResourceId')

10. Attach the “CloudWatchAgentServerPolicy” managed policy to the Amazon EKS node role.

aws iam attach-role-policy --role-name $ROLE_NAME --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
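As a quick sanity check, you can list the managed policies now attached to the node role:

aws iam list-attached-role-policies --role-name $ROLE_NAME \
  --query "AttachedPolicies[].PolicyName" --output text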

11. Install the CloudWatch Observability EKS add-on on the Amazon EKS cluster.

aws eks create-addon --addon-name amazon-cloudwatch-observability --cluster-name $CLUSTER_NAME --region $AWS_REGION | jq '.addon.status'

12. Verify that the CloudWatch Observability EKS add-on for the Amazon EKS cluster is created and active. You should see the status “ACTIVE”.

aws eks describe-addon --addon-name amazon-cloudwatch-observability --cluster-name $CLUSTER_NAME --region $AWS_REGION | jq '.addon.status'
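You can also confirm from inside the cluster that the add-on components are running. Assuming the default configuration, they are deployed into the amazon-cloudwatch namespace; on GPU nodes this typically includes a DCGM exporter pod alongside the CloudWatch agent and Fluent Bit, although pod names can vary by add-on version.

kubectl get pods -n amazon-cloudwatch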

GPU Observability test case

Now that you have deployed the Amazon EKS cluster with GPU nodes, let’s generate GPU load using the “gpuburn” utility.

1. Generate GPU load using the gpuburn utility by applying the following deployment manifest:

cat << EOF > ./gpuburn-deployment.yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: gpuburn
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpuburn
  template:
    metadata:
      labels:
        app: gpuburn
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: "$NODE_TYPE"
      containers:
      - name: main
        image: "iankoulski/gpuburn"
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          runAsUser: 1000
        command: ["bash", "-c", "while true; do /app/gpu_burn 300; sleep 300; done"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF

2. Deploy the “gpuburn” utility in the EKS Cluster.

kubectl apply -f ./gpuburn-deployment.yaml

3. Check the status of your gpuburn pods and wait until they enter the Running state:

kubectl get pods

The output should look similar to the following:

gpuburn pods running in an EKS cluster

This deployment raises GPU utilization to 100% for 5 minutes, then cools off for 5 minutes, and repeats indefinitely until the deployment is deleted.
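After the load generator has completed a few cycles, you can also see this sawtooth pattern directly from the CLI. The following is a minimal sketch that assumes a cluster-level node_gpu_utilization metric with a ClusterName dimension; confirm the exact names in the Container Insights metrics reference (the start time uses GNU date syntax).

# Average GPU utilization for the cluster over the last hour, in 5-minute periods.
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_gpu_utilization \
  --dimensions Name=ClusterName,Value=$CLUSTER_NAME \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"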

EFA Observability test case

To produce EFA network traffic, you will build a simple container image that has the EFA software and uses a test utility that is included in the EFA installer.

1. Create the Dockerfile for creating the container image

cat << EOF > ./Dockerfile
FROM ubuntu:20.04
ARG EFA_INSTALLER_VERSION=1.30.0
RUN apt update && apt install curl -y
# Install EFA
RUN cd \$HOME \
    && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-\${EFA_INSTALLER_VERSION}.tar.gz \
    && tar -xf \$HOME/aws-efa-installer-\${EFA_INSTALLER_VERSION}.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify
# Setup test
RUN cp \$HOME/aws-efa-installer/efa_test.sh /efa_test.sh && sed -i -e 's/-e rdm -p efa/-e rdm -p efa -I 10000 -S 8388608/g' /efa_test.sh
RUN useradd -m -d /home/ubuntu ubuntu
USER ubuntu
CMD /bin/sh -c ". /etc/profile.d/zippy_efa.sh && while true; do /efa_test.sh; done"
EOF

2. Now build and push the container image to an Amazon Elastic Container Registry (Amazon ECR) repository that you will create with the commands below.

export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/
export IMAGE=efaburn
docker build -t ${REGISTRY}${IMAGE} .
aws ecr create-repository --repository-name $IMAGE --region $AWS_REGION
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REGISTRY
docker image push ${REGISTRY}${IMAGE}

3. Create a DaemonSet YAML configuration file to deploy the container image

cat << EOF > ./efaburn-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efaburn
spec:
  selector:
    matchLabels:
      name: efaburn
  template:
    metadata:
      labels:
        name: efaburn
    spec:
      containers:
      - name: efaburn
        image: "$REGISTRY$IMAGE"
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          runAsUser: 1000
        resources:
          requests:
            vpc.amazonaws.com/efa: 1
            hugepages-2Mi: 5120Mi
            memory: 8000Mi
          limits:
            vpc.amazonaws.com/efa: 1
            hugepages-2Mi: 5120Mi
            memory: 8000Mi
EOF

4. Deploy the container image to your cluster as a DaemonSet:

kubectl apply -f ./efaburn-daemonset.yaml

This runs the efa_test.sh utility against one EFA adapter on each node in your cluster, which should utilize the full bandwidth supported by a single EFA.

5. Verify that your DaemonSet is deployed successfully:

kubectl get daemonset

6. You should see the efaburn DaemonSet with two READY pods.

kubectl output
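You can also tail the pod logs to confirm that the bandwidth test is looping; the selector below matches the name=efaburn label defined in the DaemonSet manifest.

kubectl logs -l name=efaburn --tail=20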

Container Insights dashboards

Container Insights additionally provides out-of-the-box dashboards where you can analyze aggregated metrics at the cluster, namespace, and service levels. More importantly, it delivers drill-down capabilities that provide insights at the node, pod, container, and GPU device levels. This enables machine learning experts to identify bottlenecks throughout the stack. With highly granular visualizations of metrics like memory usage and utilization, you can quickly pinpoint issues, whether they lie in a certain node, pod, or even a specific GPU.

Navigate to the CloudWatch console, expand “Insights”, and select “Container Insights”. This opens a landing page where you can see a performance and status summary of GPUs across your EKS clusters. From there, you can slice and dice GPU performance to see the top 10 clusters, nodes, workloads, pods, and containers running in your AWS account, as shown below.

CloudWatch Container Insights Dashboard – Top 10 Utilization

Figure 2: CloudWatch Container Insights Dashboard – Top 10 Utilization
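Beyond the curated dashboards, you can run ad hoc queries against the same metrics with CloudWatch Metrics Insights in the console query editor. The query below is an illustrative sketch that ranks nodes by average GPU utilization; the metric and dimension names are assumptions based on the ContainerInsights namespace and should be verified against the metrics reference (replace DEMO_CLUSTER with your cluster name).

SELECT AVG(node_gpu_utilization)
FROM "ContainerInsights"
WHERE ClusterName = 'DEMO_CLUSTER'
GROUP BY NodeName
ORDER BY AVG() DESC
LIMIT 10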

You can select the View performance dashboards link in the top right-hand corner to access the detailed dashboard view, where your accelerated compute telemetry is available out of the box, as shown below.

CloudWatch Container Insights Dashboard – Cluster Level Performance View
Figure 3: CloudWatch Container Insights Dashboard – Cluster Level Performance View

You can either use the hierarchy map to drill down or click on graph labels to view container-level dashboards and get metrics aggregated by container or GPU device. This is useful for instance types with multiple GPU devices, allowing you to see the utilization of each GPU and understand the extent to which you are using your hardware. With this visibility, you can carefully tune workload placement across GPUs to balance resource usage and remove the guesswork around how container scheduling will impact per-GPU performance.

The dashboard below shows an aggregated performance view at the container level and the utilization of GPUs at the container and pod level.

CloudWatch Container Insights Dashboard GPU Metrics – Pod Level, Container Level

Figure 4: CloudWatch Container Insights Dashboard GPU Metrics – Pod Level, Container Level

You can also aggregate the GPU metrics at the GPU device level, which provides an overview of how each GPU device is performing, as shown below.

CloudWatch Container Insights Dashboard GPU Metrics aggregated by GPU Device.

Figure 5: CloudWatch Container Insights Dashboard GPU Metrics aggregated by GPU Device.

You can visualize the EFA metrics at the node, pod, and container level as part of the CloudWatch Container Insights dashboard by selecting the respective radio buttons, as shown in the following diagram.

CloudWatch Container Insights Dashboard EFA Metrics – Node Level

Figure 6: CloudWatch Container Insights Dashboard EFA Metrics – Node Level

Container Insights enables you to easily monitor the efficiency of resource consumption by your distributed deep learning and inference algorithms, so that you can optimize resource allocation and minimize long disruptions in your applications. With Container Insights, you now have detailed observability of your accelerated compute environment with automatic, out-of-the-box visualizations.

Clean up

You can tear down the whole stack using the command below:

eksctl delete cluster $CLUSTER_NAME
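Note that deleting the cluster does not remove the Amazon ECR repository created for the efaburn image, nor any CloudWatch alarms you may have created while experimenting (such as the hypothetical example earlier in this post). You can remove them separately:

aws ecr delete-repository --repository-name $IMAGE --region $AWS_REGION --force
aws cloudwatch delete-alarms --alarm-names demo-gpu-temperature-high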

Conclusion

In this blog post, we showed how to set up robust observability for GPU workloads running in an accelerated compute environment deployed on an Amazon EKS cluster, using Amazon EC2 instances featuring NVIDIA GPUs and EFAs. We also looked at the dashboards and drilled down through the different layers to understand the performance of GPUs and EFAs at the cluster, pod, container, and GPU device levels.

For more information, see the following references:

About the authors:

Phani Kumar Lingamallu

Phani Kumar Lingamallu is a Senior Solutions Architect with Amazon Web Services. He works with AWS partners to build solutions, provides them with architectural guidance for building scalable architectures, and helps them implement strategies to drive adoption of AWS services. He is a technology enthusiast and an author of the AWS Observability Handbook, with core areas of interest in cloud operations and observability.

Alex Iankoulski

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect for Self-managed Machine Learning at AWS. In his role he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source [Do framework](https://bit.ly/do-framework) and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on combating climate change, democratizing AI and ML, making travel safer, healthcare better, and energy smarter.

Michael O’Neill

Michael O’Neill is a Senior Software Development Engineer working on CloudWatch in the Amazon Monitoring & Observability organization in AWS. Michael loves graphs and is passionate about instrumenting software with observability solutions to reveal operational and technical insights.

Omur Kirikci

Omur Kirikci is a Principal Product Manager for Amazon CloudWatch based in Seattle, US. He is passionate about creating new products and looks for ideas from everywhere in order to deliver solutions with the right quality in a timely fashion. Before he joined AWS, Omur spent more than 15 years in product management, program management, go-to-market strategy, and product development. Outside of work, he enjoys being outdoors and hiking, spending time with his family, tasting different cuisines, and watching soccer with friends.