Deploying managed P4d Instances in Amazon Elastic Kubernetes Service with NVIDIA GPUDirectRDMA

In March 2021, Amazon EKS announced support for Amazon EC2 P4d instances, enabling you to launch a fully managed EKS cluster based on the latest NVIDIA A100 GPUs. Amazon EC2 P4d instances are the next generation of GPU-based instances that provide the best performance for machine learning (ML) training and high-performance computing (HPC) in the cloud for applications such as natural language processing, object detection and classification, seismic analysis, and genomics research. This post walks you through deploying these instances in a managed EKS cluster.

Product overview:

Each p4d.24xlarge instance comes equipped with:

8x NVIDIA A100 GPUs
96 vCPUs
8x 1 TB of local NVMe storage
4x 100 Gbps accelerated networking with support for GPUDirectRDMA using Elastic Fabric Adapter (EFA)

A more thorough deep dive on the Amazon EC2 P4d instances is available here. Setting up the P4d instances with all the performance optimizations related to GPUDirectRDMA (GDRDMA) and the 400-Gbps networking requires manual steps. With a managed service layer such as Amazon EKS with managed node groups, this infrastructure setup is handled automatically, so you can focus on running highly scalable, distributed, accelerated workloads.

Requirements

Install and configure the following components in your local environment.

eksctl – You need version 0.43.0 or later of eksctl.
kubectl – This post uses Kubernetes version 1.19.
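
The version requirement above can be checked with a small shell helper. This is only a sketch: version_at_least is a hypothetical function (not part of eksctl), and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Hypothetical helper (not part of eksctl): succeeds when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
  have="$1"
  want="$2"
  # The lower of the two versions sorts first; it must be the required one.
  lowest="$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)"
  [ "$lowest" = "$want" ]
}

# Example with a literal version string and the 0.43.0 minimum from this post.
if version_at_least "0.43.0" "0.43.0"; then
  echo "eksctl version is new enough"
fi
```

In practice you would pass the version reported by your local eksctl as the first argument, assuming it is available as a bare version string such as 0.43.0.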

You must also set up your environment to authenticate and authorize running AWS Command Line Interface (AWS CLI) commands on your behalf. Install AWS CLI v2 and configure your access key ID and secret access key.

Deployment

Setting up the cluster is covered in the following steps. In this example, we walk through running the NVIDIA Collective Communication Library (NCCL) tests to validate utilization of GPUDirectRDMA over Elastic Fabric Adapter (EFA). The AWS samples GitHub repo for EFA on EKS has additional examples tailored to ML workloads.

Step 1: In your AWS Region, ensure at least one of the Availability Zones contains P4d instances. You can check availability with the following command:

aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=p4d.24xlarge \
  --region us-west-2 \
  --output table
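
If you only want the zone names, for example to copy into an eksctl config file, the same call can be wrapped in a function. The sketch below adds the AWS CLI --query and --output text options to the command above. The function names p4d_zones and zone_in_list are hypothetical and chosen for illustration.

```shell
# Hypothetical helper: print the Availability Zones in Region $1 that offer
# p4d.24xlarge. Same call as above, with --query keeping only the zone names.
p4d_zones() {
  region="$1"
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters Name=instance-type,Values=p4d.24xlarge \
    --region "$region" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# Hypothetical helper: succeed if zone $1 appears in the whitespace-separated list $2.
zone_in_list() {
  for z in $2; do
    [ "$z" = "$1" ] && return 0
  done
  return 1
}

# Usage (requires configured AWS credentials):
#   zone_in_list us-west-2c "$(p4d_zones us-west-2)" && echo "P4d offered in us-west-2c"
```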

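Step 2: Copy the following configuration into your editor, replace any values specific to your Region, and save it as p4d-managed-cluster.yaml, the file name used by the create command later in this step.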

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: p4d-cluster
  version: "1.19"
  region: us-west-2

availabilityZones: ["us-west-2b", "us-west-2c"]

iam:
  withOIDC: true
  
addons:
  - name: vpc-cni
    version: v1.7.10-eksbuild.1  

managedNodeGroups:
  - name: p4d-ng-2c
    instanceType: p4d.24xlarge
    instancePrefix: p4d-ng-2c-worker
    privateNetworking: true
    availabilityZones: ["us-west-2c"]
    efaEnabled: true
    minSize: 2
    desiredCapacity: 2
    maxSize: 4
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        cloudWatch: true

This eksctl config file creates a VPC, an EKS cluster, and a P4d managed node group. Notice the use of EKS add-ons to ensure your cluster is launched with at least VPC CNI version 1.7.10, which is a requirement for EFA traffic. The VPC is created with a private and a public subnet in each Availability Zone specified. By specifying private networking and a single Availability Zone in your managed node group, you ensure that your nodes are launched in a single subnet, which is a requirement for worker nodes to communicate over EFA. Note that you may need to request a limit increase for your EC2 On-Demand Instance limits; the default is 128 vCPUs for P series instances. This managed node group can require up to 384 vCPUs (4 p4d.24xlarge instances).
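
The vCPU arithmetic above can be sketched as a quick shell check. The numbers come from this post (96 vCPUs per p4d.24xlarge, a default limit of 128 vCPUs for P series instances). The variable and function names are illustrative only.

```shell
# Values taken from the text above; names are illustrative.
VCPUS_PER_P4D=96      # vCPUs per p4d.24xlarge
DEFAULT_P_LIMIT=128   # default On-Demand vCPU limit for P series, per this post

# Print the vCPUs needed for $1 p4d.24xlarge nodes.
required_vcpus() {
  echo $(( $1 * VCPUS_PER_P4D ))
}

# Compare a single node, desiredCapacity (2), and maxSize (4) against the limit.
for nodes in 1 2 4; do
  need="$(required_vcpus "$nodes")"
  if [ "$need" -gt "$DEFAULT_P_LIMIT" ]; then
    echo "$nodes node(s) need $need vCPUs: above the default limit of $DEFAULT_P_LIMIT"
  else
    echo "$nodes node(s) need $need vCPUs: within the default limit of $DEFAULT_P_LIMIT"
  fi
done
```

With these values, even the desired capacity of two nodes (192 vCPUs) exceeds the default limit, so the limit increase is needed before the node group can reach its desired size.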

If you have an existing VPC, see this example for how to create a node group with eksctl in a single subnet for an existing VPC. For an existing VPC, ensure that you have the correct networking topology for starting the P4d instances. As a best practice, launch your P4d instances in a private subnet, with a NAT Gateway routing to a public subnet with an Internet Gateway.

Now use the config file to create your cluster and node group:

eksctl create cluster -f p4d-managed-cluster.yaml

This command takes some time, as eksctl will be creating a cluster and P4d node group in sequential steps. In the logs of the eksctl bootstrap command, you should see a log entry confirming that the EFA device plugin was successfully applied.

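2021-04-02 15:10:38 [ℹ] created "kube-system:DaemonSet.apps/aws-efa-k8s-device-plugin-daemonset"
2021-04-02 15:10:38 [ℹ] as you have enabled EFA, the EFA device plugin was automatically installed

After the command completes, confirm that the P4d nodes have joined the cluster and are in the Ready state:

kubectl get nodes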

NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-57-3.us-west-2.compute.internal    Ready    <none>   36h   v1.19.6-eks-49a6c0
ip-10-0-72-21.us-west-2.compute.internal   Ready    <none>   36h   v1.19.6-eks-49a6c0

Step 3: Next, apply the NVIDIA Kubernetes device plugin. This example pins version v0.9.0.

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml

Describe one of the nodes by running kubectl describe node ip-10-0-57-3.us-west-2.compute.internal to see its resources, including the GPUs and EFA devices:

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         96
  ephemeral-storage:           83873772Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               10562Mi
  memory:                      1176329072Ki
  nvidia.com/gpu:              8
  pods:                        737
  vpc.amazonaws.com/efa:       4
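
To check these counts without reading the full output, you can filter the describe output with awk. This is a minimal sketch: resource_count is a hypothetical helper, and the sample text is an excerpt of the Capacity section shown above.

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and print
# the value listed for resource name $1 (first match only).
resource_count() {
  awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Excerpt of the Capacity section shown above, used here as sample input.
sample='Capacity:
  cpu:                         96
  nvidia.com/gpu:              8
  pods:                        737
  vpc.amazonaws.com/efa:       4'

echo "GPUs: $(printf '%s\n' "$sample" | resource_count nvidia.com/gpu)"
echo "EFA devices: $(printf '%s\n' "$sample" | resource_count vpc.amazonaws.com/efa)"

# Usage against a live node:
#   kubectl describe node ip-10-0-57-3.us-west-2.compute.internal | resource_count nvidia.com/gpu
```

Each p4d.24xlarge node is expected to report 8 GPUs and 4 EFA devices, matching the instance specification at the start of this post.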

By using eksctl and managed node groups, all the heavy lifting of configuring the infrastructure and networking for EFA with GDRDMA is handled automatically. This includes installing the EFA device plugin, which presents the EFA network devices to pods as allocatable resources via the vpc.amazonaws.com/efa Kubernetes extended resource. Additionally, with the efaEnabled flag, eksctl handles other EFA prerequisites, including creating an EFA-enabled security group, creating an EC2 placement group, and installing the EFA driver as part of EC2 user data. You can find more details on these steps in the EKS documentation. Next, let’s run the NCCL tests to validate our training job throughput.

Step 4: Example Benchmarking

With the base EKS cluster in place, you can then add the Kubeflow MPI Operator for your subsequent tests.

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1alpha2/mpi-operator.yaml

Next, clone the aws-samples/aws-efa-eks repo and apply the test configuration:

git clone https://github.com/aws-samples/aws-efa-eks
cd aws-efa-eks/examples
kubectl apply -f nccl-efa-tests.yaml

Check that the pods have started and are in the Running state, then view the launcher logs:

kubectl get pods
kubectl logs -l=mpi_role_type=launcher --tail=-1

In the logs, you can see NCCL call libfabric and use the underlying EFA devices and GPUDirectRDMA. Here is the expected output:

[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO NET/OFI Selected Provider is efa
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO Using network AWS Libfabric
[1,0]<stdout>:NCCL version 2.8.3+cuda11.2
...
[1,4]<stdout>:nccl-tests-efa-worker-0:30:70 [4] NCCL INFO Channel 06 : 13[901d0] -> 4[901c0] [receive] via NET/AWS Libfabric/2/GDRDMA
[1,12]<stdout>:nccl-tests-efa-worker-1:29:72 [4] NCCL INFO Channel 06 : 5[901d0] -> 12[901c0] [receive] via NET/AWS Libfabric/2/GDRDMA
[1,10]<stdout>:nccl-tests-efa-worker-1:27:65 [2] NCCL INFO Channel 05 : 3[201d0] -> 10[201c0] [receive] via NET/AWS Libfabric/1/GDRDMA
[1,2]<stdout>:nccl-tests-efa-worker-0:28:69 [2] NCCL INFO Channel 05 : 11[201d0] -> 2[201c0] [receive] via NET/AWS Libfabric/1/GDRDMA
[1,8]<stdout>:nccl-tests-efa-worker-1:25:69 [0] NCCL INFO Channel 04 : 1[101d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0/GDRDMA
[1,8]<stdout>:nccl-tests-efa-worker-1:25:69 [0] NCCL INFO Channel 00 : 8[101c0] -> 15[a01d0] via P2P/IPC/read
[1,0]<stdout>:nccl-tests-efa-worker-0:26:67 [0] NCCL INFO Channel 04 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0/GDRDMA
...
[1,0]<stdout>:#
[1,0]<stdout>:#                                                     out-of-place                       in-place          
[1,0]<stdout>:#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
[1,0]<stdout>:#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO Launch mode Parallel
[1,1]<stdout>:nccl-tests-efa-worker-0:27:72 [1] NCCL INFO comm 0x7fece4000dc0 rank 1 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
[1,0]<stdout>:           8             2   float     sum    167.5    0.00    0.00  2e-07    167.1    0.00    0.00  1e-07
[1,0]<stdout>:          16             4   float     sum    167.3    0.00    0.00  1e-07    167.4    0.00    0.00  1e-07
[1,0]<stdout>:          32             8   float     sum    167.9    0.00    0.00  1e-07    167.5    0.00    0.00  1e-07
[1,0]<stdout>:          64            16   float     sum    167.7    0.00    0.00  1e-07    167.7    0.00    0.00  6e-08
[1,0]<stdout>:         128            32   float     sum    168.0    0.00    0.00  6e-08    167.9    0.00    0.00  6e-08
[1,0]<stdout>:         256            64   float     sum    168.6    0.00    0.00  6e-08    168.9    0.00    0.00  6e-08
[1,0]<stdout>:         512           128   float     sum    374.7    0.00    0.00  6e-08    170.1    0.00    0.01  6e-08
[1,0]<stdout>:        1024           256   float     sum    182.5    0.01    0.01  5e-07    182.3    0.01    0.01  5e-07
[1,0]<stdout>:        2048           512   float     sum    205.0    0.01    0.02  5e-07    205.0    0.01    0.02  5e-07
[1,0]<stdout>:        4096          1024   float     sum    233.3    0.02    0.03  5e-07    234.4    0.02    0.03  5e-07
[1,0]<stdout>:        8192          2048   float     sum    250.5    0.03    0.06  5e-07    249.5    0.03    0.06  5e-07
[1,0]<stdout>:       16384          4096   float     sum    254.2    0.06    0.12  5e-07    253.9    0.06    0.12  5e-07
[1,0]<stdout>:       32768          8192   float     sum    260.1    0.13    0.24  5e-07    259.7    0.13    0.24  5e-07
[1,0]<stdout>:       65536         16384   float     sum    273.9    0.24    0.45  5e-07    273.8    0.24    0.45  5e-07
[1,0]<stdout>:      131072         32768   float     sum    294.2    0.45    0.84  5e-07    294.2    0.45    0.84  5e-07
[1,0]<stdout>:      262144         65536   float     sum    304.9    0.86    1.61  5e-07    305.5    0.86    1.61  5e-07
[1,0]<stdout>:      524288        131072   float     sum    409.7    1.28    2.40  5e-07    410.3    1.28    2.40  5e-07
[1,0]<stdout>:     1048576        262144   float     sum    483.5    2.17    4.07  5e-07    483.6    2.17    4.07  5e-07
[1,0]<stdout>:     2097152        524288   float     sum    660.3    3.18    5.95  5e-07    672.4    3.12    5.85  5e-07
[1,0]<stdout>:     4194304       1048576   float     sum    817.0    5.13    9.63  5e-07    817.0    5.13    9.63  5e-07
[1,0]<stdout>:     8388608       2097152   float     sum   1228.0    6.83   12.81  5e-07   1223.6    6.86   12.85  5e-07
[1,0]<stdout>:    16777216       4194304   float     sum   1895.5    8.85   16.60  5e-07   1900.9    8.83   16.55  5e-07
[1,0]<stdout>:    33554432       8388608   float     sum   3106.8   10.80   20.25  5e-07   3104.1   10.81   20.27  5e-07
[1,0]<stdout>:    67108864      16777216   float     sum   5567.2   12.05   22.60  5e-07   5566.4   12.06   22.61  5e-07
[1,0]<stdout>:   134217728      33554432   float     sum   9388.6   14.30   26.80  5e-07   9343.3   14.37   26.93  5e-07
[1,0]<stdout>:   268435456      67108864   float     sum    16865   15.92   29.84  5e-07    16853   15.93   29.86  5e-07
[1,0]<stdout>:   536870912     134217728   float     sum    32206   16.67   31.26  5e-07    32151   16.70   31.31  5e-07
[1,0]<stdout>:  1073741824     268435456   float     sum    61556   17.44   32.71  5e-07    61303   17.52   32.84  5e-07
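
To read the algbw and busbw columns, it helps to reproduce one row by hand. The sketch below assumes the nccl-tests definitions for all-reduce: algorithm bandwidth is message size divided by time, and bus bandwidth is algorithm bandwidth multiplied by 2(n-1)/n for n ranks. The function name allreduce_bw is hypothetical.

```shell
# Illustrative helper, assuming the nccl-tests all-reduce definitions:
#   algbw = bytes / time
#   busbw = algbw * 2 * (n - 1) / n
allreduce_bw() {
  # $1 = message size in bytes, $2 = time in microseconds, $3 = number of ranks (GPUs)
  awk -v bytes="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = bytes / us / 1000          # bytes per microsecond -> GB/s
    busbw = algbw * 2 * (n - 1) / n
    printf "algbw=%.2f busbw=%.2f\n", algbw, busbw
  }'
}

# Last out-of-place row above: 1073741824 bytes in 61556 us across 16 GPUs.
allreduce_bw 1073741824 61556 16
```

For that row, the function prints algbw=17.44 busbw=32.71, which matches the out-of-place columns in the log.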

Step 5: Cleanup

To clean up the environment, you can delete the entire cluster and node group with the following command:

eksctl delete cluster --name p4d-cluster --region us-west-2 --wait

Conclusion

In this post, you learned how to get started with deploying machine learning applications that take full advantage of P4d instances on EKS. By using eksctl with managed node groups, all of the infrastructure setup required for managed elastic scaling of P4d instances with GPUDirectRDMA over EFA is completely automated. You saw how the NCCL tests ran an all-reduce job across all 16 GPUs in the two-node cluster and reported the resulting network bandwidth. At AWS, we have already seen several EKS customers move to P4d and reduce their time to complete distributed ML training by nearly 50%, and we are excited to see what kind of improvements you will experience, in addition to the new types of machine learning this capability unlocks. As always, feel free to leave feedback and comments on either the AWS sample repository or the AWS Containers roadmap.
