How HP achieved 40% improvement in Kubernetes node utilization on Amazon EKS using Karpenter

Introduction

This post was co-authored by Jon Lewis (SW R&D Director in HP), Gajanan Chandgadkar (Principal Cloud Operations Architect, HP), Rutvij Dave (Sr. Solutions Architect at AWS), Ratnopam Chakrabarti (Sr. Solutions Architect, Containers and Open-Source technologies at AWS), Apeksha Chauhan (Senior Technical Account Manager at AWS), and Chance Lee (Sr. Container Specialist Solutions Architect at AWS).

The HP PrintOS solution was launched in 2016. Since the launch, PrintOS has connected nearly 30,000 companies, 85,000 users, and 50,000 devices, creating the world’s largest industrial print network. HP’s industrial print portfolio includes the Indigo, Large Format, PageWide Industrial, Scitex, and Multi Jet Fusion (3D) product lines.

PrintOS is a user-facing product. It represents a platform and collection of applications focused on the following main pillars:

Device optimization
Automation
Business growth

The following diagram provides an overview of the PrintOS functional components.

Figure 1: Functional Overview

HP’s 3D business recently released a software application on Amazon Elastic Kubernetes Service (Amazon EKS) that uses artificial intelligence (AI) based volumetric simulations to compensate for 3D printed parts, necessitating substantial compute and memory resources. PrintOS needs a cost-effective solution for scaling EKS worker nodes dynamically.

PrintOS has created a workflow that combines Karpenter and Argo Workflows on Amazon EKS, resulting in automation, streamlined infrastructure management, and enhanced adaptability to workload fluctuations.

This post explains how HP PrintOS uses Karpenter in a GitOps workflow that resulted in a 40% improvement in EKS node utilization and cost savings. It also walks through the key configuration settings for Karpenter and Argo Workflows.

Scenario and challenges

The PrintOS use case involves the execution of Finite Element Analysis (FEA). FEA is a computational technique used for predicting the behavior of structures and materials under various conditions, such as thermal analysis, structural analysis, and fluid dynamics.

At a high level, the key stages of the application are as follows:

Input: The simulation needs 3D geometry as its input. Users provide parameters defining the physical domain to simulate, such as mesh specifications, boundary conditions, and material properties. Furthermore, numerical parameters are specified for the simulation algorithm.
Process: When the input parameters are defined, the simulation algorithm is run on cloud resources. FEA simulations involve solving complex mathematical equations to model the behavior of the system accurately.
Output: Upon completion, the simulation generates a deformed 3D geometry prediction, reflecting the response of the system under the specified conditions. Then, this output is presented to the user for analysis and decision-making.

Computational complexity: The FEA simulations done by the PrintOS 3D simulation application are computationally intensive, particularly when dealing with intricate geometries and complex physical domains. The computational cost increases with the level of detail in the input mesh and the complexity of the simulated physical phenomena.

A major challenge for the HP PrintOS team was to efficiently manage a diverse range of compute instances on Amazon EKS for 3D simulation. PrintOS uses a GitOps-based workflow implemented in Argo Workflows to facilitate on-demand execution of these 3D simulations, so that end users can initiate simulations when needed and get the output quickly. To achieve this, they needed a cluster auto scaling solution that enables just-in-time and right-sized scaling of worker nodes, can pick nodes based on workload requirements, optimizes the cluster using consolidation for efficient bin packing, and seamlessly integrates with Argo Workflows. Furthermore, the PrintOS team was spending approximately two hours on average upgrading each EKS cluster that used managed node groups. They were looking for a streamlined and automated node upgrade mechanism to save on the engineering effort involved in the EKS node upgrade process.

These requirements prompted PrintOS to choose Karpenter as their cluster auto scaling solution on Amazon EKS. After the initial successful testing, they implemented Karpenter across multiple EKS clusters.

Solution overview

The following sections describe the solution overview for this post.

Architecture

The following architecture diagram depicts components for running software 3D simulation jobs using Argo Workflows on Amazon EKS with the dynamic provisioning of resources and scaling of worker nodes using Karpenter. Here’s an overview of the overall workflow and a breakdown of the key components and their interactions.

Figure 2: Karpenter and Argo Workflow in Action

Overall workflow:

User initiates a simulation request through the user interface.
The simulation service receives the request and uses the Argo Workflows service API to invoke the relevant workflow template.
Argo creates a workflow job based on the template with resource requirements and the necessary configuration.
Karpenter dynamically provisions the Amazon Elastic Compute Cloud (Amazon EC2) instance based on the resource requirements of the job. The Karpenter NodePool and EC2NodeClass define the eligible instance types and Amazon Machine Images (AMIs).
The Kubernetes scheduler schedules the workflow job pods on available compute. As of this writing, PrintOS uses Amazon EC2 On-Demand Instances for production. For non-production clusters, it creates Amazon EC2 Spot Instances as EKS worker nodes to drive the cost further down.
The workflow is defined as a directed acyclic graph (DAG) with multiple tasks that have dependencies among them. The generic workflow steps are as follows:
- Generate a voxelized mesh from the input triangular mesh.
- Run an FEA over the voxelized mesh to calculate the geometry deformation.
- Run the post-processing steps in parallel:
  - 1 Convert the FEM output to 3MF or other file formats.
  - 2 Generate stress calculation for the geometry.
  - 3 Generate a video of the geometry deformation.
- Upon workflow completion, the Argo Workflow job is deleted.
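
The generic steps above map naturally onto an Argo DAG. The following is a minimal sketch, not the actual PrintOS workflow: the task and template names are hypothetical, and the argosay image (used later in this post) stands in for the real simulation containers. Tasks that list the same dependency run in parallel once that dependency completes.

```yaml
# Illustrative sketch only: names are hypothetical and each step just echoes a message.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: simulation-dag-sketch-
spec:
  entrypoint: simulation
  templates:
    - name: simulation
      dag:
        tasks:
          - name: voxelize
            template: step
            arguments:
              parameters: [{name: message, value: "generate voxelized mesh"}]
          - name: fea
            dependencies: [voxelize]
            template: step
            arguments:
              parameters: [{name: message, value: "run FEA over voxelized mesh"}]
          # The three post-processing tasks depend only on fea, so they run in parallel.
          - name: convert-output
            dependencies: [fea]
            template: step
            arguments:
              parameters: [{name: message, value: "convert FEM output to 3MF"}]
          - name: stress
            dependencies: [fea]
            template: step
            arguments:
              parameters: [{name: message, value: "generate stress calculation"}]
          - name: video
            dependencies: [fea]
            template: step
            arguments:
              parameters: [{name: message, value: "generate deformation video"}]
    # Placeholder step template; a real step would run the simulation image
    # and declare its own resource requests.
    - name: step
      inputs:
        parameters:
          - name: message
      container:
        image: argoproj/argosay:v2
        args: ["echo", "{{inputs.parameters.message}}"]
```

Because the manifest uses generateName, it would be submitted with kubectl create or argo submit rather than kubectl apply.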

PrintOS uses the Karpenter Consolidation feature, allowing it to terminate the EC2 instances when the workflow completes its execution, thus lowering infrastructure costs and optimizing resource usage.

The typical request flow for a 3D simulation job is as follows:

External traffic flow into the EKS cluster:

AWS Application Load Balancer (ALB): Receives incoming requests to trigger the 3D simulations.
Istio Ingress Gateway: Acts as the entry point for traffic into the Istio service mesh within the EKS cluster.
Virtual Service: Routes traffic from the Istio Ingress Gateway to the appropriate Kubernetes service based on defined rules.
User Interface: The web application that users interact with to initiate simulations.
Backend Simulation Service: Microservice that processes simulation requests and interacts with the Argo Workflows service API.

This architecture effectively uses Karpenter with Argo Workflow to provide a scalable and cost-efficient solution for running 3D simulations on Amazon EKS. The separation of concerns between UI, backend logic, and workflow execution promotes modularity and maintainability.

Solution walkthrough

This section demonstrates how to use Karpenter with Argo Workflows for dynamically provisioning worker nodes on Amazon EKS to run resource-intensive workloads. We guide you through the provided Argo Workflow template and explain how it uses Karpenter and node selectors to make sure that Argo workflows run on appropriately provisioned nodes.

Prerequisites

To configure the example mentioned in this post, you must set up the following components:

Provision the EKS cluster in the AWS account: PrintOS uses Amazon EKS version 1.29.
Karpenter. As of this writing, we used v0.36.1.
Argo Workflow: 5.2
Argo CLI: 3.3.8

Step 1: Configure Karpenter NodePools and EC2NodeClass

Karpenter NodePools help orchestrate a cost-optimized Spot fleet by targeting EC2 Spot Instances and dynamically provisioning and deprovisioning nodes based on their configuration. The following is one of the NodePools used in the PrintOS solution.

NodePool configuration:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: virtual-workflow-spot-pool
spec:
  template:
    metadata:
      # Labels are arbitrary key-values that are applied to all nodes in the pool
      labels:
        virtual-workflow: "enabled"
    spec:
      # References the Cloud Provider's NodeClass resource, see cloud provider specific documentation
      nodeClassRef:
        name: virtual-workflow
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
      - key: karpenter.k8s.aws/instance-size
        operator: NotIn
        values: ["nano", "micro", "small", "medium"]
      # Provisioned nodes will have these taints
      # Taints may prevent pods from scheduling if they are not tolerated by the pod
      taints:
        - key: virtual-workflow
          effect: NoSchedule
          value: enabled
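  disruption:
    # In v1beta1, consolidateAfter is only valid together with consolidationPolicy: WhenEmpty
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    expireAfter: 12h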
  limits:
    cpu: "3840"
    memory: 7680Gi
  weight: 1

An EC2NodeClass is a blueprint for EKS worker nodes. It defines their AWS-specific configuration, such as the AMI, subnets, security groups, and IAM role.

EC2NodeClass configuration:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: virtual-workflow
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: "virtual-workflow-workers" 
  subnetSelectorTerms:
    - tags:
        Name: "virtual-workflow-subnet-private*" #  subnets to attach to instances
  securityGroupSelectorTerms:
    - tags:
        Name: "eks-cluster-sg-virtual-workflow" # security group that has to be attached to nodes
  amiSelectorTerms:
    - id: <ami-id>

Step 2: Define the workflow template

The following WorkflowTemplate, named karpenter-workflow-template-demo, is designed to run a “demo-task”. It defines a DAG with a single task named demo-task. This task uses a separate template named demo-karpenter-workflow that runs the workload logic.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: karpenter-workflow-template-demo
  namespace: argo-workflow  
spec:
  entrypoint: dag-demo-workflow
  arguments:
    # These parameters are sent from the application to the Argo Workflow at runtime.
    parameters:    
      - name: name
      - name: env_var_1
  templates:
    - name: dag-demo-workflow
      dag:
        tasks:
          - name: demo-task
            template: demo-karpenter-workflow
            arguments:
              parameters:
                - name: name
                  value: '{{workflow.parameters.name}}'
                - name: env_var_1
                  value: '{{workflow.parameters.env_var_1}}'  
    - name: demo-karpenter-workflow
      inputs:
          parameters:
            - name: name
            - name: env_var_1
      nodeSelector:
        # Must match the label applied by the NodePool template
        virtual-workflow: enabled
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: virtual-workflow
                    operator: In
                    values:
                      - enabled
      # serviceAccountName is the field that Argo templates define
      serviceAccountName: karpenter-workflow-demo
      tolerations:
        - key: virtual-workflow
          value: enabled
          effect: NoSchedule
      container:
        image: argoproj/argosay:v2
        env:
          - name: <ENV_VAR_1>
            value: '{{inputs.parameters.env_var_1}}'
        args: ["echo", "Hello {{inputs.parameters.name}}!"]
        resources:
          requests:
            cpu: 48
            memory: 96G

Key aspects:

Karpenter and Spot Instances: The demo-karpenter-workflow template has a nodeSelector. This makes sure that pods belonging to this workflow run exclusively on nodes with the label virtual-workflow: enabled. Karpenter can be configured to provision Spot Instances with this specific label based on a NodePool definition. This NodePool can be configured with the desired instance types (CPU, memory) to meet the workload’s requirements.
Resource requests: The demo-karpenter-workflow container also specifies resource requests (CPU and memory) to make sure that the Kubernetes scheduler allocates sufficient resources for successful execution.
Cost optimization with Spot Instances: As the workflow’s demand for resources increases, Karpenter automatically provisions new Spot Instances based on the defined NodePool configuration, scaling the cluster to handle the workload increase effectively. When the workflow is completed, these temporary Spot Instances are voluntarily disrupted with “consolidationPolicy: WhenEmpty”, thus optimizing cost efficiency.

Karpenter is flexible enough to select and provision both Spot Instances and On-Demand Instances based on NodePool configuration. When both Spot Instances and On-Demand Instances are present in the NodePool, Spot is prioritized.
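
If a particular workflow must avoid Spot even though the NodePool allows both capacity types, one possible approach is to constrain the pod itself using the karpenter.sh/capacity-type label that the NodePool requirements already reference. The fragment below is a sketch of how the demo template could be adapted, not a setting taken from the PrintOS configuration.

```yaml
# Fragment of an Argo template (sketch): pins this template's pods to On-Demand capacity.
# The rest of the template (tolerations, container, resources) stays as shown in Step 2.
    - name: demo-karpenter-workflow
      nodeSelector:
        virtual-workflow: enabled
        # Karpenter labels each node with its capacity type ("spot" or "on-demand").
        karpenter.sh/capacity-type: on-demand
```

With this selector in place, Karpenter only considers On-Demand capacity when provisioning a node for the pending pod, while other workflows using the same NodePool can still land on Spot.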

Step 3: Apply the configuration on EKS cluster

Now we deploy the preceding configuration for the demo workflow app, starting with the Karpenter NodePool and EC2NodeClass.

kubectl apply -f <nodepool>.yaml

kubectl apply -f <nodeclass>.yaml

#Apply the Argo template

kubectl apply -f virtual-test.yaml

Use Argo CLI to submit the workflow.

argo submit -n argo-workflow --from wftmpl/karpenter-workflow-template-demo -p name="Karpenter demo" -p env_var_1="dev"
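
For a GitOps-style flow, the same submission can also be expressed declaratively. The following sketch assumes the WorkflowTemplate from Step 2 is already applied in the argo-workflow namespace. The parameter values mirror the CLI example, and the generateName prefix is an arbitrary choice.

```yaml
# Sketch: a Workflow that instantiates the WorkflowTemplate defined in Step 2.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: karpenter-workflow-template-demo-
  namespace: argo-workflow
spec:
  # Reuse the entrypoint and templates of the referenced WorkflowTemplate.
  workflowTemplateRef:
    name: karpenter-workflow-template-demo
  arguments:
    parameters:
      - name: name
        value: "Karpenter demo"
      - name: env_var_1
        value: "dev"
```

Because the manifest relies on generateName, it is created with kubectl create -f rather than kubectl apply -f.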

When the job is created, Karpenter starts provisioning new nodes.

When the job finishes execution, Karpenter automatically terminates the node to save resources.

Outcome and key considerations

Important outcomes and key considerations are noted below.

Cost optimization and improved node usage

By combining Amazon EKS, Karpenter, Argo Workflows, and Spot Instances, PrintOS achieved a highly dynamic and cost-optimized environment for running their containerized 3D simulation workloads. This approach streamlined their infrastructure management, optimized their resource usage, and delivered cost savings.

Reduced cost: Using Spot Instances with Karpenter resulted in $125K+ yearly cost savings for their non-production workflows.

Improved resource utilization: PrintOS used Karpenter consolidation to enable automatic node termination of unused instances combined with just-in-time provisioning of instances based on dynamic resource demands. In turn, they achieved a 40% improvement in EKS worker node utilization, preventing resource over-provisioning and optimizing cluster resource usage.

Automating worker node upgrade

By using the Karpenter “drift” feature, the PrintOS team saved hours of engineering effort for AMI selection and the EKS managed node group update process. After adopting Karpenter, their node upgrade process has seen an 8x reduction in engineering efforts, reducing the upgrade time from an average of approximately two hours to approximately 15 minutes for each EKS cluster.

For details on how Karpenter drift can be used for automating EKS worker node upgrades, refer to the post on upgrading EKS worker nodes with Karpenter drift.
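
As a rough illustration of what a drift-based upgrade could look like with the EC2NodeClass from Step 1: when amiSelectorTerms is changed to point at a newer AMI, existing nodes no longer match the class, and Karpenter (with drift enabled) replaces them over time. The placeholder <new-ami-id> is an assumption for illustration, not a value from the PrintOS environment.

```yaml
# Sketch: only the AMI selector changes; all other fields stay as in Step 1.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: virtual-workflow
spec:
  amiFamily: AL2
  role: "virtual-workflow-workers"
  subnetSelectorTerms:
    - tags:
        Name: "virtual-workflow-subnet-private*"
  securityGroupSelectorTerms:
    - tags:
        Name: "eks-cluster-sg-virtual-workflow"
  amiSelectorTerms:
    # Nodes launched from the previous AMI are treated as drifted after this change.
    - id: <new-ami-id>
```

Committing this change through the same GitOps pipeline keeps the node upgrade reviewable and repeatable across clusters.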

Important considerations

When adopting an architecture similar to the one described in this post, it’s essential to keep the following in mind:

Workflow resilience: Design your workflow logic within the workflow-executor template to handle potential Spot Instance interruptions gracefully. Furthermore, implement retry logic or checkpointing mechanisms so that the workflow can resume execution if necessary.
Allow a diverse set of instance types: Karpenter selects Spot Instances using the price-capacity-optimized allocation strategy, which balances the price and availability of AWS spare capacity. When designing the NodePool, you should avoid constraining instance types as much as possible. By not constraining instance types, there is a higher chance of acquiring Spot capacity at large scales with a lower frequency of Spot Instance interruptions at a lower cost.
Configure metrics for Karpenter: Karpenter emits several metrics in the Prometheus format that are useful for monitoring and managing your EKS cluster. Refer to the Karpenter documentation for a full list of metrics.
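
For the workflow resilience point, one option is Argo’s retryStrategy on the template that runs on Spot capacity. The following is a minimal sketch under assumptions: the limit and backoff values are arbitrary, the template name is hypothetical, and the argosay container stands in for a real simulation step. A pod lost to a node interruption typically surfaces as an error rather than a failure, which is why the sketch uses retryPolicy: Always instead of only retrying failures.

```yaml
# Sketch: retry a step when its pod is lost, for example after a Spot interruption.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-sketch-
spec:
  entrypoint: simulate
  templates:
    - name: simulate
      retryStrategy:
        limit: "3"             # assumed value: retry at most three times
        retryPolicy: "Always"  # retry on both failures and errors
        backoff:
          duration: "30s"      # wait before the first retry
          factor: "2"          # double the wait on each subsequent retry
          maxDuration: "10m"   # stop retrying after this total time
      nodeSelector:
        virtual-workflow: enabled
      tolerations:
        - key: virtual-workflow
          value: enabled
          effect: NoSchedule
      container:
        image: argoproj/argosay:v2
        args: ["echo", "simulation step"]
```

Retries restart the step from the beginning, so long-running simulations would still benefit from checkpointing intermediate results to durable storage.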

Cleaning up

To avoid incurring further operational costs, make sure to remove the infrastructure components you created for the examples mentioned in this post.

kubectl delete -f <nodepool>.yaml

kubectl delete -f <nodeclass>.yaml

kubectl delete -f <argo-template>.yaml

Conclusion

In this post, we detailed how HP PrintOS implemented a cost-effective workflow for dynamically provisioning Amazon EKS worker nodes by combining Karpenter with Argo Workflows. We explored the architecture and showcased how you can use Karpenter to streamline node lifecycle management and optimize resource usage to reduce costs.
