AWS in Switzerland and Austria (Alps)
Modernizing Platform Management: Hilti Group’s Approach to Improving Kubernetes Multicluster Operations
This blog post was originally focused on using Karpenter to manage worker nodes in Amazon EKS clusters. With the release of a new capability called Amazon EKS Auto Mode, we encourage you to use this more streamlined experience for managing Amazon EKS infrastructure.
Introduction
The Hilti Group supplies the worldwide construction industry with technology-leading products, systems, software, and services. This includes a range of construction software solutions tailored to various stages of the construction process, such as ON!Track for equipment management or Fieldwire for seamless jobsite management. All of these products and solutions follow Hilti Group’s purpose, “Making Construction Better”, driving productivity in the construction industry.
As of early 2023, Hilti Group’s microservices-based software solutions were deployed on self-managed Kubernetes clusters. The upgrade process for Kubernetes clusters proved to be both challenging and time-consuming, with the team spending months to transition from one version to the next. Furthermore, managing node groups with Cluster Autoscaler required manual intervention to rotate the nodes during upgrades and limited the ability to leverage different Amazon Elastic Compute Cloud (Amazon EC2) offerings for new node groups. Recognizing these inefficiencies, the Hilti Group initiated a comprehensive platform modernization project. The primary goals were to streamline the upgrade process and reduce operational overhead.
This blog post details Hilti Group’s journey to streamline their upgrade process and minimize operational overhead, reducing costs by 20% along the way. We explore their transition from self-managed Kubernetes to Amazon Elastic Kubernetes Service (Amazon EKS), a managed Kubernetes service for running Kubernetes in the AWS Cloud. We also showcase their implementation of GitOps-based cross-account cluster and node management using Flux, Crossplane, and Karpenter.
Solution Overview
Managing Kubernetes added a significant operational burden for the Hilti Group: it slowed down innovation and consumed a considerable share of the DevOps team’s bandwidth. Hilti Group’s adoption of Amazon EKS allowed them to offload this operational burden to AWS. The solution Hilti Group implemented is based on a hub-and-spoke model comprising a management cluster and multiple workload clusters. The management cluster is used for provisioning, bootstrapping, and managing the workload clusters. The workload clusters are used for running applications.
The Hilti Group uses Crossplane, a Cloud Native Computing Foundation (CNCF) project, to provision and manage AWS resources through the Kubernetes API. Crossplane provisions and manages the AWS resources that applications depend on, such as Amazon Simple Storage Service (Amazon S3) buckets and Amazon Relational Database Service (Amazon RDS) databases, as well as the Amazon EKS clusters the applications run on. Flux, a tool for keeping Kubernetes clusters in sync with sources of configuration (such as Git repositories), is used as the GitOps controller.
Flux and Crossplane are deployed in both the management cluster and the workload clusters. Flux in the management cluster reconciles the workload cluster manifests, which are then acted upon by Crossplane to provision and manage the workload clusters. Flux in each workload cluster reconciles application manifests, including manifests for AWS resources the application depends on, such as an Amazon S3 bucket. These AWS resource manifests are then acted upon by Crossplane in the workload cluster.
Karpenter, an open source node lifecycle management project built for Kubernetes, reacts quickly to scale-out events by provisioning additional compute capacity to accommodate new pods. It continuously monitors cluster workloads for opportunities to consolidate compute capacity for better node utilization and cost efficiency. Karpenter also supports Spot and Graviton instances, unlocking further cost reductions and performance gains. The Hilti Group leverages Karpenter’s drift detection feature to ensure nodes are always running the latest available AMI, freeing their engineers to work on other, more relevant tasks. NodePool Disruption Budgets ensure that updates happen at desired times and that system performance remains stable during peak loads. Additionally, by leveraging the Karpenter NodePool configuration, the Hilti Group could reduce inter-Availability Zone (AZ) data transfer costs in their development environments. This was achieved by limiting deployment to a single Availability Zone, as high availability was not a critical requirement for those environments.
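The constraints described above can be expressed in a single NodePool manifest. The following is a minimal sketch, not Hilti Group’s actual configuration: the zone, schedule, and resource names are illustrative.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: dev-default
spec:
  template:
    spec:
      requirements:
        # Pin development nodes to a single AZ to avoid inter-AZ data transfer costs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-central-1a"]
        # Allow Spot capacity for cost savings
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Allow Graviton (arm64) alongside x86 instances
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      # Allow at most 10% of nodes to be voluntarily disrupted at a time
      - nodes: "10%"
      # Block voluntary disruptions during weekday business hours
      - nodes: "0"
        schedule: "0 8 * * mon-fri"
        duration: 10h
```

With drift detection enabled, Karpenter replaces nodes whose configuration (including the AMI) no longer matches the NodePool and EC2NodeClass, within the limits of these budgets.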
Flux and Crossplane are used in tandem to deploy and manage Karpenter on workload clusters. The following diagram depicts the high-level architecture of the solution:
Adoption strategy
In the following sections, we dive deeper into specific segments within the architecture shown previously.
Creating and configuring the management cluster
In this subsection, we will review how Hilti Group’s platform team sets up the management cluster and permissions. This setup allows developers to deploy a simple YAML file in their team namespace using Flux, which triggers the creation of workload clusters.
First, the platform team uses Terraform to create the management cluster and install Crossplane and Flux. Keeping a small amount of Terraform code is necessary to bootstrap this initial setup.
Next, in the management cluster the platform team sets up the following constructs:
- A custom API using Crossplane’s Composite Resource Definition (XRD) – not shown in the diagram. The XRD defines the kind of the new API, in this example EKSCluster, and specifies the spec fields available to developers for various configuration options, such as the EKS version. Many of these fields have sensible defaults pre-set by the platform team. For instance, the version field defaults to the latest EKS version used across Hilti Group’s fleet.
- A Crossplane Composition, which defines which resources are created, and how, when the API is called.
- Crossplane EnvironmentConfigs, which are cluster-scoped custom resources that store non-sensitive data, similar to Kubernetes ConfigMaps. As shown in the diagram, the management cluster contains multiple EnvironmentConfigs, each containing data for a destination environment.
- Crossplane ProviderConfigs, which determine the authentication method and credentials used to deploy to each workload account.
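To make the first construct concrete, an XRD for such an EKSCluster API might look like the following. This is a sketch under the assumption that the API group is awsblueprints.io (as in the Claim shown later); the composite kind name and schema fields are illustrative.

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: eksclusters.awsblueprints.io
spec:
  group: awsblueprints.io
  names:
    kind: XEKSCluster
    plural: xeksclusters
  # claimNames exposes a namespaced Claim kind that developers can create
  claimNames:
    kind: EKSCluster
    plural: eksclusters
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                version:
                  type: string
                  description: EKS version; the platform team pre-sets a default.
                  default: "1.30"
```

Defaulting fields in the OpenAPI schema is what lets developers keep their Claims minimal while the platform team controls fleet-wide settings.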
The following diagram shows in detail the pattern used to structure the management cluster, and the necessary cross-account permissions for managing the lifecycle of the workload clusters.
Provisioning workload clusters
In this subsection, we will review how development teams provision a workload cluster.
Developers call the custom EKSCluster API by creating YAML files, known as Crossplane Claims, within their namespaces to provision workload clusters. A developer on team-1 can request an environment, including an Amazon Virtual Private Cloud (Amazon VPC) and an Amazon EKS cluster, by checking a Claim YAML file into Git. Flux then deploys the Claim into the corresponding namespace in the management cluster.
The following code is a sample of a Developer Claim of kind EKSCluster that will be deployed in the team-1-dev namespace. The intention of this Claim is to create an Amazon EKS cluster in team-1’s dev account.
apiVersion: awsblueprints.io/v1alpha1
kind: EKSCluster
metadata:
  name: my-eks-cluster-name-1
  namespace: team-1-dev
spec:
  ...
Once Flux deploys the Claim, the Claim selects the Composition created in the previous subsection.
The following code is a sample of a Crossplane Composition designed to select the appropriate Crossplane EnvironmentConfig and ProviderConfig based on the namespace of the Claim.
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
...
spec:
  ...
  environment:
    environmentConfigs:
      - type: Selector
        selector:
          matchLabels:
            - key: namespace
              type: FromCompositeFieldPath
              valueFromFieldPath: metadata.labels[crossplane.io/claim-namespace]
  ...
  patchSets:
    - name: common-patches
      patches:
        - type: FromEnvironmentFieldPath
          fromFieldPath: providerConfig
          toFieldPath: spec.providerConfigRef.name
  ...
The environment section is responsible for selecting the appropriate EnvironmentConfig for the Composition. In this case, the Composition uses a Selector-type EnvironmentConfig reference, which selects the EnvironmentConfig based on labels. The selector.matchLabels field specifies the label the Composition should look for. Here, it matches on a namespace label whose value is populated from metadata.labels[crossplane.io/claim-namespace], the label Crossplane sets to record the Claim’s namespace.
The patchSets section of the Composition defines reusable sets of patches that can be applied to the managed resources. In this example, there is a patch set named common-patches, which contains a single patch. The patch copies the providerConfig field from the EnvironmentConfig to the spec.providerConfigRef.name field of the patched resource.
The following code is a sample of an EnvironmentConfig containing the ProviderConfig name and the account details of the target account.
apiVersion: apiextensions.crossplane.io/v1alpha1
kind: EnvironmentConfig
metadata:
  name: team-1-dev
data:
  region: us-east-1
  accountID: "111122223333"
  providerConfig: team-1-dev
  ...
The following code sample is a ProviderConfig containing the authentication method and credentials to be used to deploy to the target workload account.
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: team-1-dev
spec:
  assumeRoleChain:
    - roleARN: arn:aws:iam::111122223333:role/team-1-dev
  credentials:
    source: IRSA
By assigning a ProviderConfig to each team, a team-specific role with the minimal access required to provision an Amazon EKS cluster in the target account can be assumed. Additionally, all team activities are logged in AWS CloudTrail. Using IAM Roles for Service Accounts (IRSA), a mechanism for associating Kubernetes service accounts with AWS Identity and Access Management (IAM) roles, for authentication ensures that credentials remain short-lived and dynamic, eliminating the need for static, long-lived access keys.
Using this Crossplane configuration pattern of linking by naming convention a Kubernetes namespace, an EnvironmentConfig, and a ProviderConfig for each workload environment, the platform team can provide a self-service experience for developers to create Amazon EKS clusters in multiple AWS accounts. This approach ensures that each team has the necessary permissions to manage their own resources, while maintaining centralized control and visibility over the overall infrastructure.
Deploying Karpenter onto workload cluster using Flux and Crossplane
Once the workload cluster is provisioned, Karpenter needs to be running on it before any workloads are onboarded, so that it can provision the required capacity.
Flux in the management cluster is used for deploying Karpenter into the workload clusters using Flux’s remote cluster configuration. The KubeConfig for each workload cluster is created as part of the cluster build using Crossplane and stored as a Kubernetes Secret in the management cluster. A Flux Kustomization in the management cluster deploys Karpenter onto the workload cluster, referencing the KubeConfig Secret to authenticate and connect to the workload cluster.
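Flux supports this pattern natively via the kubeConfig field of the Kustomization API. The following sketch illustrates the shape of such a Kustomization; the repository, path, and Secret names are illustrative assumptions, not Hilti Group’s actual values.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: karpenter-team-1-dev
  namespace: flux-system
spec:
  interval: 10m
  path: ./karpenter
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-config
  # Authenticate against the remote workload cluster using the
  # KubeConfig Secret that Crossplane created during the cluster build
  kubeConfig:
    secretRef:
      name: team-1-dev-kubeconfig
```

Because the Kustomization lives in the management cluster, all Karpenter deployments across the fleet are driven from a single Git source of truth.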
Karpenter requires an IAM role and policy to interact with AWS APIs for provisioning and terminating Amazon EC2 instances. Additionally, Karpenter can be configured to watch for and handle involuntary interruption events that would disrupt workloads, such as Spot interruption warnings and Amazon EC2 maintenance events. This requires an Amazon Simple Queue Service (Amazon SQS) queue to be provisioned, along with Amazon EventBridge rules and targets that forward such interruption events from AWS services to the queue. To streamline the creation of these AWS resources, a Crossplane Composition with the definitions of these resources is created, and a Claim for it is applied to the management cluster prior to the installation of the Karpenter Helm chart on a workload cluster.
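Once the queue exists, the Karpenter Helm chart is pointed at it through its settings. A minimal HelmRelease sketch follows; the chart version, queue name, and cluster name are illustrative assumptions.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter
  namespace: kube-system
spec:
  interval: 10m
  chart:
    spec:
      chart: karpenter
      sourceRef:
        kind: HelmRepository
        name: karpenter
  values:
    settings:
      clusterName: my-eks-cluster-name-1
      # SQS queue created by the Crossplane Composition described above;
      # Karpenter polls it for Spot interruption and maintenance events
      interruptionQueue: my-eks-cluster-name-1
```

With interruptionQueue set, Karpenter cordons, drains, and replaces affected nodes ahead of the interruption, minimizing workload disruption.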
The following diagram illustrates the workflow of deploying Karpenter to remote workload clusters.
Conclusion
By implementing the solution outlined in this blog, the Hilti Group successfully streamlined its upgrade process and reduced operational overhead. Kubernetes upgrades, which previously required months, can now be completed within a few days.
The migration from self-managed Kubernetes to Amazon EKS reduced Hilti Group’s management scope and allowed the team to focus on what matters most for the business rather than managing the Kubernetes control plane. The adoption of Crossplane has allowed for managing infrastructure using Kubernetes APIs, positioning the Hilti Group to further enhance its self-service capabilities.
The integration of Karpenter has significantly improved operations and cost efficiency. The Hilti Group reduced operational overhead by leveraging features such as drift detection. Furthermore, Karpenter has led to cost reductions through automatic consolidation, the adoption of Graviton and Spot instances, and the elimination of inter-Availability Zone (AZ) data transfer in development environments.
For additional guidance about the subjects covered in this blog post, check the resources below:
- Migrating from Cluster Autoscaler to Karpenter guide
- Crossplane on Amazon EKS repository
- Multi-Cluster GitOps using Amazon EKS, Flux, and Crossplane blog post