AWS Open Source Blog

Building a multi-tenant Kubeflow environment on Amazon EKS using Amazon Cognito and ADFS

NOTE: Since this blog post was written, much about Kubeflow has changed. While we are leaving it up for historical reference, more accurate information about Kubeflow on AWS can be found here.

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The project’s goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open source systems for ML to diverse infrastructures. Many of our enterprise customers need to integrate Kubeflow with their existing authentication mechanism, such as Active Directory Federation Services (ADFS), to achieve isolation of notebooks, experiments, and more, between various users and user groups.

In this blog post, we explore how to deploy Kubeflow on Amazon Elastic Kubernetes Service (Amazon EKS) and how to achieve multi-user isolation using Amazon Cognito and ADFS. To get an overview of Kubeflow multi-tenancy, please refer to the Kubeflow documentation.

This is a level 300 post and assumes you are familiar with using and configuring AWS services such as Amazon Cognito and AWS Lambda.

Solution Overview

Use Case

Many large companies need multi-tenant environments for their tools. By default, Kubeflow supports multi-tenancy, but only with local users and without namespace segregation.

This solution addresses that use case by integrating Kubeflow with Amazon Cognito and using Microsoft ADFS as the identity provider, which creates the preconditions for multi-tenancy and user/group segmentation. Companies can then get the benefits of Kubeflow together with integration with their existing authentication system, workload isolation, and user segmentation.

The code for this solution can be found in GitHub.

Prerequisites

To get started building a multi-tenant Kubeflow environment solution, you’ll need access to the following:

  • An AWS account
  • AWS Certificate Manager (ACM) certificate and a valid domain
  • Knowledge of Amazon Cognito, identity providers and federation, and Kubernetes/Amazon EKS
  • eksctl
  • Basic knowledge of other AWS services, such as Amazon Route 53 and AWS Lambda
  • Kubeflow basic administration knowledge
  • ADFS Claims configured to support the Amazon Cognito Configuration
  • Terraform v1.1.5

Proposed Architecture

There are a number of key components to this solution. The first is the identity management piece, configuring Amazon Cognito and an identity provider. Following that, we need to create namespaces and user groups for Kubeflow—not something natively supported by Kubeflow today. To address this, we will deploy a Lambda function that will be used to automatically create those namespaces. The third step is to configure and deploy Kubeflow on Amazon EKS. And finally, we will be configuring the authentication callbacks, so Kubeflow knows where to redirect login requests.

This is what the complete architecture looks like:

Workflow showing the path from an end user into Kubeflow

The diagram above describes the typical user workflow: logging in through ADFS and being redirected to Kubeflow in the user’s respective namespace.

  1. User opens the custom URL for the Kubeflow ALB
  2. Route 53 resolves the FQDN to the ALB
  3. ALB redirects the request to the Kubeflow containers running on EKS
  4. IstioGateway intercepts the request
  5. IstioGateway redirects the request to Amazon Cognito
  6. Amazon Cognito App client is configured with a custom domain and a valid certificate from AWS Certificate Manager
  7. Amazon Cognito redirects the request to the configured identity provider (IdP) (ADFS in this example)
  8. User logs in and ADFS redirects to Amazon Cognito with the SAML assertion
  9. Amazon Cognito has the Pre Token Generation event configured to trigger the Lambda function
  10. “Profile Manager” Lambda function is triggered with the groups that the user belongs to in AD as part of the event
  11. Based on the groups received in the event, the Lambda function connects to the Amazon EKS cluster and generates the necessary RoleBindings and ServiceRoleBindings
  12. Amazon Cognito redirects back to the Kubeflow UI

Implementation

To deploy Kubeflow on Amazon EKS, follow the steps below (a full explanation of each step can be found in the following sections):

  1. We need to own a valid domain and create the necessary certificates to be used in later steps by the Amazon Cognito domain URL and the ALB for Kubeflow (Step One – AWS ACM)
  2. As part of the integration, Kubeflow doesn’t support the automatic creation of namespaces and user groups. To enable this functionality, we will create a new Lambda function (we called it Profile Manager) that will be triggered as part of the login flow by Amazon Cognito. The steps and code to support this implementation can be found below (Step Two – Namespace Automation)
  3. Once the Microsoft ADFS is configured, we can now proceed to configure Amazon Cognito as the Identity Provider for Kubeflow. The workflow details and configuration to achieve this can be found in the Amazon Cognito IdP Integration section. (Step Three – Amazon Cognito IdP Integration Flow)
  4. After the IdP is set up, we can proceed with the creation of the Amazon EKS Cluster. (Step Four – Amazon EKS Cluster Creation)
  5. After the cluster is created and the nodes are in service, we can proceed with the Kubeflow installation. (Step Five – Kubeflow Installation)
  6. As a last step, and as we create the LoadBalancer with the Kubeflow deployed, we need to get the URL of the LoadBalancer, create the DNS record in Amazon Route 53 (we assume the certificate for this alias exists and is valid), and configure the Callback URL in Amazon Cognito. This lets the Login workflow know where to redirect (Kubeflow Entrypoint) after the authentication is successful. (Step Six – Configure CallBack on Amazon Cognito)
  7. As all the steps are now complete, we can log in to Kubeflow using the DNS entry created in the previous step. (Step Seven – Log in to Kubeflow)

Step One – AWS ACM (Certificate Manager)

Generate a certificate in AWS ACM and then validate it with Route 53 using the steps below:

NOTE: Certificates must be created in the us-east-1 Region, otherwise they will not be visible to Amazon Cognito.

  1. Make sure you own a domain and it’s registered in Route 53 (we have provided a sample domain below).
    1. Screenshot of the Route 53 console showing a hosted zone example
  2. Go to the AWS Certificate Manager console and select Request a Certificate in the left pane.
    1. Screenshot of the AWS Certificate Manager console showing the Request a Certificate option
  3. Add the following domains to the certificate:
    1. <mydomain>
    2. kubeflow.<mydomain>
      1. Needed for the ALB in a later step
    3. auth.<mydomain>
      1. Needed by Cognito to validate the sub-domain afterwards
    4. cognito.auth.<mydomain>
      1. Needed for the Custom Domain in Amazon Cognito
    5. Screenshot of AWS Certificate Manager showing how to request a public certificate
  4. Press “Request” and wait a few minutes for the certificate to be validated in Route 53.
  5. Once the certificate is valid, you should see something similar to the image below.
    1. Screenshot from the AWS Certificate Manager showing the issued certificate

Step Two – Namespace Automation (Profile Manager)

This section describes how to create the Lambda function that manages the creation of the namespaces and groups during the login workflow. The “Profile Manager” will be triggered by Amazon Cognito (this trigger is configured in Step Three) after the authentication succeeds, but before the redirection to Kubeflow.

How it works

  • User tries to log in to Kubeflow.
  • The Amazon Cognito flow will start and contact the ADFS to validate the user credentials.
  • An Amazon Cognito “Pre Token Generation” trigger will invoke the Lambda function KF_profile_Manager
  • This Lambda function will:
    • Create the user’s owner profile with the name “profile-<username>”, as this is mandatory for Kubeflow. This profile has no resource permissions, so nothing can be created in it.
    • Check whether the group namespaces exist and create any that are missing.
    • Create the RoleBindings and ServiceRoleBindings that add the user to the respective group namespaces, based on the ADFS groups whose names carry a certain prefix.
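
The planning part of this flow can be sketched in Python as follows. This is a minimal illustration, not the code from the GitHub repository: the `kf-` prefix, the helper names, and the assumption that the groups arrive as a bracketed, comma-separated string in `custom:groups` are all hypothetical. The sketch only derives the profile and namespace names and leaves the Kubernetes calls out.

```python
import re

GROUP_PREFIX = "kf-"            # hypothetical prefix; the post only says "a certain prefix"
GROUP_FIELD = "custom:groups"   # mirrors the cognitoGroupField setting


def to_dns_label(value):
    """Lowercase the value and replace characters that are not valid in a namespace name."""
    label = re.sub(r"[^a-z0-9-]+", "-", value.lower()).strip("-")
    return label[:63].strip("-")


def parse_groups(raw):
    """Split a value such as "[kf-team-a, kf-team-b]" into a list of group names."""
    cleaned = raw.strip().strip("[]")
    return [group.strip() for group in cleaned.split(",") if group.strip()]


def plan_namespaces(event):
    """Return the personal profile name and the group namespaces for one login event."""
    attributes = event["request"]["userAttributes"]
    groups = parse_groups(attributes.get(GROUP_FIELD, ""))
    return {
        "profile": "profile-" + to_dns_label(event["userName"]),
        "namespaces": sorted(
            to_dns_label(group)
            for group in groups
            if group.lower().startswith(GROUP_PREFIX)
        ),
    }


def lambda_handler(event, context):
    plan = plan_namespaces(event)
    # A full implementation would create the profile, namespaces, and bindings here.
    print(plan)
    # A Cognito trigger has to return the event so that the login flow can continue.
    return event
```

Keeping the name planning separate from the cluster calls makes it possible to check the group filtering locally with a sample event before the function is wired to Amazon Cognito.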

Python requirements

urllib3>=1.25.10
PyYAML>=5.4.1
rsa==4.5.0
six==1.15.0
botocore>=1.20.112
s3transfer>=0.4.0
kubernetes
eks-token

Profile Lambda function

  • Lambda function code can be found in GitHub
  • Go to the AWS Lambda console and create a new function with Python 3.9 as the runtime
  • Increase the timeout to 30 seconds and the memory to 256 MB
  • Handler: index.lambda_handler
  • The function expects the following environment variables:
    • clusterName = <clusterName> (must be set to the name of your Amazon EKS cluster)
    • cognitoGroupField = “custom:groups”
  • Additionally, we need to make sure that Amazon Cognito has permissions to invoke the Lambda function.
    • Go to the AWS Lambda function, and select the Configuration Tab
    • Select Permissions from the left hand menu
    • And from the Resource-based policy, click on Add Permissions
  • Screenshot of the Lambda Add Permissions dialog demonstrating how to configure it. Specifically, it shows a policy statement for an AWS service, of type Other, where the Principal is set to cognito-idp.amazonaws.com. The Source ARN should be copied from the Cognito console. The Action to allow is lambda:InvokeFunction and the suggested Statement ID is Allow_cognito_to_invoke_lambda
  • The Lambda function is now ready to be used by Amazon Cognito to create the namespaces and the respective group membership in Kubeflow.
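
The same resource-based permission can also be added programmatically. The sketch below assumes boto3 is installed and AWS credentials are configured; the two function names are illustrative, and the values mirror the ones shown in the console dialog above. The parameter builder is kept separate so it can be inspected without calling AWS.

```python
def build_invoke_permission(function_name, user_pool_arn):
    """Build the keyword arguments for the Lambda AddPermission API call."""
    return {
        "FunctionName": function_name,
        "StatementId": "Allow_cognito_to_invoke_lambda",
        "Action": "lambda:InvokeFunction",
        "Principal": "cognito-idp.amazonaws.com",
        # The Source ARN is the ARN of the Amazon Cognito user pool.
        "SourceArn": user_pool_arn,
    }


def grant_cognito_invoke(function_name, user_pool_arn):
    """Apply the permission. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("lambda")
    return client.add_permission(
        **build_invoke_permission(function_name, user_pool_arn)
    )
```

For example, `grant_cognito_invoke("KF_profile_Manager", "<CognitoUserPoolArn>")` would attach the statement to the function, with the placeholder replaced by your own user pool ARN.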

Brief explanation of the function contents

  • Role Binding Function
    • Role Binding grants the permissions defined in a role to a user or set of users. It holds a list of subjects (users, groups, or service accounts), and a reference to the role being granted.
    • Role Binding grants permissions within a specific namespace whereas a ClusterRoleBinding grants that access cluster-wide.
  • Service Role Binding
    • Istio needs a service role binding to allow the user to have permissions in a certain namespace
    • A ServiceRoleBinding specification includes two parts:
      • The roleRef field that refers to a ServiceRole object in the same namespace.
      • A list of subjects that are assigned the roles.
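
To make the two objects more concrete, here is a rough Python sketch that builds both manifests as dictionaries. The role names `kubeflow-edit` and `ns-access-istio`, the `kubeflow-userid` header, and the binding name format are assumptions based on common Kubeflow 1.2 conventions, not values confirmed by this post; check them against the function code in GitHub and your cluster before relying on them.

```python
import re


def binding_name(user_email, role="edit"):
    """Derive a Kubernetes-safe object name from the user's email (illustrative format)."""
    slug = re.sub(r"[^a-z0-9]+", "-", user_email.lower()).strip("-")
    return f"user-{slug}-clusterrole-{role}"


def role_binding(namespace, user_email, cluster_role="kubeflow-edit"):
    """RoleBinding: grants the referenced role to the user inside one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": binding_name(user_email), "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,  # assumed role name
        },
        "subjects": [
            {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "User",
                "name": user_email,
            }
        ],
    }


def service_role_binding(namespace, user_email, service_role="ns-access-istio"):
    """Istio ServiceRoleBinding: a roleRef to a ServiceRole plus a list of subjects."""
    return {
        "apiVersion": "rbac.istio.io/v1alpha1",
        "kind": "ServiceRoleBinding",
        "metadata": {"name": binding_name(user_email), "namespace": namespace},
        "spec": {
            "roleRef": {"kind": "ServiceRole", "name": service_role},  # assumed name
            "subjects": [
                # Assumed header: Kubeflow identifies the caller by a request header.
                {"properties": {"request.headers[kubeflow-userid]": user_email}}
            ],
        },
    }
```

Both dictionaries follow the structure described above: a list of subjects and a reference to the role being granted, scoped to a single namespace.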

Step Three – Amazon Cognito IdP Integration Flow

This section describes how to configure Amazon Cognito with Microsoft ADFS as an Identity provider. The diagram below describes the user journey to authenticate.

Diagram showing the user journey to authenticate. 1. The user is directed to the UI. 2. The IDP is determined. 3. Redirect to IDP 4. User Authenticated by IdP (SSO if active) 5. Post Back with SAML Assertion 6. Pre-Token Create Profile 7. Success 8. Amazon Cognito tokens provided to App

Cognito User Pool and Identity Provider Configuration

We need to create the Amazon Cognito configuration: a user pool and an identity provider that is linked to the ADFS IdP. We recommend you read the following blog post for a deep dive into Amazon Cognito before proceeding, as this post covers only the steps necessary for this configuration.

The following steps will need to be completed in the AWS console:

  1. Create a User Pool
    1. Add the following attributes:
      1. Select Username and allow validated email as secondary login
      2. In the attribute section select email as mandatory
      3. In the custom attributes create the following:
        1. Name: groups
        2. Type: String
        3. Min Length: 1
        4. Max Length: 2048
      4. Screenshot of the Create a user pool dialog in Cognito
  2. Create an App client
    1. Give a name to the App Client
    2. Enable Security
    3. Select Generate client secret
    4. Press Create app client
    5. Image showing the Cognito dialog to create the App client.
  3. In the triggers section, choose a “Pre Token Generation” trigger that points to the Lambda function that will manage the profiles in Kubeflow (the Lambda function from Step Two must already exist before you can select this option)
    1. Screenshot showing Lambda triggers dialog
  4. Create a SAML Identity Provider with the following mappings:
    1. To create the SAML Provider, you will need to use the metadata file from your Microsoft ADFS environment (or alternatively, use a URL in which you can download this file)
    2. Screenshot of Kubeflow dialog showing identity provider selection
    3. Edit the attribute mappings and add the following:
      1. https://aws.amazon.com/SAML/Attributes/validation → Email Verified
      2. https://aws.amazon.com/SAML/Attributes/Role → custom:groups
      3. http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress → emailaddress
    4. Screenshot showing SAML attribute mappings
  5. Go back to the App client and select the following options:
    1. Enable Identity providers
      1. ADFS
    2. Under OAuth 2.0, enable “Authorization code grant” under Allowed OAuth Flows, and then select the following under Allowed OAuth Scopes:
      1. Email
    3. The Callback URL uses the Route 53 domain (the domain that was used in Step One for the certificate creation). It is used again in a later step, when the alias for the Kubeflow ALB is created
      1. For example: https://kubeflow.<mydomain>/oauth2/idpresponse
    4. Sign Out URL is optional
      1. This can be built based on the following template “https://<CognitoCustomDomainURL>/logout?client_id=<APPID>&logout_uri=https://<kubeflowURL>/logout”
      2. CognitoCustomDomainURL is the url from the customDomain Creation
      3. APPID is the Cognito APPID, identified on the top of the screenshot below
      4. kubeflowURL is the Alias created for the ALB of Kubeflow
    5. Screenshot showing Cognito App client dialog
  6. As a final step, we need to update the ADFS trust relationship with the Amazon Cognito ARN in the Microsoft ADFS server
    1. Go to your ADFS server
    2. Go to the Federation Console and open the Relying Party Trust
    3. ADFS Federation Console, showing the Relying Party Trust settings
    4. Make sure that the identifier is correct
    5. The region and the Amazon Cognito Pool ID should be updated to point to the one just created in this section
    6. Screenshot of the ADFS Cognito properties dialog
    7. Make sure also, that the Endpoints have the correct endpoint for your domain
    8. It should point to https://cognito.auth.<domain>/saml2/idpresponse
    9. ADFS Cognito Endpoint properties dialog
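
The URLs used in this configuration all follow fixed templates, so they can be generated instead of typed by hand. The helper names below are illustrative, and the example hostnames are placeholders. Unlike the plain template above, this sketch percent-encodes the `logout_uri` query value, which is an addition on our side.

```python
from urllib.parse import urlencode


def callback_url(kubeflow_host):
    """Callback URL registered on the App client (points at the Kubeflow ALB alias)."""
    return f"https://{kubeflow_host}/oauth2/idpresponse"


def saml_endpoint(cognito_domain):
    """Endpoint that ADFS posts the SAML assertion to."""
    return f"https://{cognito_domain}/saml2/idpresponse"


def sign_out_url(cognito_domain, app_client_id, kubeflow_host):
    """Optional sign-out URL built from the template in the App client step."""
    query = urlencode(
        {
            "client_id": app_client_id,
            "logout_uri": f"https://{kubeflow_host}/logout",
        }
    )
    return f"https://{cognito_domain}/logout?{query}"


if __name__ == "__main__":
    # Placeholder values; replace with your own domain and App client ID.
    print(callback_url("kubeflow.example.com"))
    print(saml_endpoint("cognito.auth.example.com"))
    print(sign_out_url("cognito.auth.example.com", "abc123", "kubeflow.example.com"))
```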

Step Four – Amazon EKS Cluster Creation

This section describes how to create the Amazon EKS cluster in which Kubeflow will be installed in the next step.

To create a cluster using eksctl we can use the following steps:

  1. Create a YAML file using the following command:
    1. cat << EOF > cluster.yaml
      ---
      apiVersion: eksctl.io/v1alpha5
      kind: ClusterConfig
      
      metadata:
        name: <myClusterName>
        version: "1.18"
        region: us-east-1
      
      vpc:
        subnets:
          private:
            us-east-1a: { id: xxx }
            us-east-1b: { id: xxx }
          public:
            us-east-1a: { id: xxx }
            us-east-1b: { id: xxx }
      
      managedNodeGroups:
        - name: kubeflow-mng
          desiredCapacity: 3
          instanceType: m5.large
      EOF
  2. Run the eksctl command to create the Amazon EKS cluster
    • eksctl create cluster -f cluster.yaml
  3. Wait until the cluster and nodes are created (usually takes around 15 minutes) before proceeding to the next step.

Step Five – Kubeflow Installation

After the Amazon EKS cluster is deployed, we can deploy Kubeflow. The only prerequisite is the Amazon EKS cluster created in the previous step.

Create a file called install_kubeflow.sh and paste the following content:

#!/bin/bash
# Kubeflow 1.2.0 manifest with Amazon Cognito integration.
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_aws_cognito.v1.2.0.yaml"
# Set an environment variable for your Amazon EKS cluster name.
export AWS_CLUSTER_NAME=eksworkshop-eksctl

# Create the deployment directory; its name has to be ${AWS_CLUSTER_NAME}.
mkdir -p "${AWS_CLUSTER_NAME}" && cd "${AWS_CLUSTER_NAME}"

# Download the configuration file so that you can customize it before deploying Kubeflow.
wget -O kfctl_aws.yaml "${CONFIG_URI}"

Make the script executable and run it to download the manifest file for Kubeflow version 1.2.0 with Amazon Cognito integration:

chmod +x install_kubeflow.sh
./install_kubeflow.sh

This will download this file:

apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  namespace: kubeflow
spec:
  ...
  plugins:
  - kind: KfAwsPlugin
    metadata:
      creationTimestamp: null
      name: aws
    spec:
      auth:
        cognito:
          certArn: <valid ACM certificate>
          cognitoAppClientId: <CognitoAppClientID>
          cognitoUserPoolArn: <CognitoUserPoolArn>
          cognitoUserPoolDomain: <CognitoUserPoolDomain>
      enablePodIamPolicy: false
      region: <Region>
      roles:
      - <EksClusterRole>
  repos:
  ...

Modify the file downloaded and replace the following fields:

      • certArn:
        • ARN of the certificate created in Step One
        • arn:aws:acm:<region>:<accountID>:certificate/<acm cert id>
      • cognitoAppClientId:
        • App Client ID from Amazon Cognito
      • cognitoUserPoolArn:
        • Arn from the Amazon Cognito User Pool
        • arn:aws:cognito-idp:<region>:<accountID>:userpool/<PoolID>
      • cognitoUserPoolDomain:
      • region: <Region>
      • EKSClusterRole:
        • This is the role of the worker nodes from the Amazon EKS cluster
        • Can be found by issuing the following command:
aws iam list-roles \
    | jq -r ".Roles[] \
    | select(.RoleName \
    | startswith(\"eksctl-$AWS_CLUSTER_NAME\") and contains(\"NodeInstanceRole\")) \
    .RoleName"
Shows results from aws iam list-roles command execution
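
For readers who prefer Python over jq, the same filter can be sketched as below. The function names are illustrative; `list_role_names` assumes boto3 and AWS credentials, while the filter itself is plain Python and mirrors the `startswith` and `contains` conditions of the command above.

```python
def find_node_instance_roles(role_names, cluster_name):
    """Keep the roles created by eksctl for the worker nodes of the given cluster."""
    prefix = f"eksctl-{cluster_name}"
    return [
        name
        for name in role_names
        if name.startswith(prefix) and "NodeInstanceRole" in name
    ]


def list_role_names():
    """Fetch all IAM role names. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the filter above works without boto3

    paginator = boto3.client("iam").get_paginator("list_roles")
    return [role["RoleName"] for page in paginator.paginate() for role in page["Roles"]]
```

With credentials in place, `find_node_instance_roles(list_role_names(), "<myClusterName>")` would return the candidate role names for the roles field.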

It is a prerequisite to have the Amazon Cognito UserPool created before running the creation of the Cluster. You should also have it linked to ADFS. For more details, refer to “Step Three – Amazon Cognito IdP Integration Flow” above.

Set up the environment variables to point to your config file:

export AWS_CLUSTER_NAME=eksworkshop-eksctl
export KF_NAME=${AWS_CLUSTER_NAME}

export BASE_DIR=/Users/<user>/environment
export KF_DIR=${BASE_DIR}/${KF_NAME}

export CONFIG_FILE=${KF_DIR}/kfctl_aws.yaml  # This is the file configured before

Once the environment is set up we can proceed with the Kubeflow installation.

cd ${KF_DIR}
kfctl apply -V -f kfctl_aws.yaml
Screenshot showing kfctl output in a terminal window

Wait until the process finishes.

kfctl output showing the process finishing

Check if all pods are up and running with the following command.

kubectl get pods -A
output from kubectl get pods command

Kubeflow is now installed and we are ready to proceed with the next steps.

Check or add the Lambda role in aws-auth to authorize the Lambda function to call the Amazon EKS APIs

Run the following command to edit aws-auth configmap.

kubectl edit configmap -n kube-system aws-auth

Add the mapUsers section shown below to authorize the Lambda function to connect to Amazon EKS.

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  mapRoles: |
    - rolearn: arn:aws:iam::<accountid>:role/<kubeflowNodeIAMROLE>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
  mapUsers: |
    - userarn: arn:aws:iam::<accountid>:role/<lambdaIAMRole>
      username: eksadmin
      groups:
        - system:masters
kind: ConfigMap
metadata:
  creationTimestamp: "2021-10-19T14:21:37Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "1198"
  selfLink: /api/v1/namespaces/kube-system/configmaps/aws-auth
  uid: 660272be-0aea-4b3d-a658-e4d93d964377

Step Six – Configure CallBack on Amazon Cognito

For OAuth to work, we need to configure the callback URL in Amazon Cognito. This URL points to the ALB that was created by the Kubeflow installation in the previous step. Follow these steps to configure the callback:

    1. Get the URL for the callback:
      kubectl get ingress istio-ingress -n istio-system
      Screenshot showing results from kubectl get ingress istio-ingress -n istio-system
    2. Create a Route 53 CNAME for the ALB
      • This should match the certificate created in AWS Certificate Manager.
      • Copy the “ADDRESS” from the terminal and go to Amazon Cognito App Client Settings to update the callback URL
      • The ALB address should be an alias to kubeflow.<mydomain>
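
The Route 53 record from the step above can also be created with a short script. This is a sketch under assumptions: it uses boto3, the function names are illustrative, and it creates a plain CNAME record with a 300-second TTL rather than a Route 53 alias record. The hosted zone ID, record name, and ALB address are placeholders you would supply.

```python
def cname_change_batch(record_name, alb_address, ttl=300):
    """Build the ChangeBatch that points the Kubeflow hostname at the ALB address."""
    return {
        "Comment": "Point the Kubeflow hostname at the ALB",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": alb_address}],
                },
            }
        ],
    }


def upsert_cname(hosted_zone_id, record_name, alb_address):
    """Apply the change. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("route53")
    return client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=cname_change_batch(record_name, alb_address),
    )
```

Here `alb_address` is the value in the “ADDRESS” column of the kubectl output, and `record_name` is kubeflow.<mydomain>.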

Step Seven – Log in to Kubeflow

Now that we have finished the steps, we can test by logging in to our Kubeflow environment. In a new browser tab, enter the Route 53 CNAME that was created for the ALB in the address bar (in our example above, this is kubeflow.<mydomain>).

During the login process, Kubeflow will redirect us to Amazon Cognito, which will then redirect to the ADFS login page. After we enter our credentials, ADFS will redirect back to Amazon Cognito.

Before Amazon Cognito redirects us to the Kubeflow URL, the Lambda function will be triggered (“Pre Token Generation”) to create the namespaces in Amazon EKS and add our membership.

Finally, we land on the Kubeflow main page with access to the shared namespaces (groups), which allows team collaboration on resources.

Central Dashboard

Screenshot showing Kubeflow Central Dashboard

Cleaning up

Cleanup follows the same steps in reverse order:

  • To delete just Kubeflow:
    • Set up the environment variables to point to your config file
    • export AWS_CLUSTER_NAME=eksworkshop-eksctl
      export KF_NAME=${AWS_CLUSTER_NAME}
      export BASE_DIR=/Users/<user>/environment
      export KF_DIR=${BASE_DIR}/${KF_NAME}
      export CONFIG_FILE=${KF_DIR}/kfctl_aws.yaml  # This is the file configured before
    • Once the environment is set up, we can proceed with the Kubeflow deletion
    • cd ${KF_DIR}
      kfctl delete -V -f kfctl_aws.yaml
  • To clean up the infrastructure (GitHub code):
    • terraform destroy -auto-approve

Conclusion

Kubeflow is a great engine for ML workflows on Kubernetes. It’s simple, portable, and scalable. In this post, we walked you through how to extend your Kubeflow environment to achieve multi-tenancy using Amazon Cognito and Microsoft ADFS as the identity provider. While we used Microsoft ADFS as the IdP in this post, any other third-party solution that can integrate with Amazon Cognito can be used, for example, OpenSSO.

Jose Alfredo de Castro Nunes

I am a Cloud Infrastructure Architect Consultant at AWS. I'm tech savvy and love to bring new technologies to design new solutions.

Krithivasan Balasubramaniyan

Krithivasan is a Principal Consultant at AWS. He enables global enterprise customers in their digital transformation journey and helps architect cloud native solutions.

Rumeshkrishnan Mohan

Rumeshkrishnan Mohan is a Big Data Consultant with Amazon Web Services. He works with Global Customers in building their data lakes.