AWS Big Data Blog

Set up CI/CD pipelines for AWS Glue DataBrew using AWS Developer Tools

An integral part of DevOps is adopting the culture of continuous integration and continuous delivery (CI/CD). This enables teams to securely store and version code, maintain parity between development and production environments, and achieve end-to-end automation of the release cycle, including building, testing, and deploying to production. In essence, development teams follow CI/CD processes to deliver higher-quality software frequently and predictably.

CI/CD practices apply beyond software delivery, and in this post we focus on bringing the same best practices to data preparation for AWS Glue DataBrew. DataBrew is a visual data preparation tool that makes it easy to profile and prepare data for analytics and machine learning (ML). We demonstrate how to integrate DataBrew with services from the AWS Developer Tools Suite in order to automate the release process for DataBrew’s no-code data preparation recipes.

Overview of solution

In this post, we walk through a solution that uses AWS CodePipeline to automatically deploy DataBrew recipes maintained in an AWS CodeCommit repository to both pre-production and production environments. The pipeline is triggered when users push a change to a DataBrew recipe through CodeCommit. It then updates and publishes a new revision of the recipe to both pre-production and production environments using a custom AWS Lambda deployer.

The pipeline has three stages, as outlined in the following architecture diagram:

  • Source – A source stage for your CodeCommit source action
  • Deploy – Preprod – A deployment stage for your CodeDeploy deployment action to pre-production
  • Deploy – Prod – A deployment stage for your CodeDeploy deployment action to production

The steps in this solution are as follows:

  1. Developers push a new or updated DataBrew recipe JSON definition to a CodeCommit repository.
  2. The source change triggers a CodePipeline transition to a pre-production deployment stage.
  3. AWS CodeDeploy pushes the updated recipe artifacts to Amazon Simple Storage Service (Amazon S3).
  4. CodeDeploy invokes a Lambda function as a custom deploy action.
  5. The Lambda function updates the relevant DataBrew recipe in the pre-production account and publishes the version.
  6. CodePipeline transitions to the production deployment stage, and repeats the same process to update the DataBrew recipe in the production account.

In the scope of this solution, we pass directly from pre-production to production deployment stages. However, based on your business requirements, we recommend extending the solution to include verification steps (functional, integration, and performance tests) before deploying the recipe to production.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • A Git client that supports Git version 1.7.9 or later. If you don’t have a Git client, you can download one and install it.
  • git-remote-codecommit installed locally (you can get this from the Python package index).
  • Four AWS accounts:
    • Infrastructure – Where the infrastructure resources reside, which includes the CodeCommit repository, CodePipeline pipeline, CodeDeploy application and deployment group, Lambda functions, and related AWS Identity and Access Management (IAM) permissions. This is the account we use to deploy the AWS CloudFormation template.
    • User – The account mimicking users across your organization.
    • Pre-production – The account storing pre-production DataBrew resources.
    • Production – The account storing production DataBrew resources.
  • Roles under the production and pre-production accounts that provide the infrastructure account access to list, create, update, and publish DataBrew recipes.

We use four AWS accounts because a multi-account AWS environment is a best practice that provides a higher level of resource isolation. Each account provides a natural security and access boundary, and each is allocated only the resources required for its organizational unit.

Create the prerequisite roles

We use IAM roles to delegate cross-account access to DataBrew resources. We need cross-account access in two areas:

  • The production and pre-production accounts own the DataBrew recipe resources that are accessed and updated via the infrastructure account. Roles are used to establish a trust relationship between the trusting accounts (pre-production and production) and the trusted account (infrastructure).
  • The user account requires access to the repository created in the infrastructure account. This allows developers to locally clone the shared repository.

Pre-production account

To set up the policy and role for cross-account recipe access in the pre-production account, complete the following steps:

  1. Sign in to the AWS Management Console using the pre-production account.
  2. On the IAM console, in the navigation pane, choose Roles, then choose Create Role.
  3. Under Select type of trusted entity, choose Another AWS Account.
  4. For Account ID, enter the AWS account ID for the infrastructure account.
  5. Choose Next: Permissions.
  6. Choose Create policy. A new browser tab opens for creating the required policy.
  7. On the JSON tab, enter the following permissions:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "databrew:ListRecipes"
                    "databrew:CreateRecipe",
                    "databrew:PublishRecipe",
                    "databrew:UpdateRecipe"
                ],
                "Resource": "*",
                "Effect": "Allow"
            }
        ]
    }
  8. Choose Next: Tags, then choose Next: Review.
  9. For Name, enter a name for the policy.
  10. Review the policy permissions and choose Create policy.
  11. Close the tab and return to the original tab to complete creating the role.
  12. Choose the refresh button and search for the policy name created in the previous step.
  13. Select the policy.
  14. Choose Next: Tags, then choose Next: Review.
  15. For Name, enter a name for the role. Make sure to note down the role name because you use it later in this tutorial.
  16. Choose Create role.
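
If you prefer to script this setup, the following boto3 sketch creates an equivalent policy and role. It's a minimal sketch, run with pre-production account credentials; the policy and role names shown are placeholders, so substitute your own.

    import json
    import boto3

    # Run with pre-production account credentials.
    iam = boto3.client('iam')

    INFRA_ACCOUNT_ID = 'infrastructure-account-ID'  # placeholder

    # Trust policy letting the infrastructure account assume this role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{INFRA_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole"
        }]
    }

    # Permissions policy matching the JSON shown in step 7.
    recipe_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Action": [
                "databrew:ListRecipes",
                "databrew:CreateRecipe",
                "databrew:PublishRecipe",
                "databrew:UpdateRecipe"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }]
    }

    policy = iam.create_policy(
        PolicyName='PreProdDataBrewRecipeAccess',  # placeholder policy name
        PolicyDocument=json.dumps(recipe_policy),
    )
    iam.create_role(
        RoleName='PreProdDataBrewAccessRole',  # placeholder; note it for later
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    iam.attach_role_policy(
        RoleName='PreProdDataBrewAccessRole',
        PolicyArn=policy['Policy']['Arn'],
    )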

Production account

To create the cross-account access role in the production account, we follow the same steps as required for the pre-production account. Sign in to the AWS Management Console using the production account and repeat the steps in the previous section.

Infrastructure account

To set up the policy and role for cross-account repository access in the infrastructure account, complete the following steps:

  1. Sign in to the console using the infrastructure account.
  2. On the IAM console, in the navigation pane, choose Roles, then choose Create Role.
  3. Under Select type of trusted entity, choose Another AWS Account.
  4. For Account ID, enter the AWS account ID for the user account.
  5. Choose Next: Permissions.
  6. Choose Create policy. A new browser tab opens.
  7. On the JSON tab, enter the following permissions:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "codecommit:BatchGet*",
                    "codecommit:Create*",
                    "codecommit:DeleteBranch",
                    "codecommit:Get*",
                    "codecommit:List*",
                    "codecommit:Describe*",
                    "codecommit:Put*",
                    "codecommit:Post*",
                    "codecommit:Merge*",
                    "codecommit:Test*",
                    "codecommit:Update*",
                    "codecommit:GitPull",
                    "codecommit:GitPush"
                ],
                "Resource": [
                    "arn:aws:codecommit:us-east-1:infrastructure-account-ID:DataBrew-Recipes-Repo"
                ]
            },
            {
                "Effect": "Allow",
                "Action": "codecommit:ListRepositories",
                "Resource": "*"
            }
        ]
    }
  8. Choose Next: Tags, then choose Next: Review.
  9. For Name, enter a name for the policy.
  10. Review the policy permissions and choose Create policy.
  11. Close the tab and return to the original tab to complete creating the role.
  12. Choose the refresh button and search for the policy name created in the previous step.
  13. Select the policy.
  14. Choose Next: Tags, then choose Next: Review.
  15. For Name, enter a name for the role. To match the provided JSON for the user account policy, use the name CrossAccountRepositoryContributorRole.
  16. Choose Create role.

User account

To set up the policy, role, and user for cross-account repository access in the user account, complete the following steps:

  1. Sign in to the console using the user account.
  2. On the IAM console, in the navigation pane, choose Users, then choose Add User.
  3. For User name, enter a name for the user.
  4. For Access type, select Programmatic access and AWS Management Console access.
  5. Choose Next: Permissions.
  6. For Add user to group, choose Create group.
  7. For Group name, enter a name that describes the job function for users accessing the DataBrew recipe repository.
  8. Choose Create policy. A new browser tab opens.
  9. On the JSON tab, enter the following permissions:
    {
      "Version": "2012-10-17",
      "Statement": {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::infrastracture-account-ID:role/CrossAccountRepositoryContributorRole"
      }
    }
  10. Choose Next: Tags, then choose Next: Review.
  11. For Name, enter a name for the policy.
  12. Review the policy permissions and choose Create policy.
  13. Close the tab and return to the original tab to complete creating the group.
  14. Choose the refresh button and search for the policy name created in the previous step.
  15. Select the policy.
  16. Choose Create group. On the Add user page, the newly created group is now highlighted.
  17. Choose Next: Tags, then choose Next: Review.
  18. Choose Create user.
  19. Choose Download .csv to download the security credentials for the user, then choose Close.
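
Before moving on, you can optionally confirm that the new user can assume the infrastructure role programmatically. A minimal boto3 sketch, assuming the user's access keys from the downloaded .csv file are configured locally and substituting your infrastructure account ID:

    import boto3

    # Run with the user-account credentials from the downloaded .csv file.
    sts = boto3.client('sts')
    response = sts.assume_role(
        RoleArn='arn:aws:iam::infrastructure-account-ID:role/CrossAccountRepositoryContributorRole',
        RoleSessionName='repo-access-check',
    )
    credentials = response['Credentials']

    # Use the temporary credentials to list repositories in the infrastructure account.
    codecommit = boto3.client(
        'codecommit',
        region_name='us-east-1',
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken'],
    )
    print(codecommit.list_repositories()['repositories'])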

Deploy the solution

For a quick start, you can deploy the provided CloudFormation stack, which creates all the required resources in your account (in the us-east-1 Region). Follow the rest of this post for a deeper dive into the individual resources.

  1. Sign in to the console using the infrastructure account.
  2. Choose Launch Stack.
  3. For ApplicationName, optionally enter a name for the CodeDeploy application.
  4. For BranchName, optionally enter a name for the default CodeCommit branch.
  5. For PreProdDataBrewAccessRole, enter a role under the pre-production account that is assumed by the infrastructure account to create and publish DataBrew recipes.
  6. For ProdDataBrewAccessRole, enter a role under the production account that is assumed by the infrastructure account to create and publish DataBrew recipes.
  7. For RepositoryName, optionally enter a name for the CodeCommit repository.
  8. Select I acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Create stack.

It takes a few minutes for the stack creation to complete; you can follow its progress on the Events tab.
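
If you prefer the AWS SDK over the console, the stack can also be deployed with a short boto3 sketch. The stack name and template URL below are placeholders (use the template behind the Launch Stack button); the parameter keys match the CloudFormation parameters described above.

    import boto3

    # Run with infrastructure-account credentials; the solution deploys to us-east-1.
    cloudformation = boto3.client('cloudformation', region_name='us-east-1')

    cloudformation.create_stack(
        StackName='databrew-recipe-cicd',  # placeholder stack name
        TemplateURL='https://your-bucket.s3.amazonaws.com/template.yaml',  # placeholder
        Parameters=[
            {'ParameterKey': 'PreProdDataBrewAccessRole', 'ParameterValue': 'pre-prod-role-name'},
            {'ParameterKey': 'ProdDataBrewAccessRole', 'ParameterValue': 'prod-role-name'},
        ],
        Capabilities=['CAPABILITY_IAM'],
    )

    # Poll until creation completes, mirroring the progress shown on the Events tab.
    cloudformation.get_waiter('stack_create_complete').wait(StackName='databrew-recipe-cicd')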

Set up your CodeCommit repository for recipe storage

CodeCommit is a fully managed source control service that makes it easy to host secure and highly scalable private Git repositories. Setting up a CodeCommit repository for recipes streamlines the development process, allowing multiple developers to collaborate, track changes, and revert to previous versions if necessary.

View and manage your credentials

To set up Git credentials for your CodeCommit repository, complete the following steps:

  1. Sign in to the console using the infrastructure account. Make sure to sign in as the IAM user who creates and uses the Git credentials for connections to CodeCommit.
  2. On the IAM console, in the navigation pane, choose Users.
  3. From the list of users, choose your IAM user.
  4. On the Security Credentials tab, under HTTPS Git credentials for AWS CodeCommit, choose Generate credentials.
  5. Copy the user name and password that IAM generated for you, either by showing, copying, and pasting this information into a secure file on your local computer, or by choosing Download credentials to download this information as a .csv file. You need this information to connect to CodeCommit.
  6. After you save your credentials, choose Close.

Create a CodeCommit repository

To set up a CodeCommit repository, complete the following steps:

  1. On the CodeCommit console, in the navigation pane, choose Repositories.
  2. Choose Create repository.
  3. For Repository name, enter a name for the repository. To match the provided JSON for the infrastructure cross-account repository policy, use the name DataBrew-Recipes-Repo.
  4. Choose Create.
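
The same repository can be created with a quick boto3 sketch, assuming infrastructure-account credentials:

    import boto3

    codecommit = boto3.client('codecommit', region_name='us-east-1')

    # The name matches the infrastructure cross-account repository policy.
    repository = codecommit.create_repository(
        repositoryName='DataBrew-Recipes-Repo',
        repositoryDescription='Shared repository for DataBrew recipe definitions',
    )
    print(repository['repositoryMetadata']['cloneUrlHttp'])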

Add a README to the CodeCommit repository

The repository is currently empty; we add a README file to introduce the project to developers and provide instructions for contributions.

  1. On the CodeCommit console, in the navigation pane, choose Repositories.
  2. Choose the repository that you created in the previous step.
  3. Choose Create file.
  4. Use the code editor to create the contents of the README.

We recommend including steps from this post describing how to set up local access, push DataBrew recipes, and create pull requests.

  5. For File name, enter README.md.
  6. For Author name, enter your name.
  7. For Email address, enter an email address that repository users can contact you with.
  8. Choose Commit changes.

Configure cross-account repository access and create a local repository

With the CodeCommit repository in place, we create a local version for developer use. To do so, we use the AWS Command Line Interface (AWS CLI).

  1. Open a terminal window and configure the AWS CLI:
    > aws configure
  2. When prompted, provide the information in the following code (the access and secret keys are available in the .csv security credentials file downloaded as part of the prerequisite steps):
    > AWS Access Key ID [None]: user-access-key
    > AWS Secret Access Key [None]: user-secret-access-key
    > Default region name [None]: us-east-1
    > Default output format [None]: json
  3. In a plaintext editor, open the config file, also known as the AWS CLI configuration file. Depending on your operating system, this file might be located at ~/.aws/config on Linux, macOS, or Unix, or at drive:\Users\USERNAME\.aws\config on Windows.
  4. Update the file to include two entries, the default for the developer in the user account, and a second for cross-account access. The resulting file should look as follows:
    [default]
    account = user-account-id
    region = us-east-1
    output = json
    
    [profile CrossAccountAccessProfile]
    account = infrastructure-account-ID
    region = us-east-1
    output = json
    role_arn = arn:aws:iam::infrastructure-account-ID:role/CrossAccountRepositoryContributorRole
    source_profile = default
  5. Save your changes, and close the plaintext editor.
  6. Run git clone to clone the shared repository:
    > git clone codecommit://CrossAccountAccessProfile@DataBrew-Recipes-Repo

Push DataBrew recipes to your code repository

In DataBrew, a recipe represents a set of data transformation steps. Because recipes are defined as standalone entities, they can be downloaded and applied to multiple datasets through the use of recipe jobs. To download a recipe, either use the DataBrew console's download functionality (as described in this section) or the DescribeRecipe API operation.
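
If you prefer the API route, a boto3 sketch along the following lines retrieves a recipe's steps and writes them to a JSON file. The recipe name is a placeholder; we save only the Steps list because the deployer Lambda function later in this post derives the recipe name from the file name and reads the file contents as the recipe steps.

    import json
    import boto3

    databrew = boto3.client('databrew')

    RECIPE_NAME = 'your-recipe-name'  # placeholder

    # DescribeRecipe returns the full recipe definition; keep only the steps.
    steps = databrew.describe_recipe(Name=RECIPE_NAME)['Steps']

    # The file name (minus .json) becomes the recipe name during deployment.
    with open(f'DataBrew-Recipes-Repo/{RECIPE_NAME}.json', 'w') as f:
        json.dump(steps, f, indent=2, default=str)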

Create the initial recipe commit

You’re now ready to commit your first DataBrew recipe to your repository.

  1. Sign in to the console using the user account.
  2. On the DataBrew console, in the navigation pane, choose Recipes.
  3. Select the recipe you want to download.
  4. Choose Download as JSON to trigger the browser to download the recipe locally.
  5. In your terminal window, locate the recipe and move it to the repository created in the previous step:
    > mv Downloads/your-recipe-name.json DataBrew-Recipes-Repo/
    > cd DataBrew-Recipes-Repo
  6. Run git checkout -b to create a new branch for the in-flight recipe change:
    > git checkout -b feature-branch-name
  7. Run git add to stage the change:
    > git add .
  8. Run git commit to commit the change, and add an explanatory commit message:
    > git commit -m "your-commit-message-here"
  9. Run git push to push the feature branch to the remote repository:
    > git push --set-upstream origin feature-branch-name

Create a pull request

Creating a pull request allows other developers to review changes to a recipe before it’s pushed through the pipeline.

  1. Sign in to the console using the user account and switch to the cross-account repository contributor role, available at the following address:
    https://signin.thinkwithwp.com/switchrole?account=infrastructure-account-ID&roleName=CrossAccountRepositoryContributorRole
  2. Choose Switch Role.
  3. On the CodeCommit console, in the navigation pane, choose Repositories.
  4. Select the name of the repository that you created in the previous step.
  5. In the navigation pane, choose Pull requests.
  6. Choose Create pull request.
  7. For Source, choose the branch that contains the changes that you want reviewed (the feature branch).
  8. For Destination, choose the branch where you intend to merge your code changes when the pull request is closed (main).
  9. Choose Compare.
  10. For Title, enter a short description for your code review.
  11. Choose Create pull request.

Another developer in your organization (with cross-account repository permissions) can now view the pull request, provide any feedback, and merge the change.

Create a custom Lambda function for deploying recipes

CodePipeline supports custom actions to handle cases that aren't covered by the default actions. We use this capability to create custom actions that invoke Lambda functions to update and publish DataBrew recipes.

Create the production deployment Lambda function

To create your function, complete the following steps:

  1. Sign in to the AWS Management Console using the infrastructure account.
  2. On the Lambda console, in the navigation pane, choose Functions.
  3. Choose Create function.
  4. For Function name, enter a name for the function.
  5. For Runtime, choose the language you want to write the function in. If you want to use the code sample provided in this tutorial, choose Python 3.8.
  6. Choose Create function.

Author the production deployment Lambda function

The sample code we provide in this section reads the recipe contents from the CodeDeploy artifacts bucket, then assumes the cross-account role to update the recipe with the same name. If the recipe doesn’t exist, the Lambda function creates one. It then publishes the recipe and sends a status notification to CodePipeline.

  1. In the Lambda code editor, choose index.py and enter the following sample code:
    import os
    import json
    import zipfile
    import boto3
    from io import BytesIO
    
    def get_clients():
        s3_client = boto3.resource('s3')
        sts_connection = boto3.client('sts')
        # Assume the cross-account role supplied through the 'role' environment variable
        cross_account = sts_connection.assume_role(RoleArn=os.environ['role'], RoleSessionName="session")
        access_key = cross_account['Credentials']['AccessKeyId']
        secret_key = cross_account['Credentials']['SecretAccessKey']
        session_token = cross_account['Credentials']['SessionToken']
        databrew_client = boto3.client(
            'databrew',
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
            aws_session_token=session_token,
        )
        return s3_client, databrew_client
    
    def get_name_contents(event, s3_client):
        # Locate the zipped source artifact that CodePipeline staged in Amazon S3
        s3_location = event['CodePipeline.job']['data']['inputArtifacts'][0]['location']['s3Location']
        s3_bucket = s3_location['bucketName']
        s3_file = s3_location['objectKey']
        zip_obj = s3_client.Object(bucket_name=s3_bucket, key=s3_file)
        # Extract the recipe JSON file from the compressed artifact
        buffer = BytesIO(zip_obj.get()["Body"].read())
        file_name = ''
        json_content = ''
        z = zipfile.ZipFile(buffer)
        for filename in z.namelist():
            if filename.endswith('.json'):
                file_name = filename
                with z.open(file_name) as content:
                    json_content = json.load(content)
        # The file name (minus .json) is the recipe name; the contents are its steps
        return file_name.replace('.json', ''), json_content
    
    def lambda_handler(event, context):
        codepipeline_client = boto3.client('codepipeline')
        job_id = event['CodePipeline.job']['id']
        try:
            # Client creation
            s3_client, databrew_client = get_clients()
            # Get the recipe name and steps from the artifact
            recipe_name, recipe_steps = get_name_contents(event, s3_client)
            # Create the recipe if it doesn't exist yet in the target account
            recipe_lists = databrew_client.list_recipes(MaxResults=99)
            if recipe_name not in (x['Name'] for x in recipe_lists['Recipes']):
                databrew_client.create_recipe(Name=recipe_name, Steps=recipe_steps)
            # Update the recipe
            databrew_client.update_recipe(Description='updating recipe', Name=recipe_name, Steps=recipe_steps)
            # Publish a new recipe version
            databrew_client.publish_recipe(Description='publishing recipe', Name=recipe_name)
            # Notify AWS CodePipeline of a successful job
            codepipeline_client.put_job_success_result(jobId=job_id)
        except Exception as e:
            # Notify the pipeline of a failure
            codepipeline_client.put_job_failure_result(jobId=job_id, failureDetails={'type': 'JobFailed', 'message': str(e)})
    
  2. Choose Deploy.
  3. On the Configuration tab, choose Environment variables in the left navigation pane.
  4. Choose Edit.
  5. Choose Add environment variable.
  6. For Key, enter role.
  7. For Value, enter the ARN for the production cross-account role.
  8. Choose Save.
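
The environment variable can also be set with the Lambda API; a minimal sketch, with a hypothetical function name and the production role ARN as a placeholder:

    import boto3

    lambda_client = boto3.client('lambda')

    # Set the 'role' environment variable the function reads at run time.
    lambda_client.update_function_configuration(
        FunctionName='databrew-prod-deployer',  # hypothetical function name
        Environment={'Variables': {
            'role': 'arn:aws:iam::prod-account-ID:role/prod-account-role-name'  # placeholder ARN
        }},
    )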

Update the Lambda function’s permissions

Lambda creates an execution role at the time of creation; it assumes this role when the function is invoked. We need to update this role to provide the function the required access to Amazon S3 and CodePipeline in the infrastructure account, as well as permission to assume the production account's cross-account role for DataBrew.

  1. In the Lambda function editor, on the Configuration tab, choose Permissions in the left navigation pane.
  2. For Execution role, choose the role name to navigate to the IAM console.
  3. Choose Add inline policy.
  4. On the JSON tab, enter the following permissions:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "codepipeline:PutJobFailureResult",
                    "codepipeline:PutJobSuccessResult"
                ],
                "Resource": "*",
                "Effect": "Allow"
            },
            {
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": "arn:aws:s3::: codepipeline-us-east-1-*" 
                "Effect": "Allow"
            },
            {
                "Action": [
                    "logs:CreateLogStream",
                    "logs:CreateLogGroup",
                    "logs:PutLogEvents"
                ],
                "Resource": [
                    "arn:aws:logs:us-east-1:infrastructure account ID:log-group:/aws/lambda/Lambda function name"
                ],
                "Effect": "Allow"
            },
            {
                "Action": [
                    "sts:AssumeRole"
                ],
                "Resource": "arn:aws:iam::prod account ID:role/prod account role name",
                "Effect": "Allow"
            }
        ]
    }
  5. Choose Review policy.
  6. For Name, enter a name for the inline policy.
  7. Choose Create policy.

Create the pre-production deployment Lambda function

To create the pre-production custom deployer function, repeat the preceding sections for creating, authoring, and updating permissions, using the pre-production naming convention and role.

Create a three-stage pipeline in CodePipeline

CodePipeline is a continuous delivery service for fast and reliable application updates. We build a three-stage pipeline that includes a source, pre-production, and production stage. CodePipeline automatically builds, updates, and publishes DataBrew recipes every time a change is pushed to the code repository. This enables developers to rapidly and reliably deliver recipe updates while following best practices.

Create an application in CodeDeploy

In CodeDeploy, an application is a container for the software you want to deploy. Later, you use this application with CodePipeline to automate recipe deployments to DataBrew.

  1. On the CodeDeploy console, in the navigation pane, choose Applications.
  2. Choose Create application.
  3. For Application name, enter the name for your application.
  4. For Compute Platform, choose AWS Lambda.
  5. Choose Create application.

Create a deployment group in CodeDeploy

A deployment group is a resource that defines deployment-related settings. The DataBrew recipe pipeline doesn’t use a deployment group; however, CodePipeline requires at least two stages to successfully run. We create this as a temporary resource, which can be deleted after the pipeline is set up.

  1. On the CodeDeploy console, in the navigation pane, choose Applications.
  2. Select the application that you created in the previous step.
  3. Choose Create deployment group.
  4. For Deployment group name, enter a name for your deployment group.
  5. For Service role, choose any role with minimal access (this will be deleted).
  6. Choose Create deployment group.
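
Both resources can also be created programmatically. A sketch under the same assumptions, with placeholder names and a placeholder minimal-access service role:

    import boto3

    codedeploy = boto3.client('codedeploy')

    codedeploy.create_application(
        applicationName='databrew-recipe-deployer',  # hypothetical name
        computePlatform='Lambda',
    )

    # Temporary deployment group; delete it after the pipeline is set up.
    codedeploy.create_deployment_group(
        applicationName='databrew-recipe-deployer',
        deploymentGroupName='temporary-group',
        serviceRoleArn='arn:aws:iam::infrastructure-account-ID:role/minimal-service-role',  # placeholder
        deploymentConfigName='CodeDeployDefault.LambdaAllAtOnce',
    )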

Create a pipeline in CodePipeline

To set up a pipeline in CodePipeline, complete the following steps:

  1. On the CodePipeline console, in the navigation pane, choose Pipelines.
  2. Choose Create pipeline.
  3. For Pipeline name, enter the name for your pipeline.
  4. For Service role, select New service role to allow CodePipeline to create a service role in IAM.
  5. Choose Next.
  6. For Source provider, choose AWS CodeCommit.
  7. For Repository name, choose the repository you created in the previous step.
  8. For Branch name, choose main.
  9. Choose Next.
  10. Choose Skip build stage.
  11. Choose Skip.
  12. For Deploy provider, choose AWS CodeDeploy.
  13. For Application name, choose the application you created in the previous step.
  14. For Deployment group, choose the deployment group you created in the previous step.
  15. Choose Next.
  16. Review the information and choose Create pipeline.

Add stages to the pipeline

With the initial pipeline in place, we can now update the stages to invoke the custom Lambda functions.

  1. On the CodePipeline console, in the navigation pane, choose Pipelines.
  2. Choose the pipeline created in the previous step.
  3. Choose Edit.
  4. Choose Edit stage, then choose Delete to remove the existing Deploy stage.

We replace this with a custom Lambda stage.

  5. Choose + Add stage to create the pre-production deployment stage.
  6. For Stage name, enter a name for the pipeline stage.
  7. Choose Add stage.
  8. Choose + Add action group.
  9. For Action name, enter a name for the action.
  10. For Action provider, choose AWS Lambda.
  11. For Function name, choose the pre-production Lambda function you created earlier.
  12. Choose Done.
  13. On the Edit page, choose + Add stage to create the production deployment stage.
  14. Repeat the previous steps with production resources.
  15. On the Edit page, choose Save.

The resulting pipeline should contain three stages.

You can now go back to the CodeDeploy console and delete the deployment group.
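
To confirm the edited pipeline has the expected three stages, you can inspect its state with a short boto3 sketch (the pipeline name is a placeholder):

    import boto3

    codepipeline = boto3.client('codepipeline')

    state = codepipeline.get_pipeline_state(name='your-pipeline-name')  # placeholder
    for stage in state['stageStates']:
        print(stage['stageName'], stage.get('latestExecution', {}).get('status'))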

Test your system

We’re all set! To test the system, we merge the pull request previously created and release the change through the pipeline.

  1. Sign in to the console using the user account and switch to the cross-account repository contributor role. This is available at the following address:
    https://signin.thinkwithwp.com/switchrole?account=infrastructure-account-ID&roleName=CrossAccountRepositoryContributorRole 
  2. Choose Switch Role.
  3. On the CodeCommit console, in the navigation pane, choose Repositories.
  4. Select the repository that you created in the previous step.
  5. In the navigation pane, choose Pull requests.
  6. Choose the pull request created in the previous step.
  7. Review the change and choose Merge.
  8. Choose Merge pull request.

To see the pipeline deploy, sign in to the console using the infrastructure account and navigate to the pipeline. After the pipeline deploys, sign in to the pre-production and production accounts. You should see the recipe updated with a new major version that contains the steps you included in your commit.
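
You can also verify this programmatically by listing the recipe's versions in each target account. A minimal sketch, with a placeholder recipe name:

    import boto3

    # Run with pre-production or production account credentials.
    databrew = boto3.client('databrew')

    versions = databrew.list_recipe_versions(Name='your-recipe-name')  # placeholder
    for recipe in versions['Recipes']:
        print(recipe['Name'], recipe['RecipeVersion'])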

Clean up

To avoid incurring future charges, delete the resources created during this walkthrough.
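
If you deployed via the CloudFormation stack, deleting the stack removes the resources it created; any resources you created by hand (such as the Lambda functions and pipeline) must be deleted separately. A minimal sketch, with the stack name as a placeholder:

    import boto3

    # Run with infrastructure-account credentials.
    cloudformation = boto3.client('cloudformation', region_name='us-east-1')
    cloudformation.delete_stack(StackName='databrew-recipe-cicd')  # placeholder stack name
    cloudformation.get_waiter('stack_delete_complete').wait(StackName='databrew-recipe-cicd')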

Conclusion

In this post, we walked through how to use DataBrew alongside CodeCommit and CodePipeline to set up CI/CD pipelines for DataBrew recipes. We encourage you to follow this approach to automate the recipe release process in a multi-account environment.


About the Authors

Romi Boimer is a Software Development Manager at AWS and a technical lead for AWS Glue DataBrew. She designs and builds solutions that enable customers to efficiently prepare and manage their data. Romi has a passion for aerial arts; in her spare time, she enjoys fighting gravity and hanging from fabric.

Gaurav Wadhawan is a Software Engineer working on AWS Glue DataBrew. He’s passionate about big data. He spends his free time exploring new places and trying new cuisines.