AWS Machine Learning Blog

Amazon SageMaker notebooks now support Git integration for increased persistence, collaboration, and reproducibility

It’s now possible to associate GitHub, AWS CodeCommit, and any self-hosted Git repository with Amazon SageMaker notebook instances to easily and securely collaborate and ensure version-control with Jupyter Notebooks. In this blog post, I’ll elaborate on the benefits of using Git-based version-control systems and how to set up your notebook instances to work with Git repositories.

Data science projects demand collaborative effort. Data scientists, machine learning developers, data engineers, analysts, and business decision-makers need to share insights, delegate tasks, and review the history of their work to ensure a healthy journey from ideation to productization of machine learning models. Git-based version-control systems allow us to centralize data science practices in a sharable environment. By using Git repositories with Jupyter Notebooks, we can coauthor projects, track code changes, and amalgamate software engineering and data science practices for production-ready code management.

Additionally, notebooks in a notebook instance are stored on durable Amazon Elastic Block Store (EBS) volumes. However, they don’t persist beyond the life of the notebook instance. That means that if you delete your notebook instance, you will lose your work. Storing notebooks in a Git repository enables you to decouple Jupyter Notebooks from the instance lifecycle and keep them as standalone documents that can be referenced and reused in the future.

Finally, most of the publicly available content about machine learning and deep learning techniques are provided on Jupyter Notebooks that are hosted in Git repositories, such as GitHub. Cloning these notebooks seamlessly onto your notebook instances speeds up the learning process by allowing you to easily discover, execute, and share the publicly available learning material.

There are two ways to associate Git repositories with Amazon SageMaker notebook instances:

  • If you want to clone a public Git repository, which doesn’t require any credentials, you can simply provide the URL for the repository while creating a notebook instance. Amazon SageMaker will kick off your instance with the Git repository cloned onto it.
  • If you want to associate a private Git repository that requires credentials or personal access token, or if you want to store public Git repository information for future use, you first need to add this Git repository as a resource in your Amazon SageMaker account. When you add a Git repository that requires authentication, you can specify an AWS Secrets Manager secret that contains credentials or personal access token to access the repository. After you add a Git repository as a resource, you can create and use as many notebook instances as you need to be associated with this repository.

Since it’s comprehensive, I’ll walk you through the second use case where we introduce a private Git repository to Amazon SageMaker as a resource and create a notebook instance that is associated with this Git repository.

Add a Git repository to your Amazon SageMaker account

You can add Git repositories to your Amazon SageMaker account in the AWS Management Console or by using the AWS CLI.

To add a Git repository to Amazon SageMaker using the AWS Management Console, open the Amazon SageMaker console at https://console.thinkwithwp.com/sagemaker/.

In the left navigation pane, choose Git repositories, which provides a centralized visibility and management for all of your Git repositories. Choose Add repository.

To add an AWS CodeCommit repository, choose AWS CodeCommit. Here, you can create a new AWS CodeCommit repository or use an existing one. Please note that the repository name must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and – (hyphen).

If you are creating a new AWS CodeCommit repository, the action button to Add repository will be active after your AWS CodeCommit repository is created.

To add a Git repository hosted somewhere other than AWS CodeCommit, choose GitHub/Other Git-based repo.

Enter the URL for the repository and a name to use for the repository in Amazon SageMaker. The name must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and – (hyphen).

For Git credentials, enter the credentials to use to authenticate to the repository. For GitHub repositories, instead of your account password, we strongly recommend using a Personal Access Token generated by your Git service provider due to its convenience and safety.

Amazon SageMaker uses AWS Secrets Manager behind the scenes to securely store Git credentials for private Git repositories that require authentication. Here, you can either create a new AWS Secrets Manager secret or choose an existing one. For more information about AWS Secrets Manager secrets or about using your company’s LDAP credentials with AWS Secrets Manager, see the AWS Secrets Manager User Guide.

If you are creating a new secret to store your credentials, the action button to Add repository will be active after your new secret is created.

You can view and manage all Git repositories that you have associated with Amazon SageMaker under the Git repositories menu.

To add a Git repository to Amazon SageMaker using the CLI, use the create-code-repository AWS CLI command.

If you are adding a private Git repository other than AWS CodeCommit, you first need to create an AWS Secrets Manager secret to store your credentials and obtain the Amazon Resource Name (ARN) of the AWS Secrets Manager secret to provide while using create-code-repository AWS CLI command.

Ensure that your IAM role has a policy update to give you permission to access for GetSecretValue.

Also, the secret must be in the following format:

{“username”: UserName, “password”: Password}

If you are adding a public Git repository, you don’t need an AWS Secrets Manager secret.

Specify a name for the repository as the value of the code-repository-name argument. The name must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and – (hyphen). Specify the default branch, the URL of the Git repository, and the Amazon Resource Name (ARN) of an AWS Secrets Manager secret that contains the credentials to use to authenticate the repository as the value of the git-config argument.

The following command creates a new repository named MyRespository in your Amazon SageMaker account that points to a Git repository hosted at https://github.com/myprofile/my-repo”.

aws sagemaker create-code-repository \ --code-repository-name "MyRepository" \ --git-config '{"Branch":"master", \ "RepositoryUrl" : "https://github.com/myprofile/my-repo", \ "SecretArn" : "arn:aws:secretsmanager:us-east-2:012345678901:secret:my-secret-ABc0DE"}'

Create a notebook instance with associated Git repositories

To create an instance with Git repositories cloned to it, go to Notebook instances on the Amazon SageMaker console, and choose Create notebook instance.

Follow the steps described in the Amazon SageMaker Developer Guide for other configurations, such as Amazon Virtual Private Cloud (VPC) or AWS Identity and Access Management (IAM).

You can  use existing AWS CodeCommit repositories that you have not created with Amazon SageMaker, but directly with AWS CodeCommit. However, you need to ensure that you have either added the “AmazonSageMaker-” prefix to the name of the repository (for example, AmazonSageMaker-MyAWSCodeCommitRepository) or that you have updated the IAM policy for your notebook instance’s execution role to grant permission to Amazon SageMaker for accessing your AWS CodeCommit repository. Update the IAM policy for your notebook instance’s execution role to have codecommit:GitPull and codecommit:GitPush permissions. For a full list of AWS CodeCommit permissions, see the AWS CodeCommit User Guide.

To clone Git repositories, use the menu to specify which repositories you want to clone:

Here, if you want to use a public repository that you haven’t added or don’t want to add to your Amazon SageMaker account, you can select Clone a public Git repository to this notebook instance only. In this case, you can simply paste the public URL for the repository, and Amazon SageMaker will clone it to your notebook instance.

You can also select Add a repository to Amazon SageMaker, which will lead you to the previous menu where we added repositories to Amazon SageMaker.

Finally, you can see the repositories that you have added to Amazon SageMaker on the menu. If you just added a repository and don’t see it yet on the menu, try to refresh the menu by using the refresh button.

You can select one default repository and up to three additional repositories to be associated with your notebook instance.

Your notebook instance will be created with the Git repositories cloned to it.

Open JupyterLab to see your repositories on the left menu.

If you prefer to execute these actions using CLI commands, refer to the Amazon SageMaker Developer Guide for the details.

Using Git repositories in a notebook instance

Your notebook instance will open in the default repository, which is installed in your notebook instance under /home/ec2-user/SageMaker. You can manually run Git commands in a notebook cell. For example:

!git pull origin master

To open any of the additional repositories, navigate up one folder. The additional repositories are also installed as directories under /home/ec2-user/SageMaker.

In collaboration with the Project Jupyter community, the Amazon SageMaker team has redesigned and developed an open-source Git extension for JupyterLab. If you are not a fan of CLI commands, the Git extension provides an intuitive and visual way to collaborate on JupyterLab. You can use the Git extension to create and switch branches, stage and commit code changes, send push and pull requests to shared repositories, see the version history in detail, and revert to previous versions when needed.

If you open the notebook instance with a JupyterLab interface, the jupyter-git extension is installed and available to use. For information about the JupyterLab Git extension, visit the JupyterLab GitHub page.

Conclusion

By using Git workflows easily with notebooks, you will be able to clone content to your JupyterLab workbench, participate in multiple-coauthor projects, and branch your data science work within the organization’s broader development and production workflows.

 


About the Author

Erkan Tas is a Senior Product Manager for Amazon SageMaker. He is on a mission to make Artificial Intelligence easy, accessible, and scalable through AWS platforms. He is also a sailor, science and nature admirer, Go and Stratocaster player.