AWS Machine Learning Blog

Creating a persistent custom R environment for Amazon SageMaker

Amazon SageMaker is a fully managed service that allows you to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models. In August 2019, Amazon SageMaker announced the availability of the pre-installed R kernel in all Regions. This capability is available out-of-the-box and comes with the reticulate library pre-installed. This library offers an R interface for the Amazon SageMaker Python SDK, which enables you to invoke Python modules from within an R script.

This post discusses how to create a custom R environment (kernel) in Amazon SageMaker on top of the built-in R kernel and how to persist that between sessions. The post explains how to install a new package in the R environment, how this new environment can be saved on Amazon Simple Storage Service (Amazon S3), and how you can use it to create new Amazon SageMaker instances using the Amazon SageMaker lifecycle configuration. The post also includes bash scripts that you can use for lifecycle configurations when creating or starting an Amazon SageMaker notebook instance.

Background

The R kernel in Amazon SageMaker is built using the IRKernel package, which installs a kernel with the name ir and a display name of R in a Jupyter environment.

You can manage this environment with Conda and install specific packages and dependencies. However, by default, packages that you install into the R kernel from a notebook instance don’t persist across notebook instance sessions. Every time you stop and start an Amazon SageMaker instance, the R kernel returns to its default environment.

This post walks you through the process of installing R packages in Amazon SageMaker using the following sources:

  • Anaconda Cloud
  • CRAN
  • GitHub

After you create your environment, you save it on the instance’s Amazon Elastic Block Store (Amazon EBS) storage to make it persistent. You can also store this environment on Amazon S3 and use it to build custom R environments for new Amazon SageMaker instances. For more information, see Customize a Notebook Instance Using a Lifecycle Configuration Script.

Creating an Amazon SageMaker notebook instance with the R kernel

To create an Amazon SageMaker notebook instance with the R kernel, complete the following steps:

  1. Create a notebook instance.
  2. When the instance status shows as In Service, open Jupyter.
  3. From the New drop-down menu, choose R.

When the new notebook opens, you should see the R logo on the upper right corner of the notebook space.

For more details about creating an Amazon SageMaker notebook instance with R kernel, visit the coding with R on Amazon SageMaker notebook instances blog post.

Installing packages in the Amazon SageMaker R kernel

The Amazon SageMaker R kernel comes with over 140 standard packages. To get the list of these installed packages, you can run the following script in a SageMaker notebook instance with R kernel:

installed.packages()

If you need to install additional packages, you can install from Anaconda Cloud, a CRAN archive, or directly from GitHub.

Installing from Anaconda Cloud

The preferred method for installing R packages is to install the package from the Anaconda Cloud repository. This method gives you access to different channels (such as R and Conda Forge), which allows you to install specific versions of the package. If you’re doing this in Amazon SageMaker using the R kernel, use the system() command to submit the conda install command.

If you’re installing the package from the Amazon SageMaker Jupyter bash terminal, you can run conda install directly, as follows:

conda install -n R -c conda-forge r-rjava

In an Amazon SageMaker notebook that uses the R kernel, wrap the same command in system() instead:

system("conda install -n R -c conda-forge r-rjava")

The preceding code uses the conda-forge channel, which installs rJava version 0.9_12 (at the time this blog post was published). However, the following code, which uses the r channel, installs version 0.9_11 (at the time this blog post was published):

system("conda install -n R -c r r-rjava")

To find the specific package name and choose the correct channel for your version, visit the Anaconda Cloud website and search for the package. R packages are named in the “r-<package_name>” format.

Conda is the preferred method for installing packages, and Anaconda Cloud is the preferred archive because it provides access to the most stable versions of Conda environments.
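The “r-<package_name>” naming convention can be sketched as a small shell helper. This is a minimal illustration, not part of the original post: the function names conda_r_name and conda_install_cmd are hypothetical, and the sketch assumes that the Conda package name is the lowercased R package name with an r- prefix (as with r-rjava for rJava). Always confirm the actual name and channel on Anaconda Cloud. The helper only prints the command so you can review it before running it.

```shell
#!/bin/bash
# Hypothetical helper: map an R package name to its likely Conda package name.
# Assumes the "r-<package_name>" convention with the name lowercased
# (for example, rJava -> r-rjava); verify the result on Anaconda Cloud.
conda_r_name() {
    printf 'r-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

# Hypothetical helper: print (do not run) the install command for the R environment.
# $1 = R package name, $2 = channel (defaults to conda-forge).
conda_install_cmd() {
    pkg="$1"
    channel="${2:-conda-forge}"
    printf 'conda install -n R -c %s %s\n' "$channel" "$(conda_r_name "$pkg")"
}

conda_install_cmd rJava conda-forge
conda_install_cmd rJava r
```

Printing the command first makes it easy to compare channels before you commit to one, because the channel determines which package version you get.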

Installing from the CRAN archive

As an alternative to Anaconda, you can use the Comprehensive R Archive Network (CRAN) archive. The CRAN archive is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. You can use this archive to install packages in R using install.packages(). This installs the latest version of the package. See the following code:

install.packages(c('mlbench', 'MVar'),
      repos = 'http://cran.rstudio.com',
      dependencies = TRUE)

Load the package in your R code as follows:

library(mlbench)

Amazon SageMaker instances use the Amazon Linux AMI, a distribution that evolved from Red Hat Enterprise Linux (RHEL) and CentOS. It’s available for use within the Amazon Elastic Compute Cloud (Amazon EC2) instances that run Amazon SageMaker. If you’re planning to install packages directly from source, make sure you select the build for the right operating system. You can check the operating system by running the following command in the Amazon SageMaker Jupyter bash terminal:

sh-4.2$ cat /etc/os-release

The output looks like the following (at the time of publication):

NAME="Amazon Linux AMI"
VERSION="2018.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2018.03"
PRETTY_NAME="Amazon Linux AMI 2018.03"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2018.03:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"
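
If a script needs to branch on the operating system, it can read individual fields from this file instead of printing all of it. The following is a minimal sketch: the function name os_release_field is hypothetical, and the demo reads a sample file (copied from the output above) rather than the real /etc/os-release, so it runs anywhere.

```shell
#!/bin/bash
# Hypothetical helper: read one field from an os-release style file.
# $1 = path to the file, $2 = key (for example, ID or VERSION_ID).
os_release_field() {
    file="$1"
    key="$2"
    # Keep the first line that starts with "KEY=", drop the prefix and the quotes.
    sed -n "s/^${key}=//p" "$file" | head -n 1 | tr -d '"'
}

# Demo against a sample file; on an instance you would pass /etc/os-release.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
NAME="Amazon Linux AMI"
VERSION="2018.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2018.03"
EOF

echo "ID=$(os_release_field "$sample" ID)"
echo "VERSION_ID=$(os_release_field "$sample" VERSION_ID)"
rm -f "$sample"
```

Anchoring the match at the start of the line keeps a lookup for ID from also matching ID_LIKE or VERSION_ID.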

Installing from GitHub

You can also use devtools and install_github to get the content directly from the package developer’s repository. See the following code:

install.packages("devtools")
devtools::install_github("malcolmbarrett/ggdag")

This installs the package and its dependencies. However, this isn’t the preferred method for installing packages in Amazon SageMaker.

Persisting the custom R environment between sessions

By default, Amazon SageMaker launches the base R kernel every time you stop and start an Amazon SageMaker instance. Any additional packages you install are lost when you stop the instance, and you have to reinstall the packages when you start the instance again. This is time-consuming and cumbersome. The solution is to save the environment on the EBS storage of the instance and link it to a custom R kernel upon startup using the Amazon SageMaker lifecycle configuration script. For more information, see Customize a Notebook Instance Using a Lifecycle Configuration Script.

This section outlines the steps to make your custom R environment persistent.

Saving the environment on Amazon SageMaker EBS

You first need to save the environment on the instance’s EBS storage by cloning the environment. You can run the following command in the Amazon SageMaker Jupyter bash terminal:

conda create --prefix /home/ec2-user/SageMaker/envs/custom-r --clone R

This creates an envs/custom-r folder under the Amazon SageMaker folder on your instance EBS, which you have access to. See the following screenshot.

If you want to use this custom environment later in the same Amazon SageMaker instance (and not in a different instance), you can skip to the Lifecycle configuration to start the instance with the custom R environment section of this post.

Saving the environment to Amazon S3 to create new Amazon SageMaker instances

To reuse the custom R environment when creating new Amazon SageMaker instances (for example, for your development team), save the environment to Amazon S3 as a .zip file and download it to the instance at the Create step. You can run the following commands in the Amazon SageMaker Jupyter bash terminal:

zip -r ~/SageMaker/custom_r.zip ~/SageMaker/envs/
aws s3 cp ~/SageMaker/custom_r.zip s3://[YOUR BUCKET]/

Lifecycle configuration to create new instances with the custom R environment

To create a new instance and use the custom environment in that instance, you need to bring the .zip environment from Amazon S3 to the instance. You can do this automatically on the Amazon SageMaker console with the lifecycle configuration script. This script downloads the .zip file from Amazon S3 to the /SageMaker/ folder on the instance’s EBS, unzips the file, recreates the /envs/ folder, and removes the redundant folders.

  1. On the Amazon SageMaker console, under Notebook, choose Lifecycle configurations.
  2. Select Create Configuration.
  3. Name it Custom-R-Env.
  4. On the Create notebook tab, enter the following script.
    #!/bin/bash
    ## On-Create: Bringing custom environment from S3 to SageMaker instance
    ## NOTE: Your SageMaker IAM role should have access to this bucket
    
    sudo -u ec2-user -i <<'EOF'
    aws s3 cp s3://[YOUR BUCKET]/custom_r.zip ~/SageMaker/
    unzip ~/SageMaker/custom_r.zip -d ~/SageMaker/
    mv ~/SageMaker/home/ec2-user/SageMaker/envs/ ~/SageMaker/envs
    rm -rf ~/SageMaker/home/
    rm ~/SageMaker/custom_r.zip
    EOF
    

  5. Press Create Configuration.
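
The mv and rm steps in the On-Create script are needed because the archive was built from an absolute path, so unzipping it recreates a nested home/ec2-user/SageMaker/envs/ tree under the SageMaker folder. The following sketch reproduces that reshuffle in a scratch directory so you can try it safely. The function name flatten_envs and the two guard checks are assumptions added for illustration, and mkdir stands in for the unzip step.

```shell
#!/bin/bash
# Hypothetical helper: move the nested envs/ folder (as the unzip step leaves it)
# up to <sagemaker_dir>/envs and remove the leftover home/ tree.
# $1 = the SageMaker folder (on an instance: /home/ec2-user/SageMaker).
flatten_envs() {
    sagemaker_dir="$1"
    nested="$sagemaker_dir/home/ec2-user/SageMaker/envs"
    if [ ! -d "$nested" ]; then
        echo "nothing to move: $nested not found" >&2
        return 1
    fi
    if [ -e "$sagemaker_dir/envs" ]; then
        # Guard added for this sketch: never clobber an existing envs/ folder.
        echo "refusing to overwrite $sagemaker_dir/envs" >&2
        return 1
    fi
    mv "$nested" "$sagemaker_dir/envs"
    rm -rf "$sagemaker_dir/home"
}

# Demo in a scratch directory; mkdir mimics what unzip would create.
scratch="$(mktemp -d)"
mkdir -p "$scratch/home/ec2-user/SageMaker/envs/custom-r"
flatten_envs "$scratch"
ls "$scratch"        # lists only: envs
ls "$scratch/envs"   # lists: custom-r
rm -rf "$scratch"
```

The guards make the step safe to re-run by hand: a second call reports that there is nothing to move instead of nesting one envs folder inside another.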

Lifecycle configuration to start the instance with the custom R environment

This step is the same whether you created the custom R environment in the same instance and cloned it to the ./envs/ folder or downloaded the .zip file from Amazon S3 while creating the instance.

This script creates a symbolic link between the ./envs/ folder (which contains the custom R environment) and the Anaconda custom-r environment. This allows the environment to be listed under the kernels in Amazon SageMaker.

  1. On the Amazon SageMaker console, under Notebook, choose Lifecycle configurations.
  2. Select Create Configuration.
  3. Name it Custom-R-Env. (If you already created the configuration in the previous step, select it from the list and choose Edit.)
  4. On the Start notebook tab, enter the following script:
    #!/bin/bash
    ## On-Start: After you set up the environment in the instance
    ## then you can have this life-cycle config to link the custom env with kernel
    
    sudo -u ec2-user -i <<'EOF'    
    ln -s /home/ec2-user/SageMaker/envs/custom-r /home/ec2-user/anaconda3/envs/custom-r
    EOF
    echo "Restarting the Jupyter server..."
    restart jupyter-server
    

  5. Press Create Configuration (or Update if you are editing an existing configuration).
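
If you want to experiment with the linking step by hand, a slightly more defensive variant can help. The following sketch is an assumption-laden illustration rather than the script from this post: the function name link_env is hypothetical, it checks that the saved environment exists before linking, and it uses ln -sfn so that repeating the command replaces an existing link instead of failing or creating a nested one.

```shell
#!/bin/bash
# Hypothetical helper: link a saved environment into the Conda envs folder.
# $1 = saved environment (on an instance: /home/ec2-user/SageMaker/envs/custom-r)
# $2 = link to create    (on an instance: /home/ec2-user/anaconda3/envs/custom-r)
link_env() {
    src="$1"
    dest="$2"
    if [ ! -d "$src" ]; then
        echo "missing environment: $src" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat an existing link to a directory as a file, not as a target folder.
    ln -sfn "$src" "$dest"
}

# Demo in a scratch directory standing in for the instance paths.
scratch="$(mktemp -d)"
mkdir -p "$scratch/SageMaker/envs/custom-r" "$scratch/anaconda3/envs"
link_env "$scratch/SageMaker/envs/custom-r" "$scratch/anaconda3/envs/custom-r"
link_env "$scratch/SageMaker/envs/custom-r" "$scratch/anaconda3/envs/custom-r"  # safe to repeat
readlink "$scratch/anaconda3/envs/custom-r"
rm -rf "$scratch"
```

Checking for the source folder first gives a clear error message when the environment was never cloned or downloaded, which is otherwise easy to miss because a dangling link is created without complaint.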

Assigning the lifecycle configuration to an Amazon SageMaker instance

You can assign a lifecycle configuration when creating a notebook instance. For more information, see Customize a Notebook Instance Using a Lifecycle Configuration Script.

To create a notebook with your lifecycle configuration (Custom-R-Env), you need to assign the script to the notebook under the Additional Configuration section. All other steps are the same as creating any Amazon SageMaker instance.
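
The same assignment can also be made from the command line. The following sketch assumes that the AWS CLI is installed and configured. The helper name create_notebook_cmd, the notebook name, the role ARN, and the instance type are all placeholders for illustration. The helper only prints the aws sagemaker create-notebook-instance command so that you can review it before running it yourself.

```shell
#!/bin/bash
# Hypothetical helper: print (do not run) an AWS CLI call that creates a
# notebook instance with a lifecycle configuration attached.
# $1 = notebook name, $2 = IAM role ARN,
# $3 = lifecycle configuration name, $4 = instance type (both optional).
create_notebook_cmd() {
    name="$1"
    role_arn="$2"
    lifecycle="${3:-Custom-R-Env}"
    instance_type="${4:-ml.t2.medium}"
    printf 'aws sagemaker create-notebook-instance'
    printf ' --notebook-instance-name %s' "$name"
    printf ' --instance-type %s' "$instance_type"
    printf ' --role-arn %s' "$role_arn"
    printf ' --lifecycle-config-name %s\n' "$lifecycle"
}

# Placeholder values; replace them with your own name and role ARN.
create_notebook_cmd my-r-notebook "arn:aws:iam::123456789012:role/MySageMakerRole"
```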

Using the custom R environment

If you’re opening the existing instance where you created the custom environment, you should see your existing files and code, as well as the /envs/ folder.

However, if you’re creating a new instance and used the lifecycle script to bring the environment from Amazon S3, complete the following steps:

  1. When your instance status shows as In Service, open Jupyter. You should see an /envs/ folder in your Amazon SageMaker files. That is your custom environment.
  2. From the New drop-down menu, choose conda_r_custom-r.

You now have a notebook with your custom R environment. In your notebook, you should see the R logo in the upper right corner of the Jupyter environment, which indicates that the kernel is an R kernel, and the name of your kernel should be conda_r_custom-r. To test the environment, import one of the libraries that you included in the custom environment (for example, rJava).

library(rJava)

Your custom R environment is now up and running in the instance, and you can program in R using the reticulate package.

Conclusion

This post walked you through creating a custom, persistent R environment for Amazon SageMaker notebook instances. For example notebooks for R on Amazon SageMaker, see the Amazon SageMaker examples GitHub repository. For more details about creating an Amazon SageMaker notebook instance with R kernel, visit the coding with R on Amazon SageMaker notebook instances blog post. For more details on using Amazon SageMaker features from R, see the R User Guide to Amazon SageMaker in the developer guide. In addition, for more resources to further your experience with Amazon SageMaker, see the AWS Machine Learning Blog.


About the author

Nick Minaie is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solution Architect, helping customers on their journey to well-architected machine learning solutions at scale. In his spare time, Nick enjoys abstract painting and loves exploring nature.