AWS Public Sector Blog
Use Amazon SageMaker to perform data analytics in AWS GovCloud (US) Regions
Introduction
Amazon SageMaker is a fully managed machine learning (ML) service that provides various capabilities, including Jupyter Notebook instances. While RStudio, a popular integrated development environment (IDE) for R, is available as a managed service in Amazon Web Services (AWS) commercial Regions, it’s currently not offered in AWS GovCloud (US) Regions. However, you can use SageMaker notebook instances with the R kernel to perform data analytics tasks in AWS GovCloud (US) Regions.
In this post, we’ll walk through the steps to create a notebook instance with the R kernel and demonstrate how to query data stored in an AWS Glue Data Catalog repository using R. In this example, we use the RAthena library, which requires Amazon Athena; however, if you’re more experienced with R and prefer other R libraries for querying data, you’re free to use those instead.
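As one hedged illustration of such an alternative, the following minimal sketch uses the noctua package, which provides a DBI interface to Athena similar to RAthena but built on the paws SDK. The bucket, database, and table names are placeholders, and this is a sketch rather than part of the walkthrough that follows.
# Hypothetical alternative: noctua offers a DBI interface to Athena built on the paws SDK
install.packages("noctua")
library(DBI)
con <- dbConnect(noctua::athena(), s3_staging_dir = "s3://<s3-bucket-to-store-athena-results>")
dbGetQuery(con, "SELECT * FROM <myGlueDatabase>.<myGlueTableName> LIMIT 10")
dbDisconnect(con)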
Prerequisites
Before you begin, verify that you have the following:
- An AWS Account in one of the two AWS GovCloud (US) Regions
- An AWS Glue Data Catalog repository in an AWS GovCloud (US) Region with tables you want to analyze
Solution walkthrough
Step 1: Create a SageMaker notebook instance as outlined in the Amazon SageMaker Developer Guide.
- If it’s your first time using the SageMaker console, it may prompt you to create a SageMaker domain. For the functionality outlined in this post, that’s unnecessary. Select Notebook instances in the left-hand menu.
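If you prefer to script this step rather than use the console, a minimal sketch with the paws SDK for R is shown below. The instance name, account ID, and execution role are placeholders, and the role must already exist with the appropriate SageMaker permissions; this is an assumption-laden sketch, not a required part of the walkthrough.
# Hypothetical sketch: create the notebook instance programmatically with the paws SDK for R
install.packages("paws")
library(paws)
sm <- sagemaker()  # uses the AWS credentials and Region configured in your environment
sm$create_notebook_instance(
  NotebookInstanceName = "r-analytics-notebook",  # placeholder instance name
  InstanceType = "ml.t3.medium",
  RoleArn = "arn:aws-us-gov:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder execution role
)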
Step 2: Open the Jupyter Notebook application.
- Once the notebook instance status changes to InService, locate the instance in the list.
- In the Actions column, choose Open Jupyter. This will open the Jupyter Notebook application in a new browser tab.
Step 3: Create a new R notebook.
- In the Jupyter Notebook application, choose New in the top-right corner.
- From the dropdown menu, select R.
- A new notebook with an R kernel will open in your workspace.
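If you’d like to confirm that the R kernel is active before continuing, you can run a quick check in the first cell, for example:
# Quick check that the notebook is running an R kernel
R.version.string  # prints the installed R version
sessionInfo()     # lists the R version, platform, and loaded packages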
Step 4: Query data from AWS Glue Data Catalog.
In this example, we demonstrate how to read data from an existing table in the AWS Glue Data Catalog and perform basic data analysis using R. In a Jupyter notebook, you write and run code in cells; a cell is a rectangular box where you can enter one or more lines of code and execute them. In this walkthrough, we enter one line of code per cell and run the cells one at a time.
- In the first empty cell of your new notebook, type or copy the following line of code to install the RAthena library on your notebook instance.
install.packages("RAthena")
- Once you’ve entered this line of code, you can execute it by pressing Shift+Enter or by choosing the Play button in the toolbar (it looks like a right-pointing triangle).
- After executing the cell, the output of your code will be displayed immediately below the cell.
- Create a new cell in your notebook by choosing the + button in the top menu or by using the keyboard shortcut (Esc, then B, to insert a new cell below).
- In the new cell, type the following line of code to load the RAthena library that we just installed.
library(RAthena)
- Execute the new cell by pressing Shift+Enter or choosing the Play button again.
- Repeat this process (create a new cell, enter the line of code, and run it) for each line of the following code, replacing the placeholder values in angle brackets with your own information.
Sys.setenv(AWS_ATHENA_S3_STAGING_DIR = "s3://<s3-bucket-to-store-athena-results>") # S3 location for Athena query results
Sys.setenv(AWS_ATHENA_WORK_GROUP = "primary") # Athena workgroup name
con <- dbConnect(RAthena::athena()) # connect to Athena using the settings above
query <- dbSendQuery(con, "SELECT * FROM <myGlueDatabase>.<myGlueTableName> LIMIT 10") # submit the query
result <- dbFetch(query) # retrieve the results as an R data frame
print(result)
You should now have a notebook that looks similar to Figure 1, though yours might differ slightly depending on your outputs. If your code and permissions are correct, you should see the printed results of your query.
As an example, Figure 2 displays the first five results from an AWS Glue Data Catalog table that’s populated with book titles and authors.
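Because dbFetch returns a standard R data frame, you can also explore the results with familiar base R functions. The snippet below is a generic sketch; the actual columns and values depend on your table.
# Basic exploration of the fetched results (a standard R data frame)
str(result)      # column names and data types
summary(result)  # summary statistics for each column
head(result, 5)  # first five rows
nrow(result)     # number of rows returned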
Now that you have an active connection, you can use other SQL statements and queries with the familiar R language syntax.
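For example, a minimal sketch of additional DBI calls against the same connection might look like the following; the database, table, and column names are placeholders that you would replace with your own.
# List the tables visible to this Athena connection
dbListTables(con)

# Run a filtered aggregation in a single step with dbGetQuery (placeholder names)
top_authors <- dbGetQuery(con, "
  SELECT <author_column>, COUNT(*) AS title_count
  FROM <myGlueDatabase>.<myGlueTableName>
  GROUP BY <author_column>
  ORDER BY title_count DESC
  LIMIT 5")
print(top_authors)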
Cleanup
Step 1: Disconnect connection.
- Once you’re finished, you can disconnect your session by running the following command in a new cell in the notebook.
dbDisconnect(con)
Step 2: Stop notebook instance.
- Go back to the SageMaker console.
- Expand Notebooks.
- Select Notebook instances.
- Select the radio button next to your instance.
- Select the Actions dropdown and choose Stop.
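Alternatively, if you prefer to script this step, a minimal sketch with the paws SDK is shown below; run it from any R environment with credentials for your account, and note that the instance name is a placeholder.
# Hypothetical sketch: stop the notebook instance programmatically with the paws SDK
library(paws)
sm <- sagemaker()
sm$stop_notebook_instance(NotebookInstanceName = "r-analytics-notebook")  # placeholder instance name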
SageMaker notebook ml.t2.medium and ml.t3.medium instances fall under the SageMaker Free Tier for 250 hours each month for the first two months, so it’s a best practice to shut down notebook instances when not in use to optimize cost. You can also use a SageMaker lifecycle configuration to shut down instances automatically after a set duration (for example, 30 minutes).
Conclusion
In this post, we demonstrated how to use SageMaker notebooks with the R kernel to perform data analytics tasks in AWS GovCloud (US) Regions. By using SageMaker notebooks, you can enjoy a fully managed Jupyter Notebook environment without needing self-hosted or managed infrastructure. This approach provides a convenient and scalable solution for working with R in AWS GovCloud (US) Regions, enabling you to analyze data stored in your AWS Glue Data Catalog repository and use the vast ecosystem of R packages and libraries.
Explore a wealth of setup guides, usage tips, and diverse examples by visiting the Amazon SageMaker Examples GitHub repository. Dive deeper into the possibilities of SageMaker and elevate your ML journey today.