Redacting sensitive information with user-defined functions in Amazon Athena

Amazon Athena now supports user-defined functions (in Preview), a feature that enables you to write custom scalar functions and invoke them in SQL queries. Although Athena provides built-in functions, UDFs enable you to perform custom processing such as compressing and decompressing data, redacting sensitive data, or applying customized decryption. You can write your UDFs in Java using the Athena Query Federation SDK. When a UDF is used in a SQL query submitted to Athena, it’s invoked and run on AWS Lambda. You can use UDFs in both SELECT and FILTER clauses of a SQL query, and invoke multiple UDFs in the same query. Athena UDF functionality is available in Preview mode in the US East (N. Virginia) Region.

This blog post covers the basic functionality of Athena UDFs, writing your own UDF, building and publishing your UDF to the AWS Serverless Application Repository, configuring the UDF connector application, and using the UDF in Athena queries. You can implement use cases involving sensitive data, such as personally identifiable information (PII), credit card numbers, or Social Security numbers (SSN). In this post, we deploy a Redact UDF and use it to mask sensitive information.

Prerequisites

Before creating your development environment, you must have the following prerequisites:

An AWS account that provides access to AWS services
An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI)
Permissions for that IAM user to create IAM roles and policies, and to create stacks in AWS CloudFormation
Git, Maven, and an AWS Cloud9 integrated development environment (IDE) installed

To set up the development environment and address these prerequisites, deploy the CloudFormation template in the first part of this series, Extracting and joining data from multiple data sources with Athena Federated Query. The post provides instructions on building the required test environment and resources using the CloudFormation template.

Creating your IDE

Deploying the CloudFormation template creates the required AWS resources. To build and deploy the UDF, we use an AWS Cloud9 IDE as the development environment. On the AWS Cloud9 console, locate your environment and choose Open IDE.

Resizing the AWS Cloud9 disk

The AWS Cloud9 IDE comes with 10 GB of disk space by default, which can fill quickly when setting up the development environment, so you should resize it.

Run the following command in the AWS Cloud9 IDE terminal to get the resize script:

curl https://aws-bigdata-blog.s3.amazonaws.com/artifacts/aws-data-analytics-workshop/athena-workshop/scripts/cloud9_resize.sh > cloud9_resize.sh

Run the script by issuing the following command on the terminal to resize the disk to 20 GB:
```
sh cloud9_resize.sh 20
```

Check the free space on the disk with the following command:
```
df -h
```

You should see something like the following screenshot.

Setting up the IDE

Next, you clone the SDK and prepare your IDE.

Install Git, if it isn't already present on your system, with the following command:
```
sudo yum install git -y
```

To install the Athena Query Federation SDK, enter the following command at the command line to clone the SDK repository. This repository includes the SDK, examples, and a suite of data source connectors.
```
git clone https://github.com/awslabs/aws-athena-query-federation.git
```

If you’re working on a development machine that already has Apache Maven, the AWS CLI, and the AWS Serverless Application Model build tool installed, you can skip the next step.

From the root of the aws-athena-query-federation directory that you created when you cloned the repository, run the prepare_dev_env.sh script that prepares your development environment:
```
cd aws-athena-query-federation

sudo chown ec2-user:ec2-user ~/.profile

./tools/prepare_dev_env.sh
```

This script prompts for input during setup; press Enter as needed when prompted. You can edit the script to remove the prompts if you want to automate the setup entirely.

Update your shell to source new variables created by the installation process or restart your terminal session:
```
source ~/.profile
```

Run the following code from the athena-federation-sdk directory within the GitHub project you checked out earlier:
```
mvn clean install
```

Adding the UDF code and publishing the connector

In this section, you add your UDF function, build the JAR file, and deploy the connector.

In the AWS Cloud9 IDE, expand the aws-athena-query-federation project and navigate to the AthenaUDFHandler.java file.
Choose the file (double-click) to open it for editing.

Now we add the UDF code for a String Redact function, which masks a string so that only the last four characters remain visible. You can use this UDF to mask sensitive information.

Enter the following code:

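    /**
     * Redacts a string to show only the last 4 characters.
     *
     * @param input the string to redact
     * @return the redacted string; null or inputs of 4 or fewer characters are returned unchanged
     */
    public String redact(String input)
    {
        // Guard so that null or short values do not throw; returning them unchanged is an assumption.
        if (input == null || input.length() <= 4) {
            return input;
        }
        int maskedLength = input.length() - 4;
        // Build a run of 'x' characters as long as the part being hidden.
        String mask = new String(new char[maskedLength]).replace("\0", "x");
        return new StringBuilder(input).replace(0, maskedLength, mask).toString();
    }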

You can also copy the modified code with the following command (which must be run from the aws-athena-query-federation directory):

curl https://aws-bigdata-blog.s3.amazonaws.com/artifacts/aws-data-analytics-workshop/athena-workshop/scripts/AthenaUDFHandler.java > athena-udfs/src/main/java/com/amazonaws/athena/connectors/udfs/AthenaUDFHandler.java

After copying the file, you can open it in the AWS Cloud9 IDE to see its contents.
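
Before building, it can help to check the masking logic in isolation. The following is a minimal standalone sketch, not the connector code itself: the class name RedactDemo and the sample values are hypothetical, and passing null or short inputs through unchanged is an assumption about the desired behavior rather than something the SDK prescribes.

```java
// Standalone sketch of the last-four masking rule.
// In the connector, redact is an instance method on AthenaUDFHandler.
final class RedactDemo {
    private static final int VISIBLE_CHARS = 4;

    /** Replaces every character except the last four with 'x'. */
    static String redact(String input) {
        // Assumption: null and short inputs pass through unchanged.
        if (input == null || input.length() <= VISIBLE_CHARS) {
            return input;
        }
        int maskedLength = input.length() - VISIBLE_CHARS;
        StringBuilder redacted = new StringBuilder(input.length());
        for (int i = 0; i < maskedLength; i++) {
            redacted.append('x');
        }
        // Keep the trailing characters as they are.
        redacted.append(input, maskedLength, input.length());
        return redacted.toString();
    }

    public static void main(String[] args) {
        System.out.println(redact("123-45-6789"));      // xxxxxxx6789
        System.out.println(redact("4111111111111111")); // xxxxxxxxxxxx1111
        System.out.println(redact("abc"));              // abc
    }
}
```

Note that a value of four or fewer characters is shown in full under this rule. If that is too revealing for your data, you might prefer to mask such values entirely.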

Save the file, then run mvn clean install from the athena-udfs directory to build the JAR file:
```
cd ~/environment/aws-athena-query-federation/athena-udfs/

mvn clean install
```

After it successfully builds, a JAR file is created in the target folder of your project named artifactId-version.jar, where artifactId is the name you provided in the Maven project, for example, athena-udfs.

From the athena-udfs directory, run the following command to publish the connector to your private AWS Serverless Application Repository. Replace S3_BUCKET_NAME with the name of the Amazon Simple Storage Service (Amazon S3) bucket where a copy of the connector's code is stored for the AWS Serverless Application Repository to retrieve.

../tools/publish.sh S3_BUCKET_NAME athena-udfs

This allows users with relevant permission levels to deploy instances of the connector via a one-click form.

When the connector is published successfully, it looks like the following screenshot.

To see AthenaUserDefinedFunctions, choose the link shown in the terminal after the publish is successful or navigate to the AWS Serverless Application Repository by choosing Available Applications, Private applications.

Setting up the UDF connector

Now that the UDF connector code is published, we can install the UDF connector to use with Athena.

Choose the AthenaUserDefinedFunctions application listed in the Private applications section of the AWS Serverless Application Repository.
For Application name, leave it as the default name AthenaUserDefinedFunctions.
For SecretNameOrPrefix, enter the name of a secret if you have already saved one in AWS Secrets Manager; otherwise, enter database-*.
For LambdaFunctionName, enter customudf.
Leave the remaining fields as default.
Select I acknowledge that this app creates custom IAM roles.
Choose Deploy.

Querying with UDF in Athena

Now that the UDF connector code is deployed, we can run Athena queries that use the UDF.

If you ran the CloudFormation template from Part 1 of this blog series, the AmazonAthenaPreviewFunctionality workgroup was already created. If not, choose Create Workgroup on the Athena console, create a workgroup named AmazonAthenaPreviewFunctionality, and set up your query result location in Amazon S3.

To proceed, make sure you are in the workgroup AmazonAthenaPreviewFunctionality. If not, choose the workgroup AmazonAthenaPreviewFunctionality and choose Switch workgroup.

You can now run a query that uses the Redact UDF to mask sensitive information in PII columns. To allow a comparison, the query results include both the original PII columns and the masked values. If you ran the CloudFormation template from Part 1 of this series, you can navigate to Saved Queries on the Athena console and choose RedactUdfCustomerAddress.

The following screenshot shows your query.
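
For readers without the saved query, the general shape of such a query can be sketched as follows. This is a hedged illustration rather than the saved query itself: the class RedactQueryBuilder, the table name customer, and the column names are hypothetical. The USING FUNCTION ... TYPE LAMBDA_INVOKE clause reflects our understanding of the Preview syntax; later Athena releases declare UDFs with USING EXTERNAL FUNCTION ... LAMBDA instead, so check the current documentation before relying on it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: assembles the text of an Athena query that selects each column
// next to its redacted counterpart. Names are concatenated without escaping,
// so this is for illustration only, not for untrusted input.
final class RedactQueryBuilder {
    static String buildQuery(String lambdaName, String table, List<String> columns) {
        // Each column appears twice: the raw value and redact_<column>.
        String selectList = columns.stream()
                .map(c -> c + ", redact(" + c + ") AS redact_" + c)
                .collect(Collectors.joining(",\n       "));
        // Assumption: Preview-era UDF declaration syntax.
        return "USING FUNCTION redact(value VARCHAR) RETURNS VARCHAR\n"
                + "TYPE LAMBDA_INVOKE WITH (lambda_name = '" + lambdaName + "')\n"
                + "SELECT " + selectList + "\n"
                + "FROM " + table + "\n"
                + "LIMIT 10;";
    }

    public static void main(String[] args) {
        // Hypothetical table and columns; customudf is the Lambda function name chosen earlier.
        System.out.println(buildQuery("customudf", "customer",
                Arrays.asList("name", "phone", "address")));
    }
}
```

The lambda_name value must match the LambdaFunctionName entered when deploying the connector (customudf in this walkthrough).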

After the query runs, you should see results like the following screenshot. The redact_name, redact_phone, and redact_address columns only show the last four characters.

Cleaning up

To clean up the resources created as part of your CloudFormation template, complete the following steps:

On the Amazon S3 console, empty and delete the bucket athena-federation-workshop-<account-id>.
If you’re using the AWS CLI, delete the objects in the athena-federation-workshop-<account-id> bucket with the following code (make sure you’re running this command on the correct bucket):
```
aws s3 rm s3://athena-federation-workshop-<account-id> --recursive
```

Use the AWS CloudFormation console or the AWS CLI to delete the stacks Athena-Federation-Workshop and serverlessrepo-AthenaUserDefinedFunctions.

Summary

In this post, you learned about Athena user-defined functions, how to create your own UDF, and how to deploy it to a private AWS Serverless Application Repository. You also learned how to configure the UDF and use it in your Athena queries. In the next post of this series, we demonstrate how to invoke a machine learning (ML) anomaly detection model, developed on Amazon SageMaker, from Athena queries to detect anomalous values in our orders dataset.

About the Authors

Saurabh Bhutyani is a Senior Big Data specialist solutions architect at Amazon Web Services. He is an early adopter of open source Big Data technologies. At AWS, he works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

Amir Basirat is a Big Data specialist solutions architect at Amazon Web Services, focused on Amazon EMR, Amazon Athena, AWS Glue and AWS Lake Formation, where he helps customers craft distributed analytics applications on the AWS platform. Prior to his AWS Cloud journey, he worked as a Big Data specialist for different technology companies. He also has a PhD in computer science, where his research was primarily focused on large-scale distributed computing and neural networks.

AWS Big Data Blog