Perform ETL operations using Amazon Redshift RSQL
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
There are many ways to interact with Amazon Redshift. You can programmatically access data in your Amazon Redshift cluster by using the Amazon Redshift Data API, or you can use a web-based interface such as Amazon Redshift Query Editor V2 to author SQL queries. You can also interact with Amazon Redshift in interactive or batch mode via Amazon Redshift RSQL.
Previously, you had to use the PostgreSQL psql command line tool to interact with Amazon Redshift for simple use cases such as importing and exporting data to and from Amazon Redshift or submitting SQL in batch or interactive mode, or for advanced use cases such as performing ETL operations without writing complex ETL jobs.
Now you can use Amazon Redshift RSQL, a new command line tool to connect to an Amazon Redshift cluster and perform operations such as describing database objects, querying data, viewing and exporting query results in various output file formats, running scripts that include both SQL and complex business logic, performing data cleansing and transformation operations using familiar SQL, and writing ETL and ELT scripts using enhanced control flow and advanced error handling. Moreover, if you’re migrating from self-managed data warehousing engines such as Teradata and using Teradata BTEQ (Basic Teradata Query) scripting for data automation, ETL, or other tasks, you can now migrate those scripts to Amazon Redshift RSQL.
This post explains how to use Amazon Redshift RSQL for ETL, data cleansing and preparation, enhanced control flow, and exception and error handling.
Solution overview
Amazon Redshift RSQL enables you to connect to your Amazon Redshift cluster via a command line. It supports the capabilities of the PostgreSQL psql command line tool with an additional set of Amazon Redshift specific capabilities:
- Describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, and tables shared using Amazon Redshift data sharing
- Use enhanced control flow commands such as \IF, \ELSEIF, \ELSE, \ENDIF, \GOTO, and \LABEL
- Use single sign-on (SSO) authentication using ADFS, PingIdentity, Okta, Azure AD, or other SAML/JWT based identity providers (IdPs), as well as use browser-based SAML IdPs with multi-factor authentication (MFA)
The following are some commonly used commands in Amazon Redshift RSQL. We use these commands in this post to demonstrate different possible use cases using Amazon Redshift RSQL scripts.
- \echo – Prints the specified string to the output.
- \remark – An extension of the \echo command that has the ability to break the output over separate lines.
- \goto – Skips all intervening commands and SQL statements and resumes processing at the specified \LABEL in sequence. The \LABEL must be a forward reference. You can’t jump to a \LABEL that lexically precedes the \GOTO.
- \label – Establishes an entry point for running the program, as the target for a \GOTO command.
- \exit – Stops running Amazon Redshift RSQL. You can also specify an optional exit code.
- \q – Logs off database sessions and exits Amazon Redshift RSQL.
Prerequisites
The following are the prerequisites for using Amazon Redshift RSQL and performing the steps in this post:
- An AWS account
- Linux, Windows, or macOS operating system (Amazon Redshift RSQL is available for these operating systems)
- An Amazon Redshift cluster
- SQL knowledge
Additionally, complete the following prerequisite steps:
- Install Amazon Redshift RSQL on your local machine. For instructions, refer to Getting started with Amazon Redshift RSQL.
- Connect to the Amazon Redshift cluster.
- Create the orders and orders_summary tables using the following DDL statements:
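A minimal sketch of the two tables, assuming a TPC-H-style orders dataset (the column names and types are illustrative, not the post’s original schema):

```sql
-- Illustrative DDL; adjust column names and types to your own dataset
CREATE TABLE IF NOT EXISTS orders (
    o_orderkey      BIGINT,
    o_custkey       BIGINT,
    o_orderstatus   VARCHAR(1),
    o_totalprice    DECIMAL(18,2),
    o_orderpriority VARCHAR(15),
    o_orderdate     DATE
);

CREATE TABLE IF NOT EXISTS orders_summary (
    o_orderkey      BIGINT,
    total_price     DECIMAL(18,2),
    order_year      INTEGER,
    priority_code   VARCHAR(5),
    priority_label  VARCHAR(15)
);
```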
Import data into the Amazon Redshift cluster
There are multiple ways to load data into Amazon Redshift tables, including using Amazon Redshift RSQL. In this section, we review the syntax and an example of the Amazon Redshift RSQL script to load data into an Amazon Redshift table using the COPY command.
We use the following syntax:
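In outline, the COPY invocation issued from the RSQL script looks like the following (a sketch; the angle-bracket placeholders are described below):

```sql
COPY <target_table>
FROM '<location>'
IAM_ROLE '<arn>'
REGION '<region>';
```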
We provide the following parameters:
- <location> – The location of the source data to be loaded into the target table
- <arn> – The AWS Identity and Access Management (IAM) role for accessing the data
- <region> – The AWS Region of the source data
In the following example Amazon Redshift RSQL script, we load data from an Amazon Simple Storage Service (Amazon S3) bucket location into the orders table:
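A minimal sketch of such a script; the S3 path and IAM role ARN are placeholders to replace with your own values:

```sql
\echo 'Starting load into the orders table'

-- The S3 path and IAM role ARN below are placeholders
COPY orders
FROM 's3://example-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
REGION 'us-east-1';

\echo 'Load complete'
```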
Enhanced control flow
You can use Amazon Redshift RSQL to define programmatic enhanced control flow and conditional blocks in your ETL script. We use the following syntax:
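A sketch of the conditional block’s shape, built from the control flow commands listed earlier (the \elseif branch is optional and repeatable):

```sql
\if <condition>
    <code_block1>
\elseif <condition>
    <code_block1>
\else
    <code_block2>
\endif
```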
The syntax includes the following components:
- <condition> – The conditional statement
- <code_block1> – The code block to run when the condition is met
- <code_block2> – The code block to run when the condition is not met
In the following example script, we perform some conditional checks using the \if, \elseif, and \else commands based on the count of records from the orders table, and we display some messages based on the record count value:
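This sketch uses the :ACTIVITYCOUNT variable (described in the next section) to capture the number of rows a SELECT returns; the thresholds and messages are illustrative:

```sql
-- Retrieve rows from orders; :ACTIVITYCOUNT then holds the number of rows returned
-- (selecting rows only to count them is for illustration)
SELECT o_orderkey FROM orders;

\if :ACTIVITYCOUNT = 0
    \echo 'No records found in the orders table'
\elseif :ACTIVITYCOUNT < 1000
    \echo 'Fewer than 1,000 records in the orders table'
\else
    \echo 'At least 1,000 records in the orders table'
\endif
```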
Error handling
You can use Amazon Redshift RSQL to define exception handling blocks in your ETL script to handle various user-defined and system-generated error scenarios that you might encounter while running the ETL process.
The following are some of the error handling options supported in Amazon Redshift RSQL:
- :ACTIVITYCOUNT – This variable is similar to the psql variable ROW_COUNT, which returns the number of rows affected by the last SQL statement. However, ACTIVITYCOUNT reports the affected row count for SELECT, COPY, and UNLOAD statements, which ROW_COUNT does not. For SQL statements returning data, ACTIVITYCOUNT specifies the number of rows returned to Amazon Redshift RSQL from the database.
- :ERRORCODE – This variable contains the return code for the last SQL statement submitted to the database. Error code 0 specifies that the SQL statement completed without any errors.
- :ERRORLEVEL – This variable is used to assign severity levels to errors. You can use these severity levels to determine a course of action. The default value is ON.
- :MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL halts job processing. If SQL statements in Amazon Redshift RSQL scripts produce an error severity level greater than the designated maxerror value, Amazon Redshift RSQL immediately exits.
- :LAST_ERROR_MESSAGE – This variable contains the error message of the most recent failed query.
We use the following syntax:
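In outline, the pattern checks :ERRORCODE after a SQL statement (a sketch; the exact predicate form may vary):

```sql
<SQL statement>;

\if :ERRORCODE <> 0
    \echo 'Failed with:' :LAST_ERROR_MESSAGE
    <code_block1>
\else
    <code_block2>
\endif
```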
The syntax includes the following information:
- <ERRORCODE> – The error code number
- <LAST_ERROR_MESSAGE> – The error message of the most recent failed query
- <code_block1> – The code block to run when the error condition is met
- <code_block2> – The code block to run when the error condition is not met
In the following example script, we create the orders_staging table and copy records into it from an Amazon S3 location. The script also contains an exception handling section for both the table creation and the copy process to handle the possible errors encountered along the way.
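A sketch, with an illustrative staging schema and placeholder S3 path and IAM role ARN; it uses only the commands and variables described above:

```sql
\echo 'Creating the orders_staging table'

DROP TABLE IF EXISTS orders_staging;
CREATE TABLE orders_staging (
    o_orderkey      BIGINT,
    o_custkey       BIGINT,
    o_orderstatus   VARCHAR(1),
    o_totalprice    VARCHAR(20),   -- kept as raw text; cast later during transformation
    o_orderpriority VARCHAR(15),
    o_orderdate     DATE
);

\if :ERRORCODE <> 0
    \echo 'Table creation failed:' :LAST_ERROR_MESSAGE
    \goto HANDLE_ERROR
\endif

-- Placeholder S3 path and IAM role ARN
COPY orders_staging
FROM 's3://example-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
REGION 'us-east-1';

\if :ERRORCODE <> 0
    \echo 'Copy failed:' :LAST_ERROR_MESSAGE
    \goto HANDLE_ERROR
\endif

\echo 'orders_staging loaded successfully'
\exit 0

\label HANDLE_ERROR
\echo 'Exiting with a non-zero code so a calling scheduler can detect the failure'
\exit 1
```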
Data transformation and preparation
You can perform some common data preparation and transformation operations on your dataset using SQL statements in the Amazon Redshift RSQL ETL script. In this section, we demonstrate data transformation and preparation operations such as casting, new data column creation, and splitting an input column into multiple columns.
We use industry standard SQL statements for transforming and preparing data for downstream consumption.
In the following example script, we use a SQL statement to transform the data from the orders_staging table and insert it into the orders_summary table:
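A sketch using the illustrative columns from the earlier DDL: it casts the raw o_totalprice text to a decimal, derives a new order_year column, and splits o_orderpriority (for example, '1-URGENT') into a code and a label:

```sql
INSERT INTO orders_summary (o_orderkey, total_price, order_year, priority_code, priority_label)
SELECT
    o_orderkey,
    CAST(o_totalprice AS DECIMAL(18,2))   AS total_price,    -- casting
    EXTRACT(year FROM o_orderdate)        AS order_year,     -- new derived column
    SPLIT_PART(o_orderpriority, '-', 1)   AS priority_code,  -- split one input column
    SPLIT_PART(o_orderpriority, '-', 2)   AS priority_label  -- into two output columns
FROM orders_staging;
```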
Export data from an Amazon Redshift cluster and output file formatting options
You can use Amazon Redshift RSQL to extract data from one or multiple Amazon Redshift tables and write it to disk for consumption by downstream applications. Amazon Redshift RSQL uses the \EXPORT option to export the result of a query to an output file.
The following are some of the useful output formatting options supported in RSQL:
- \rset rformat on – This command is required for all the formatting commands to take effect.
- \pset format – Formats can include aligned, AsciiDoc, CSV, HTML, LaTeX, LaTeX longtable, troff-ms, unaligned, and wrapped.
- \pset border – This option specifies border information for output data. Value 0 means no border, 1 means internal dividing lines, and 2 means table frame.
- \rset heading – This command adds the specified heading to the output result.
- \rset rtitle – This command adds the specified heading as well as the current system date of the client computer.
- \rset titledashes on/off – This command specifies whether to print a line of dash characters between the column names and column data returned for the SQL query.
- \rset width – This command specifies the target width for each line in a report.
- \t – This command turns off printing column names as well as result row count at the end of the output (footers).
We use the following syntax:
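In outline, the export wraps a query between an \export report directive and an \export reset (a sketch; treat the exact meta-command form as indicative):

```sql
\export report file=<'FilePath/Filename'>
\rset rformat on
<SQL Query>;
\export reset
```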
We provide the following information:
- <‘FilePath/Filename’> – The file name and path for the output file
- <SQL Query> – The SQL query to run
In the following example script, we export the data from the orders_summary table using a SQL query and write it into the orders.txt text file on the local machine:
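A sketch, assuming the illustrative orders_summary table and a writable local path, and following the \export pattern above:

```sql
\export report file='/tmp/orders.txt'
\rset rformat on
\rset heading 'Orders Summary Report'
\rset titledashes on

SELECT * FROM orders_summary;

\export reset
```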
Automate the Amazon Redshift RSQL ETL script
One option for running Amazon Redshift RSQL scripts on a specific schedule is shell scripting. You can schedule the shell script via a cron job.
We use the following syntax:
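In outline (a sketch, assuming a preconfigured ODBC DSN for the cluster and standard shell redirection for logging):

```sh
rsql -D <DSN> < <RSQL_SCRIPT> > <LOG_FILE> 2>&1
```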
We provide the following information:
- <RSQL_SCRIPT> – The Amazon Redshift RSQL script to run
- <LOG_FILE> – The output log file
In the following example shell script, we run the Amazon Redshift RSQL script that we created and write the output to a log file on the local machine. You can then schedule the shell script via a cron job.
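A sketch of the wrapper script; the DSN name and file paths are placeholders:

```sh
#!/bin/sh
# Placeholder paths; adjust for your environment
RSQL_SCRIPT=/home/ec2-user/scripts/redshift_etl.sql
LOG_FILE=/home/ec2-user/logs/redshift_etl_$(date +%Y%m%d_%H%M%S).log

# 'etl_dsn' is an assumed ODBC DSN configured for the Amazon Redshift cluster;
# stdout and stderr are both captured in the log file
rsql -D etl_dsn < "$RSQL_SCRIPT" > "$LOG_FILE" 2>&1
```

To run the script daily at 2:00 AM, for example, you could add a crontab entry such as `0 2 * * * /home/ec2-user/scripts/run_etl.sh`.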
Clean up
To avoid incurring future charges, stop the Amazon Redshift cluster created for the purpose of this post.
Conclusion
In this post, we explained how to use Amazon Redshift RSQL to perform ETL operations. We also demonstrated how to implement advanced error handling and enhanced control flow in your Amazon Redshift RSQL ETL script.
If you’re using scripts via the psql command line client on Amazon Redshift, you can run them with Amazon Redshift RSQL with no modification. If you’re migrating your Teradata BTEQ scripts to Amazon Redshift RSQL, you can use the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts.
To learn more, refer to Amazon Redshift RSQL.
About the Authors
Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.
Sudipta Bagchi is a Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.