Perform ETL operations using Amazon Redshift RSQL
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
There are many ways to interact with Amazon Redshift. You can programmatically access data in your Amazon Redshift cluster by using the Amazon Redshift Data API, or you can use a web-based interface such as Amazon Redshift Query Editor V2 to author SQL queries. You can also interact with Amazon Redshift in interactive or batch mode via Amazon Redshift RSQL.
Previously, you had to use the PostgreSQL psql command line tool to interact with Amazon Redshift for simple use cases such as importing and exporting data to and from Amazon Redshift or submitting SQL in batch or interactive mode, or for advanced use cases such as performing ETL operations without writing complex ETL jobs.
Now you can use Amazon Redshift RSQL, a new command line tool to connect to an Amazon Redshift cluster and perform operations such as describing database objects, querying data, viewing and exporting query results in various output file formats, running scripts that include both SQL and complex business logic, performing data cleansing and transformation operations using familiar SQL, and writing ETL and ELT scripts using enhanced control flow and advanced error handling. Moreover, if you’re migrating from self-managed data warehousing engines such as Teradata and using Teradata BTEQ (Basic Teradata Query) scripting for data automation, ETL, or other tasks, you can now migrate those scripts to Amazon Redshift RSQL.
This post explains how to use Amazon Redshift RSQL for ETL, data cleansing and preparation, enhanced control flow, and exception and error handling.
Solution overview
Amazon Redshift RSQL enables you to connect to your Amazon Redshift cluster via a command line. It supports the capabilities of the PostgreSQL psql command line tool with an additional set of Amazon Redshift specific capabilities:
- Describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, and tables shared using Amazon Redshift data sharing
- Use enhanced control flow commands such as \IF, \ELSEIF, \ELSE, \ENDIF, \GOTO, and \LABEL
- Use single sign-on (SSO) authentication using ADFS, PingIdentity, Okta, Azure AD, or other SAML/JWT based identity providers (IdPs), as well as use browser-based SAML IdPs with multi-factor authentication (MFA)
The following are some commonly used commands in Amazon Redshift RSQL. We use these commands in this post to demonstrate different possible use cases using Amazon Redshift RSQL scripts.
- \echo – Prints the specified string to the output.
- \remark – An extension of the \echo command that has the ability to break the output over separate lines.
- \goto – Skips all intervening commands and SQL statements and resumes processing at the specified \LABEL in sequence. The \LABEL must be a forward reference. You can’t jump to a \LABEL that lexically precedes the \GOTO.
- \label – Establishes an entry point for running the program, as the target for a \GOTO command.
- \exit – Stops running Amazon Redshift RSQL. You can also specify an optional exit code.
- \q – Logs off database sessions and exits Amazon Redshift RSQL.
Prerequisites
The following are the prerequisites for using Amazon Redshift RSQL and performing the steps in this post:
- An AWS account
- Linux, Windows, or macOS operating system (Amazon Redshift RSQL is available for these operating systems)
- An Amazon Redshift cluster
- SQL knowledge
Additionally, complete the following prerequisite steps:
- Install Amazon Redshift RSQL on your local machine. For instructions, refer to Getting started with Amazon Redshift RSQL.
- Connect to the Amazon Redshift cluster.
- Create the orders and orders_summary tables using the following DDL statements:
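A minimal sketch of the two tables, assuming a TPC-H-style orders dataset (the column names and types are illustrative, not the post’s original schema):

```sql
-- Illustrative DDL; adjust column names and types to your own dataset
CREATE TABLE IF NOT EXISTS orders (
    o_orderkey      BIGINT,
    o_custkey       BIGINT,
    o_orderstatus   VARCHAR(1),
    o_totalprice    DECIMAL(18,2),
    o_orderpriority VARCHAR(15),
    o_orderdate     DATE
);

CREATE TABLE IF NOT EXISTS orders_summary (
    o_orderkey      BIGINT,
    total_price     DECIMAL(18,2),
    order_year      INTEGER,
    priority_code   VARCHAR(5),
    priority_label  VARCHAR(15)
);
```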
Import data into the Amazon Redshift cluster
There are multiple ways to load data into Amazon Redshift tables, including using Amazon Redshift RSQL. In this section, we review the syntax and an example of the Amazon Redshift RSQL script to load data into an Amazon Redshift table using the COPY command.
We use the following syntax:
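In outline, the COPY invocation issued from the RSQL script looks like the following (a sketch; the angle-bracket placeholders are described below):

```sql
COPY <target_table>
FROM '<location>'
IAM_ROLE '<arn>'
REGION '<region>';
```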
We provide the following parameters:
- <location> – The location of the source data to be loaded into the target table
- <arn> – The AWS Identity and Access Management (IAM) role for accessing the data
- <region> – The AWS Region of the source data
In the following example Amazon Redshift RSQL script, we load data from an Amazon Simple Storage Service (Amazon S3) bucket location into the orders table:
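A minimal sketch of such a script; the S3 path and IAM role ARN are placeholders to replace with your own values:

```sql
\echo 'Starting load into the orders table'

-- The S3 path and IAM role ARN below are placeholders
COPY orders
FROM 's3://example-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
REGION 'us-east-1';

\echo 'Load complete'
```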
Enhanced control flow
You can use Amazon Redshift RSQL to define programmatic enhanced control flow and conditional blocks in your ETL script. We use the following syntax:
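A sketch of the conditional block’s shape, built from the control flow commands listed earlier (the \elseif branch is optional and repeatable):

```sql
\if <condition>
    <code_block1>
\elseif <condition>
    <code_block1>
\else
    <code_block2>
\endif
```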
The syntax includes the following components:
- <condition> – The conditional statement
- <code_block1> – The code block to run when the condition is met
- <code_block2> – The code block to run when the condition is not met
In the following example script, we perform some conditional checks using the \if, \elseif, and \else commands based on the count of records from the orders table, and we display some messages based on the record count value:
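This sketch uses the :ACTIVITYCOUNT variable (described in the next section) to capture the number of rows a SELECT returns; the thresholds and messages are illustrative:

```sql
-- Retrieve rows from orders; :ACTIVITYCOUNT then holds the number of rows returned
-- (selecting rows only to count them is for illustration)
SELECT o_orderkey FROM orders;

\if :ACTIVITYCOUNT = 0
    \echo 'No records found in the orders table'
\elseif :ACTIVITYCOUNT < 1000
    \echo 'Fewer than 1,000 records in the orders table'
\else
    \echo 'At least 1,000 records in the orders table'
\endif
```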
Error handling
You can use Amazon Redshift RSQL to define exception handling blocks in your ETL script to handle various user-defined and system-generated error scenarios that you might encounter while running the ETL process.
The following are some of the error handling options supported in Amazon Redshift RSQL:
- :ACTIVITYCOUNT – This variable is similar to the psql variable ROW_COUNT, which returns the number of rows affected by the last SQL statement. However, ACTIVITYCOUNT reports the affected row count for SELECT, COPY, and UNLOAD statements, which ROW_COUNT does not. For SQL statements returning data, ACTIVITYCOUNT specifies the number of rows returned to Amazon Redshift RSQL from the database.
- :ERRORCODE – This variable contains the return code for the last SQL statement submitted to the database. Error code 0 specifies that the SQL statement completed without any errors.
- :ERRORLEVEL – This variable is used to assign severity levels to errors. You can use these severity levels to determine a course of action. The default value is ON.
- :MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL halts job processing. If SQL statements in Amazon Redshift RSQL scripts produce an error severity level greater than the designated maxerror value, Amazon Redshift RSQL immediately exits.
- :LAST_ERROR_MESSAGE – This variable contains the error message of the most recent failed query.
We use the following syntax:
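In outline, the pattern checks :ERRORCODE after a SQL statement (a sketch; the exact predicate form may vary):

```sql
<SQL statement>;

\if :ERRORCODE <> 0
    \echo 'Failed with:' :LAST_ERROR_MESSAGE
    <code_block1>
\else
    <code_block2>
\endif
```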
The syntax includes the following information:
- <ERRORCODE> – The error code number
- <LAST_ERROR_MESSAGE> – The error message of the most recent failed query
- <code_block1> – The code block to run when the error condition is met
- <code_block2> – The code block to run when the error condition is not met
In the following example script, we create the orders_staging table and copy records into it from an Amazon S3 location. The script also contains an exception handling section for both the table creation and the copy process to handle the possible errors encountered along the way.
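A sketch, with an illustrative staging schema and placeholder S3 path and IAM role ARN; it uses only the commands and variables described above:

```sql
\echo 'Creating the orders_staging table'

DROP TABLE IF EXISTS orders_staging;
CREATE TABLE orders_staging (
    o_orderkey      BIGINT,
    o_custkey       BIGINT,
    o_orderstatus   VARCHAR(1),
    o_totalprice    VARCHAR(20),   -- kept as raw text; cast later during transformation
    o_orderpriority VARCHAR(15),
    o_orderdate     DATE
);

\if :ERRORCODE <> 0
    \echo 'Table creation failed:' :LAST_ERROR_MESSAGE
    \goto HANDLE_ERROR
\endif

-- Placeholder S3 path and IAM role ARN
COPY orders_staging
FROM 's3://example-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
REGION 'us-east-1';

\if :ERRORCODE <> 0
    \echo 'Copy failed:' :LAST_ERROR_MESSAGE
    \goto HANDLE_ERROR
\endif

\echo 'orders_staging loaded successfully'
\exit 0

\label HANDLE_ERROR
\echo 'Exiting with a non-zero code so a calling scheduler can detect the failure'
\exit 1
```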
Data transformation and preparation
You can perform some common data preparation and transformation operations on your dataset using SQL statements in the Amazon Redshift RSQL ETL script. In this section, we demonstrate data transformation and preparation operations such as casting, new data column creation, and splitting an input column into multiple columns.
We use industry standard SQL statements for transforming and preparing data for downstream consumption.
In the following example script, we use a SQL statement to transform the data from the orders_staging table and insert it into the orders_summary table:
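A sketch using the illustrative columns from the earlier DDL: it casts the raw o_totalprice text to a decimal, derives a new order_year column, and splits o_orderpriority (for example, '1-URGENT') into a code and a label:

```sql
INSERT INTO orders_summary (o_orderkey, total_price, order_year, priority_code, priority_label)
SELECT
    o_orderkey,
    CAST(o_totalprice AS DECIMAL(18,2))   AS total_price,    -- casting
    EXTRACT(year FROM o_orderdate)        AS order_year,     -- new derived column
    SPLIT_PART(o_orderpriority, '-', 1)   AS priority_code,  -- split one input column
    SPLIT_PART(o_orderpriority, '-', 2)   AS priority_label  -- into two output columns
FROM orders_staging;
```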
Export data from an Amazon Redshift cluster and output file formatting options
You can use Amazon Redshift RSQL to extract data from one or multiple Amazon Redshift tables and write it to disk for consumption by downstream applications. Amazon Redshift RSQL uses the \EXPORT option to export the result of a query to an output file.
The following are some of the useful output formatting options supported in RSQL:
- \rset rformat on – This command is required for all the formatting commands to take effect.
- \pset format – Formats can include aligned, AsciiDoc, CSV, HTML, LaTeX, LaTeX longtable, troff-ms, unaligned, and wrapped.
- \pset border – This option specifies border information for output data. Value 0 means no border, 1 means internal dividing lines, and 2 means table frame.
- \rset heading – This command adds the specified heading to the output result.
- \rset rtitle – This command adds the specified heading as well as the current system date of the client computer.
- \rset titledashes on/off – This command specifies whether to print a line of dash characters between the column names and column data returned for the SQL query.
- \rset width – This command specifies the target width for each line in a report.
- \t – This command turns off printing column names as well as result row count at the end of the output (footers).
We use the following syntax:
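In outline, the export wraps a query between an \export report directive and an \export reset (a sketch; treat the exact meta-command form as indicative):

```sql
\export report file=<'FilePath/Filename'>
\rset rformat on
<SQL Query>;
\export reset
```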
We provide the following information:
- <‘FilePath/Filename’> – The file name and path for the output file
- <SQL Query> – The SQL query to run
In the following example script, we export the data from the orders_summary table using a SQL query and write it into the orders.txt text file on the local machine:
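A sketch, assuming the illustrative orders_summary table and a writable local path, and following the \export pattern above:

```sql
\export report file='/tmp/orders.txt'
\rset rformat on
\rset heading 'Orders Summary Report'
\rset titledashes on

SELECT * FROM orders_summary;

\export reset
```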
Automate the Amazon Redshift RSQL ETL script
One option for running Amazon Redshift RSQL scripts on a specific schedule is shell scripting. You can schedule the shell script via a cron job.
We use the following syntax:
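In outline (a sketch, assuming a preconfigured ODBC DSN for the cluster and standard shell redirection for logging):

```sh
rsql -D <DSN> < <RSQL_SCRIPT> > <LOG_FILE> 2>&1
```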
We provide the following information:
- <RSQL_SCRIPT> – The Amazon Redshift RSQL script to run
- <LOG_FILE> – The output log file
In the following example shell script, we run the Amazon Redshift RSQL script that we created and write the output to a log file on the local machine. You can then schedule the shell script via a cron job.
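A sketch of the wrapper script; the DSN name and file paths are placeholders:

```sh
#!/bin/sh
# Placeholder paths; adjust for your environment
RSQL_SCRIPT=/home/ec2-user/scripts/redshift_etl.sql
LOG_FILE=/home/ec2-user/logs/redshift_etl_$(date +%Y%m%d_%H%M%S).log

# 'etl_dsn' is an assumed ODBC DSN configured for the Amazon Redshift cluster;
# stdout and stderr are both captured in the log file
rsql -D etl_dsn < "$RSQL_SCRIPT" > "$LOG_FILE" 2>&1
```

To run the script daily at 2:00 AM, for example, you could add a crontab entry such as `0 2 * * * /home/ec2-user/scripts/run_etl.sh`.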
Clean up
To avoid incurring future charges, stop the Amazon Redshift cluster created for the purpose of this post.
Conclusion
In this post, we explained how to use Amazon Redshift RSQL to perform ETL operations. We also demonstrated how to implement advanced error handling and enhanced control flow in your Amazon Redshift RSQL ETL script.
If you’re using scripts via the psql command line client on Amazon Redshift, you can run them with Amazon Redshift RSQL with no modification. If you’re migrating your Teradata BTEQ scripts to Amazon Redshift RSQL, you can use the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts.
To learn more, refer to Amazon Redshift RSQL.
About the Authors
Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.
Sudipta Bagchi is a Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.