AWS Big Data Blog
Extract, Transform and Load data into S3 data lake using CTAS and INSERT INTO statements in Amazon Athena
April 2024: This post was reviewed for accuracy.
Amazon Athena is an interactive query service that makes it easy to analyze the data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. You can reduce your per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats. To learn more about best practices to boost query performance and reduce costs, see Top 10 Performance Tuning Tips for Amazon Athena.
Overview
This blog post discusses how to use Athena for extract, transform and load (ETL) jobs for data processing. This example optimizes the dataset for analytics by partitioning it and converting it to a columnar data format using Create Table as Select (CTAS) and INSERT INTO statements.
CTAS statements create new tables using standard SELECT queries to filter data as required. You can also partition the data, specify compression, and convert the data into columnar formats like Apache Parquet and Apache ORC using CTAS statements. As part of the execution, the resultant tables and partitions are added to the AWS Glue Data Catalog, making them immediately available for subsequent queries.
INSERT INTO statements insert new rows into a destination table based on a SELECT query that runs on a source table. If the source table’s underlying data is in CSV format and the destination table’s data is in Parquet format, then INSERT INTO can transform and load the data into the destination table’s format. CTAS and INSERT INTO statements can be used together to perform an initial batch conversion of data as well as incremental updates to the existing table.
Here is an overview of the ETL steps to be followed in Athena for data conversion:
- Create a table on the original dataset.
- Use a CTAS statement to create a new table in which the format, compression, partition fields and location of the new table can be specified.
- Add more data into the table using an INSERT INTO statement.
This example uses a subset of NOAA Global Historical Climatology Network Daily (GHCN-D), a publicly available dataset on Amazon S3.
This subset of data is available at the following S3 location:
Procedure
Follow these steps to use Athena for an ETL job.
Create a table based on the original dataset
The original data is in CSV format with no partitions in Amazon S3. The following files are stored at the Amazon S3 location:
Note that the individual files are small. Merging them into larger files and reducing the total number of files leads to faster query execution. CTAS and INSERT INTO can help achieve this.
Execute the queries in the Athena console (preferably in us-east-1, to avoid inter-region Amazon S3 data transfer charges). First, create a database for this demo:
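A minimal sketch of the statement; the database name blogdb is an illustrative placeholder:

```sql
-- Create a database for the demo; the name "blogdb" is an illustrative placeholder.
CREATE DATABASE blogdb;
```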
Now, create a table from the data above.
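A sketch of the external table definition, assuming the GHCN-D daily column layout (station ID, observation date, element code, value, and flag/time fields) and a placeholder S3 location; adjust both to match the actual copy of the dataset you are querying:

```sql
-- External table over the original CSV files.
-- Column names follow the public GHCN-D daily layout and are assumptions;
-- replace the LOCATION placeholder with the actual S3 path of the subset.
CREATE EXTERNAL TABLE original_csv (
  id        STRING,
  `date`    STRING,
  element   STRING,
  datavalue BIGINT,
  mflag     STRING,
  qflag     STRING,
  sflag     STRING,
  obstime   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://<your-bucket>/<path-to-csv-files>/';
```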
Use CTAS to partition data and convert into Parquet format with Snappy compression
Now, convert the data to Parquet format with Snappy compression and partition it on a yearly basis, all with a single CTAS statement. For the purpose of this blog, the initial table includes only data from 2015 to 2019; you add the remaining data to this table later using the INSERT INTO command.
The table created in the first step has a date field formatted as YYYYMMDD (e.g. 20100104). The new table is partitioned on year. Extract the year value from the date field using the Presto function substr("date", 1, 4).
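A sketch of the CTAS statement under these assumptions; the output location is a placeholder, and the source table name original_csv matches the sketch above:

```sql
-- CTAS: write Parquet with Snappy compression, partition by year,
-- and load only 2015-2019 for the initial batch.
CREATE TABLE new_parquet
WITH (
  format              = 'PARQUET',
  parquet_compression = 'SNAPPY',
  partitioned_by      = ARRAY['year'],
  external_location   = 's3://<your-bucket>/optimized-data/'  -- placeholder
) AS
SELECT id,
       "date",
       element,
       datavalue,
       mflag,
       qflag,
       sflag,
       obstime,
       substr("date", 1, 4) AS year        -- partition column must come last
FROM   original_csv
WHERE  substr("date", 1, 4) BETWEEN '2015' AND '2019';
```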
Once the query succeeds, check the Amazon S3 location specified in the CTAS statement above. You should see the year partitions, and Parquet files within each partition, as shown in the following examples:
- Partitions:
- Parquet files:
Add more data into the table using an INSERT INTO statement
Now, add more data and partitions into the new table created above. The original dataset has data from 2010 to 2019. Since you added 2015 to 2019 using CTAS, add the rest of the data now using an INSERT INTO statement:
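A sketch of the INSERT INTO statement for the remaining years (2010–2014), using the same assumed table names:

```sql
-- Append the 2010-2014 data (and its partitions) to the Parquet table.
INSERT INTO new_parquet
SELECT id,
       "date",
       element,
       datavalue,
       mflag,
       qflag,
       sflag,
       obstime,
       substr("date", 1, 4) AS year
FROM   original_csv
WHERE  substr("date", 1, 4) < '2015';
```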
List the Amazon S3 location of the new table:
You can see that INSERT INTO determines that "year" is a partition column and writes the data to Amazon S3 accordingly. There is also a significant reduction in the total size of the dataset, thanks to compression and columnar storage in the Parquet format:
You can also run INSERT INTO statements as more CSV data is added to the original table. Suppose new data for the year 2020 is added to the original Amazon S3 dataset. In that case, run the following INSERT INTO statement to add the data and the relevant partition(s) to the new_parquet table:
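A sketch of that statement, filtering the assumed source table to the newly arrived year:

```sql
-- Append the 2020 data and its partition to new_parquet.
INSERT INTO new_parquet
SELECT id,
       "date",
       element,
       datavalue,
       mflag,
       qflag,
       sflag,
       obstime,
       substr("date", 1, 4) AS year
FROM   original_csv
WHERE  substr("date", 1, 4) = '2020';
```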
Query the results
Now that you have transformed the data, run some queries to see what you gained in terms of performance and cost optimization:
First, find the number of distinct IDs for every value of the year:
- Query on the original table (the first statement in the sketch below)
- Query on the new table (the second statement in the sketch below)
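Hedged sketches of both queries, using the assumed table names from above; on the original table the year must be derived from the date string, while the new table can use its year partition column directly:

```sql
-- Query on the original (CSV) table: derive the year from the date string.
SELECT substr("date", 1, 4) AS year,
       count(DISTINCT id)   AS distinct_ids
FROM   original_csv
GROUP  BY 1
ORDER  BY 1;

-- Query on the new (partitioned Parquet) table: use the year column directly.
SELECT year,
       count(DISTINCT id) AS distinct_ids
FROM   new_parquet
GROUP  BY 1
ORDER  BY 1;
```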
| | Run time | Data scanned | Cost |
|---|---|---|---|
| Original table | 16.88 seconds | 11.35 GB | $0.0567 |
| New table | 3.79 seconds | 428.05 MB | $0.002145 |

Savings: 77.5% faster and 96.2% cheaper.
Next, calculate the average maximum temperature (Celsius), average minimum temperature (Celsius), and average rainfall (mm) for the Earth in 2018:
- Query on the original table (the first statement in the sketch below)
- Query on the new table (the second statement in the sketch below)
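Hedged sketches of both queries; the element codes (TMAX, TMIN, PRCP) and the division by 10 (GHCN-D stores temperatures in tenths of a degree Celsius and precipitation in tenths of a millimeter) are assumptions based on the public GHCN-D documentation:

```sql
-- Query on the original (CSV) table: filter 2018 via the date string.
SELECT element,
       round(avg(datavalue) / 10.0, 2) AS avg_value
FROM   original_csv
WHERE  substr("date", 1, 4) = '2018'
  AND  element IN ('TMAX', 'TMIN', 'PRCP')
GROUP  BY element;

-- Query on the new (partitioned Parquet) table: prune partitions with year = '2018'.
SELECT element,
       round(avg(datavalue) / 10.0, 2) AS avg_value
FROM   new_parquet
WHERE  year = '2018'
  AND  element IN ('TMAX', 'TMIN', 'PRCP')
GROUP  BY element;
```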
| | Run time | Data scanned | Cost |
|---|---|---|---|
| Original table | 18.65 seconds | 11.35 GB | $0.0567 |
| New table | 1.92 seconds | 68.08 MB | $0.000345 |

Savings: 90% faster and 99.4% cheaper.
Conclusion
This post showed you how to perform ETL operations using CTAS and INSERT INTO statements in Athena. You can perform the first set of transformations using a CTAS statement. When new data arrives, use an INSERT INTO statement to transform and load data to the table created by the CTAS statement. Using this approach, you converted data to the Parquet format with Snappy compression, converted a non-partitioned dataset to a partitioned dataset, reduced the overall size of the dataset and lowered the costs of running queries in Athena.
About the Author
Pathik Shah is a big data architect for Amazon EMR at AWS.
Audit History
Last reviewed and updated in April 2024 by Priyanka Chaudhary | Sr. Solutions Architect