AWS Big Data Blog
Get started managing partitions for Amazon S3 tables backed by the AWS Glue Data Catalog
Large organizations processing huge volumes of data usually store it in Amazon Simple Storage Service (Amazon S3) and query the data to make data-driven business decisions using distributed analytics engines such as Amazon Athena. If you run queries without considering the optimal data layout on Amazon S3, the result is a high volume of data scanned, long-running queries, and increased cost.
Partitioning is a common technique to lay out your data optimally for distributed analytics engines. By partitioning your data, you can restrict the amount of data scanned by downstream analytics engines, thereby improving performance and reducing the cost for queries.
In this post, we cover the following topics related to Amazon S3 data partitioning:
- Understanding table metadata in the AWS Glue Data Catalog and S3 partitions for better performance
- How to create a table and load partitions in the Data Catalog using Athena
- How partitions are stored in the table
- Different ways to add partitions in a table on the Data Catalog
- Partitioning data while ingesting and cataloging it in Amazon S3
Understanding table metadata in the Data Catalog and S3 partitions for better performance
A table in the AWS Glue Data Catalog is the metadata definition that organizes the data location, data type, and column schema, which represents the data in a data store. Partitions are data organized hierarchically, defining the location where the data for a particular partition resides. Partitioning your data allows you to limit the amount of data scanned by each query, thereby improving performance and reducing cost.
There are a few factors to consider when deciding the columns on which to partition. For example, if you’re using columns as filters, don’t use a column that partitions the data too finely, and don’t choose a column where your data is heavily skewed toward one partition value. You can partition your data by any column. Partition columns are usually chosen based on a common query pattern in your use case. For example, a common practice is to partition the data by year/month/day because many queries tend to run time series analyses in typical use cases. This often leads to a multi-level partitioning scheme. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns.
Let’s look at an example of how partitioning works.
Files corresponding to a single day’s worth of data are placed under a prefix such as s3://my_bucket/logs/year=2023/month=06/day=01/.
If your data is partitioned per day, every day you have a single file, such as the following:
s3://my_bucket/logs/year=2023/month=06/day=01/file1_example.json
s3://my_bucket/logs/year=2023/month=06/day=02/file2_example.json
s3://my_bucket/logs/year=2023/month=06/day=03/file3_example.json
We can use a WHERE clause to query the data as follows:
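(A minimal sketch, assuming a table named logs registered over the preceding location and partitioned by year, month, and day.)

```sql
SELECT *
FROM logs
WHERE year = '2023'
  AND month = '06'
  AND day = '01';
```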
The preceding query reads only the data inside the partition folder year=2023/month=06/day=01 instead of scanning through the files under all partitions. Therefore, it only scans the file file1_example.json.
Systems such as Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by value, eliminating unnecessary (partition) requests to Amazon S3. This capability can improve the performance of applications that specifically need to read a limited number of partitions. For more information about partitioning with Athena and Redshift Spectrum, refer to Partitioning data in Athena and Creating external tables for Redshift Spectrum, respectively.
How to create a table and load partitions in the Data Catalog using Athena
Let’s begin by understanding how to create a table and load partitions using DDL (Data Definition Language) queries in Athena. Note that to demonstrate the various methods of loading partitions into the table, we need to delete and recreate the table multiple times throughout the following steps.
First, we create a database for this demo.
- On the Athena console, choose Query editor.
If this is your first time using the Athena query editor, you need to configure and specify an S3 bucket to store the query results.
- Create a database with the following command:
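(The statement isn’t shown here; a minimal DDL that creates the database used throughout this post is the following.)

```sql
CREATE DATABASE partitions_blog;
```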
- In the Data pane, for Database, choose the database partitions_blog.
- Create the table impressions following the example in Hive JSON SerDe. Replace <myregion> in s3://<myregion>.elasticmapreduce/samples/hive-ads/tables/impressions with the Region identifier where you run Athena (for example, s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions).
- Run the following query to create the table:
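(An abbreviated sketch of the DDL, assuming the us-east-1 sample location; the full column list is in the Hive JSON SerDe example linked above. The key elements for this post are the PARTITIONED BY (dt string) clause, the JSON SerDe, and the LOCATION.)

```sql
CREATE EXTERNAL TABLE impressions (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    usercookie string,
    ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/';
```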
The following screenshot shows the query in the query editor.
- Run the following query to review the data:
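(A simple, hypothetical query to review the data could be the following.)

```sql
SELECT * FROM impressions LIMIT 10;
```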
You can’t see any results because the partitions aren’t loaded yet.
If the partition isn’t loaded into a partitioned table, when the application downloads the partition metadata, the application will not be aware of the S3 path that needs to be queried. For more information, refer to Why do I get zero records when I query my Amazon Athena table.
- Load the partitions using the MSCK REPAIR TABLE command.
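Assuming the table created above, the command is:

```sql
MSCK REPAIR TABLE impressions;
```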
The MSCK REPAIR TABLE
command was designed to manually add partitions that are added to or removed from the file system, such as HDFS or Amazon S3, but are not present in the metastore.
- Query the table again to see the results.
After the MSCK REPAIR TABLE
command scans Amazon S3 and adds partitions to AWS Glue for Hive-compatible partitions, the records under the registered partitions are now returned.
How partitions are stored in the table metadata
We can list the table partitions in Athena by running the SHOW PARTITIONS
command, as shown in the following screenshot.
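For example, assuming the impressions table created earlier:

```sql
SHOW PARTITIONS impressions;
```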
We also can see the partition metadata on the AWS Glue console. Complete the following steps:
- On the AWS Glue console, choose Tables in the navigation pane under Data Catalog.
- Choose the
impressions
table in thepartitions_blog
database.
- On the Partitions tab, choose View Properties next to a partition to view its details.
The following screenshot shows an example of the partition properties.
We can also get the partitions using the AWS Command Line Interface (AWS CLI) command get-partitions, as shown in the following screenshot.
In the get-partitions output, the element “Values” defines the partition value and “Location” defines the S3 path to be queried by the application:
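(A trimmed, illustrative excerpt of the output; the values mirror the partition discussed below.)

```json
{
    "Partitions": [
        {
            "Values": ["2009-04-12-19-05"],
            "StorageDescriptor": {
                "Location": "s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/dt=\"2009-04-12-19-05\""
            }
        }
    ]
}
```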
When querying the data from the partition dt="2009-04-12-19-05", the application lists and reads only the files in the S3 path s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/dt="2009-04-12-19-05".
Different ways to add partitions in a table on the Data Catalog
There are multiple ways to load partitions into the table. You can create tables and partitions directly using the AWS Glue API, SDKs, the AWS CLI, DDL queries in Athena, AWS Glue crawlers, or AWS Glue ETL jobs.
For the next examples, we need to drop and recreate the table. Run the following command in the Athena query editor:
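(Assuming the table name used throughout this post.)

```sql
DROP TABLE impressions;
```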
After that, recreate the table:
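(The definition is the same as before; re-running the abbreviated sketch from earlier recreates the table with no partitions loaded.)

```sql
CREATE EXTERNAL TABLE impressions (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    usercookie string,
    ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/';
```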
Creating partitions individually
If the data arrives in an S3 bucket at a scheduled time, for example every hour or once a day, you can individually add partitions. One way of doing so is by running an ALTER TABLE ADD PARTITION DDL
query on Athena.
We use Athena for this query as an example. You can do the same from Hive on Amazon EMR, Spark on Amazon EMR, AWS Glue for Apache Spark jobs, and more.
To load partitions using Athena, we need to use the ALTER TABLE ADD PARTITION
command, which can create one or more partitions in the table. ALTER TABLE ADD PARTITION
supports partitions created on Amazon S3 with camel case (s3://bucket/table/dayOfTheYear=20), Hive format (s3://bucket/table/dayoftheyear=20
), and non-Hive style partitioning schemes used by AWS CloudTrail logs, which use separate path components for date parts, such as s3://bucket/data/2021/01/26/us/6fc7845e.json.
To load partitions into a table, you can run the following query in the Athena query editor:
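(A sketch of the statement, assuming the partition value shown earlier; adjust the LOCATION to match the exact partition folder name in the dataset.)

```sql
ALTER TABLE impressions ADD IF NOT EXISTS
  PARTITION (dt = '2009-04-12-19-05')
  LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/dt="2009-04-12-19-05"/';
```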
Refer to ALTER TABLE ADD PARTITION for more information.
Another option is using the AWS Glue APIs. AWS Glue provides two APIs to load partitions into a table: create_partition() and batch_create_partition(). For the API parameters, refer to CreatePartition.
The following example uses the AWS CLI:
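(A minimal sketch; the StorageDescriptor is abbreviated here, and in practice you would copy it from the table definition so the partition carries the correct format and SerDe settings.)

```bash
aws glue create-partition \
    --database-name partitions_blog \
    --table-name impressions \
    --partition-input '{
        "Values": ["2009-04-12-19-05"],
        "StorageDescriptor": {
            "Location": "s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/dt=\"2009-04-12-19-05\"",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hive.hcatalog.data.JsonSerDe"
            }
        }
    }'
```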
Both commands (ALTER TABLE in Athena and the AWS Glue API create-partition) create partitions that inherit their schema from the table definition.
Load multiple partitions using MSCK REPAIR TABLE
You can load multiple partitions in Athena. MSCK REPAIR TABLE
is a DDL statement that scans the entire S3 path defined in the table’s Location
property. Athena lists the S3 path searching for Hive-compatible partitions, then loads the existing partitions into the AWS Glue table’s metadata. Before you can run the command, the table needs to exist in the Data Catalog and its data source must be Amazon S3. You can create a table with the AWS Glue APIs or by running a CREATE TABLE
statement in Athena. After the table creation, run MSCK REPAIR TABLE
to load the partitions.
The DDL query timeout parameter in the service quotas defines how long a DDL statement can run. The runtime increases with the number of folders or partitions in the S3 path.
The MSCK REPAIR TABLE
command is best used when creating a table for the first time or when there is uncertainty about parity between data and partition metadata. It supports folders created in lowercase and using Hive-style partitions format (for example, year=2023/month=6/day=01
). Because MSCK REPAIR TABLE
scans both the folder and its subfolders to find a matching partition scheme, you should keep data for separate tables in separate folder hierarchies.
Every MSCK REPAIR TABLE
command lists the entire folder specified in the table location. If you add new partitions frequently (for example, every 5 minutes or every hour), consider scheduling an ALTER TABLE ADD PARTITION
statement to load only the partitions defined in the statement instead of scanning the entire S3 path.
The partitions created in the Data Catalog by MSCK REPAIR TABLE inherit their schema from the table definition. Note that Athena doesn’t charge for DDL statements, making MSCK REPAIR TABLE a more straightforward and affordable way to load partitions.
Add multiple partitions using an AWS Glue crawler
An AWS Glue crawler offers more features when loading partitions into the table. A crawler automatically identifies partitions in Amazon S3, extracts metadata, and creates table definitions in the Data Catalog. Crawlers can crawl both file-based and table-based data stores.
Crawlers can help automate table creation and loading partitions into tables. They are charged at an hourly rate, billed per second. You can optimize the crawler’s performance by altering parameters like the sample size or by configuring it to crawl new folders only.
If the schema of the data changes, the crawler will update the table and partition schemas accordingly. The crawler configuration options have parameters such as “update the table definition in the Data Catalog,” “add new columns only,” and “ignore the change and don’t update the table in the Data Catalog,” which tell the crawler how to update the table when needed and evolve the table schema.
Crawlers can create and update multiple tables from the same data source. When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure and which directories are partitions for the table.
To create an AWS Glue crawler, complete the following steps:
- On the AWS Glue console, choose Crawlers in the navigation pane under Data Catalog.
- Choose Create crawler.
- Provide a name and optional description, then choose Next.
- Under Data source configuration, select Not yet and choose Add a data source.
- For Data source, choose S3.
- For S3 path, enter the path of the impression data (s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions).
- Select a preference for subsequent crawler runs.
- Choose Add an S3 data source.
- Select your data source and choose Next.
- Under IAM role, either choose an existing AWS Identity and Access Management (IAM) role or choose Create new IAM role.
- Choose Next.
- For Target database, choose partitions_blog.
- For Table name prefix, enter crawler_.
We use the table prefix to add a custom prefix in front of the table name. For example, if you leave the prefix field empty and start the crawler on s3://my-bucket/some-table-backup, it creates a table with the name some-table-backup. If you add crawler_ as a prefix, it creates a table called crawler_some-table-backup.
- Choose your crawler schedule, then choose Next.
- Review your settings and create the crawler.
- Select your crawler and choose Run.
Wait for the crawler to finish running.
You can go to Athena and check that the table was created:
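(A hypothetical check; with the crawler_ prefix and the impressions source folder, the crawler should create a table named crawler_impressions.)

```sql
SELECT * FROM partitions_blog.crawler_impressions LIMIT 10;
```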
Partitioning data while ingesting and cataloging it in Amazon S3
The previous examples work with data that already exists in Amazon S3. If you’re using AWS Glue jobs to write data on Amazon S3, you have the option to create partitions with DynamicFrames by enabling the “enableUpdateCatalog=True”
parameter. Refer to Creating tables, updating the schema, and adding new partitions in the Data Catalog from AWS Glue ETL jobs for more information.
DynamicFrames support native partitioning using a sequence of keys, via the partitionKeys option when you create a sink. For example, the following Python code writes out a dataset to Amazon S3 in Parquet format into directories partitioned by the year field. After ingesting the data and registering partitions from the AWS Glue job, you can use these partitions from queries running on other analytics engines such as Athena.
Conclusion
This post showed multiple methods for partitioning your Amazon S3 data, which helps reduce costs by avoiding unnecessary data scanning and also improves the overall performance of your processes. We further described how AWS Glue makes effective metadata management for partitions possible, allowing you to optimize your storage and query operations in AWS Glue and Athena. These partitioning methods can help when you scan high volumes of data or run long-running queries, and they reduce the cost of scanning.
We hope you try out these options!
About the authors
Anderson Santos is a Senior Solutions Architect at Amazon Web Services. He works with AWS Enterprise customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS.
Arun Pradeep Selvaraj is a Senior Solutions Architect at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build and reinvent. He is creative, fast-paced, deeply customer-obsessed and leverages the working backwards process to build modern architectures to help customers solve their unique challenges.
Patrick Muller is a Senior Solutions Architect and a valued member of the Datalab. With over 20 years of expertise in analytics, data warehousing, and distributed systems, he brings extensive knowledge to the table. Patrick’s passion lies in evaluating new technologies and assisting customers with innovative solutions. During his free time, he enjoys watching soccer.