AWS Storage Blog
How Amazon S3 Tables use compaction to improve query performance by up to 3 times
Today, businesses managing petabytes of data must optimize storage and processing to drive timely insights while remaining cost-effective. Customers often choose Apache Parquet for its storage efficiency and query performance. Additionally, customers use Apache Iceberg to organize Parquet datasets and take advantage of its database-like features, such as schema evolution, time travel, and ACID transactions. Customers use Iceberg to store datasets generated from real-time streaming, change data capture (CDC), log analysis, and similar use cases. Such workloads involve frequent, granular updates, resulting in numerous small files within these Iceberg datasets.
As the number of data files grows, query performance degrades for downstream applications reading these datasets. A dataset spread over many small Parquet files forces the query engine to make many small data reads per query, each with a fixed overhead. An increase in small files from repeated granular writes can therefore quickly produce a volume of read requests large enough to hurt end-to-end query performance.
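To make the fixed-overhead point concrete, consider a back-of-the-envelope sketch in Python; the per-request cost used here is purely an illustrative assumption, not a measured S3 latency:

```python
# Back-of-the-envelope arithmetic only: the per-request overhead below is an
# illustrative assumption, not a measured S3 latency.
DATASET_MIB = 1024              # read 1 GiB of data
PER_REQUEST_OVERHEAD_S = 0.030  # hypothetical fixed cost per GET/range request

for file_size_mib in (1, 512):
    requests = DATASET_MIB // file_size_mib
    overhead_s = requests * PER_REQUEST_OVERHEAD_S
    print(f"{file_size_mib} MiB files: {requests} requests, "
          f"~{overhead_s:.1f}s of pure per-request overhead")
```

Under these assumptions, reading 1 GiB as 1 MiB objects incurs 1,024 requests and roughly 30 seconds of cumulative request overhead, while reading it as 512 MiB objects incurs only two requests; the data volume is identical, but the fixed costs dominate in the small-file case.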
Consolidating small Parquet files into larger ones allows query engines to read larger data ranges with fewer requests, resulting in higher total read throughput. This process, known as compaction, optimizes storage efficiency and improves query performance by reducing the overhead of accessing many small files. Performing compaction on tables at scale, however, can be resource-intensive and challenging to manage effectively. At AWS re:Invent 2024, we introduced Amazon S3 Tables, purpose-built for storing and managing tabular data at scale using the Apache Iceberg standard. S3 Tables automatically perform compaction in addition to other maintenance tasks, such as snapshot management and unreferenced file removal.
To demonstrate the performance benefits of S3 Tables’ automatic compaction, we ran tests comparing query performance between an uncompacted Iceberg table in a general purpose bucket, as often seen in self-managed environments, and a fully managed table in a table bucket. In this post, we analyze the results of those tests and discuss the performance benefits of S3 Tables.
Benchmarking setup
To simulate real-world scenarios with frequent and granular updates, we used a 3 TB TPC-DS dataset partitioned into 1 MB files, which is typical of workloads with high-velocity, incremental data ingestion. We chose a subset of eight I/O-intensive TPC-DS queries to evaluate the impact of Parquet data storage optimizations on query performance. Our tests ran on an Amazon EMR 7.5.0 cluster of nine r5dn.4xlarge instances (one primary, eight core), each with 16 vCPUs, 128 GB memory, and 600 GB SSD storage. We conducted the tests in two stages: baseline measurements on uncompacted Iceberg tables in a general purpose bucket, and performance tests on compacted tables in a table bucket. For each stage, we ran five test iterations and calculated the mean query execution time.
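For reference, the sketch below shows how such a run can be wired up in PySpark using the S3 Tables Iceberg catalog; the bucket ARN, namespace, table name, and query are placeholders rather than our exact benchmark harness:

```python
from pyspark.sql import SparkSession
import time

# Wire Spark's Iceberg catalog to a table bucket via the S3 Tables catalog
# implementation (the s3-tables-catalog-for-iceberg JAR must be on the classpath).
spark = (
    SparkSession.builder.appName("tpcds-compaction-benchmark")
    .config("spark.sql.catalog.s3tablesbucket", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tablesbucket.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tablesbucket.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket")  # hypothetical ARN
    .getOrCreate()
)

def mean_runtime(query_sql: str, iterations: int = 5) -> float:
    """Run a query several times and return the mean wall-clock seconds."""
    samples = []
    for _ in range(iterations):
        start = time.monotonic()
        spark.sql(query_sql).collect()  # force full execution
        samples.append(time.monotonic() - start)
    return sum(samples) / len(samples)

# Placeholder query; in our tests each timed statement was one of the eight TPC-DS queries.
print(mean_runtime("SELECT count(*) FROM s3tablesbucket.tpcds.store_sales"))
```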
Observations
Our results revealed significant performance improvements when using datasets compacted by S3 Tables. With compaction enabled on the table bucket, we observed query accelerations of up to 3.2x, with queries against the table bucket consistently outperforming (by 1.1x to 3.2x) self-managed tables in a general purpose bucket. Overall, we saw a 2.26x improvement in the total execution time across all eight queries. This improvement can be attributed to the reduction in read requests that results from the larger object sizes produced by compaction.
For instance, queries run against the 1 MB Parquet objects of the uncompacted tables required 8.5x more read requests than those run against the 512 MB Parquet objects in the table bucket. Small object sizes force the query engine to make many small, kilobyte-range requests, which is inherently less efficient. In contrast, the compacted dataset allowed the query engine to read much larger ranges with fewer requests, significantly boosting total read throughput. Actual performance improvements will vary with specific workloads, file sizes, query engine configurations, and data access patterns, so our findings should be treated as indicative rather than definitive across all scenarios.
| TPC-DS Query ID | Uncompacted table in general purpose bucket (seconds) | Compacted table in table bucket (seconds) | Performance improvement |
|---|---|---|---|
| 25 | 51.8 | 46.39 | 1.12x |
| 31 | 117.21 | 45.24 | 2.59x |
| 49 | 134.51 | 60.43 | 2.23x |
| 76 | 45.61 | 19.84 | 2.3x |
| 77 | 55.79 | 19.91 | 2.8x |
| 80 | 62.96 | 40.56 | 1.55x |
| 88 | 180.94 | 56.2 | 3.22x |
| 96 | 23.63 | 8.34 | 2.83x |
| Total runtime | 672.46 | 296.92 | 2.26x |

Table 1: Average query runtime (in seconds) across five repeated runs
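The improvement column in Table 1 is simply the ratio of the two mean runtimes; a few lines of Python reproduce the per-query factors and the 2.26x total:

```python
# The improvement column in Table 1 is the ratio of the two mean runtimes.
uncompacted = {25: 51.8, 31: 117.21, 49: 134.51, 76: 45.61,
               77: 55.79, 80: 62.96, 88: 180.94, 96: 23.63}
compacted = {25: 46.39, 31: 45.24, 49: 60.43, 76: 19.84,
             77: 19.91, 80: 40.56, 88: 56.2, 96: 8.34}

for query_id in uncompacted:
    print(f"Q{query_id}: {uncompacted[query_id] / compacted[query_id]:.2f}x")

total_u, total_c = sum(uncompacted.values()), sum(compacted.values())
print(f"Total: {total_u:.1f}s vs {total_c:.1f}s -> {total_u / total_c:.2f}x")
```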
As observed, compacting small Parquet files into larger ones improves query performance, but customers must continuously optimize the objects to maintain a performant data lake. Regular compaction on tables at scale in self-managed environments is a complex task. It often demands dedicated compute clusters and skilled teams to manage these tables, adding significant overhead to data operations. S3 Tables simplify this process by automatically performing compaction on tables stored in table buckets without any manual intervention. This approach helps ensure that datasets are always optimized, without the need for additional infrastructure or specialized management, allowing customers to streamline data operations and significantly reduce operational complexity.
To get started with S3 Tables, simply create a new table bucket and a table within it. By default, each table is assigned a target file size of 512 MB, but as workload requirements evolve, you can flexibly adjust this target anywhere between 64 MB and 512 MB. S3 writes compacted objects as the most recent snapshot of the table, helping to ensure the data remains current and efficiently organized.
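If you prefer to script this setup, the following boto3 sketch creates a table bucket, namespace, and Iceberg table, then sets a custom compaction target file size. The region, bucket, namespace, and table names are hypothetical, and the request shapes reflect the s3tables API as we understand it:

```python
import boto3

# Service client for S3 Tables; region and all names below are hypothetical.
s3tables = boto3.client("s3tables", region_name="us-east-1")

# Create a table bucket, a namespace, and an Iceberg table inside it.
bucket_arn = s3tables.create_table_bucket(name="my-table-bucket")["arn"]
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["analytics"])
s3tables.create_table(tableBucketARN=bucket_arn, namespace="analytics",
                      name="events", format="ICEBERG")

# Lower the compaction target file size from the 512 MB default (64-512 MB allowed).
s3tables.put_table_maintenance_configuration(
    tableBucketARN=bucket_arn,
    namespace="analytics",
    name="events",
    type="icebergCompaction",
    value={"status": "enabled",
           "settings": {"icebergCompaction": {"targetFileSizeMB": 256}}},
)
```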
Conclusion
In this post, we discussed the results of our benchmarking of self-managed (uncompacted) tables in a general purpose bucket and managed tables in a table bucket. S3 Tables, with automatic compaction, can deliver up to a 3x improvement in query performance for storage-intensive workloads compared to self-managed tables in S3. To learn more about getting started with Amazon S3 Tables, read the S3 User Guide.