AWS Big Data Blog
Choosing an open table format for your transactional data lake on AWS
August 2023: This post was updated to include Apache Iceberg support in Amazon Redshift.
Disclaimer: Due to rapid advancements in AWS service support for open table formats, recent developments might not yet be reflected in this post. For the latest information on AWS service support for open table formats, refer to the official AWS service documentation.
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale. This data is then projected into analytics services such as data warehouses, search systems, stream processors, query editors, notebooks, and machine learning (ML) models through direct access, real-time, and batch workflows. Data in customers’ data lakes is used to fulfill a multitude of use cases, from real-time fraud detection for financial services companies to inventory and real-time marketing campaigns for retailers to flight and hotel room availability for the hospitality industry. Across all use cases, permissions, data governance, and data protection are table stakes, and customers require a high level of control over data security, encryption, and lifecycle management.
This post shows how open-source transactional table formats (or open table formats) can help you solve advanced use cases around performance, cost, governance, and privacy in your data lakes. We also provide insights into the features and capabilities of the most common open table formats available to support various use cases.
You can use this post for guidance when looking to select an open table format for your data lake workloads, facilitating the decision-making process and potentially narrowing down the available options. The content of this post is based on the latest open-source releases of the reviewed formats at the time of writing: Apache Hudi v0.13.0, Apache Iceberg 1.2.0, and Delta Lake 2.3.0.
Advanced use cases in modern data lakes
Data lakes offer one of the best options for cost, scalability, and flexibility to store data, allowing you to retain large volumes of structured and unstructured data at a low cost, and to use this data for different types of analytics workloads—from business intelligence reporting to big data processing, real-time analytics, and ML—to help guide better decisions.
Despite these capabilities, data lakes are not databases, and object storage does not provide support for ACID processing semantics, which you may require to effectively optimize and manage your data at scale across hundreds or thousands of users using a multitude of different technologies. For example:
- Performing efficient record-level updates and deletes as data changes in your business
- Managing query performance as tables grow to millions of files and hundreds of thousands of partitions
- Ensuring data consistency across multiple concurrent writers and readers
- Preventing data corruption from write operations failing partway through
- Evolving table schemas over time without (partially) rewriting datasets
These challenges have become particularly prevalent in use cases such as CDC (change data capture) from relational database sources, privacy regulations requiring deletion of data, and streaming data ingestion, which can result in many small files. Typical data lake file formats such as CSV, JSON, Parquet, or ORC only allow for writes of entire files, making the aforementioned requirements hard to implement, time-consuming, and costly.
To help overcome these challenges, open table formats provide additional database-like functionality that simplifies the optimization and management overhead of data lakes, while still supporting storage on cost-effective systems like Amazon Simple Storage Service (Amazon S3). These features include:
- ACID transactions – Allowing a write to completely succeed or be rolled back in its entirety
- Record-level operations – Allowing for single rows to be inserted, updated, or deleted
- Indexes – Improving performance in addition to data lake techniques like partitioning
- Concurrency control – Allowing for multiple processes to read and write the same data at the same time
- Schema evolution – Allowing for columns of a table to be added or modified over the life of a table
- Time travel – Enabling you to query data as of a point in time in the past
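To make these features more concrete, the following is a minimal PySpark sketch showing a record-level update, a record-level delete, and a time travel query with Spark SQL against an Apache Iceberg table. It assumes an existing Spark session with the Iceberg extensions enabled and a catalog named glue_catalog; the table and column names are hypothetical.

```python
# Minimal sketch: record-level operations and time travel with Spark SQL on an
# Apache Iceberg table. Assumes an existing SparkSession ("spark") with the
# Iceberg extensions enabled and a catalog named "glue_catalog"; the table and
# column names are hypothetical.

# Record-level update and delete, without manually rewriting data files
spark.sql("""
    UPDATE glue_catalog.sales.orders
    SET order_status = 'SHIPPED'
    WHERE order_id = 1001
""")

spark.sql("""
    DELETE FROM glue_catalog.sales.orders
    WHERE order_id = 1002
""")

# Time travel: query the table as it existed at an earlier point in time
spark.sql("""
    SELECT * FROM glue_catalog.sales.orders
    TIMESTAMP AS OF '2023-08-01 00:00:00'
""").show()
```

The exact syntax differs between formats and engines, but all three table formats expose comparable operations through Spark.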
In general, open table formats implement these features by storing multiple versions of a single record across many underlying files, and use a tracking and indexing mechanism that allows an analytics engine to see or modify the correct version of the records it is accessing. When records are updated or deleted, the changed information is stored in new files, and all files relevant to a given record are retrieved during an operation and reconciled by the open table format software. This is a powerful architecture that is used in many transactional systems, but in data lakes, it can have side effects that have to be addressed to help you align with performance and compliance requirements.

For instance, when data is deleted from an open table format, in some cases only a delete marker is stored, with the original data retained until a compaction or vacuum operation performs a hard deletion. For updates, previous versions of the old values of a record may be retained until a similar process is run. This can mean that data that should be deleted isn’t, or that you store a significantly larger number of files than you intend to, increasing storage cost and slowing down read performance. Regular compaction and vacuuming must be run, either as part of the way the open table format works, or separately as a maintenance procedure.
The three most common and prevalent open table formats are Apache Hudi, Apache Iceberg, and Delta Lake. AWS supports all three of these open table formats, and in this post, we review the features and capabilities of each, how they can be used to implement the most common transactional data lake use cases, and which features and capabilities are available in AWS’s analytics services. Innovation around these table formats is happening at an extremely rapid pace, and there are likely features available in these table formats that aren’t covered here. All due care has been taken to provide the correct information as of the time of writing, but we also expect this information to change quickly, and we’ll update this post frequently to contain the most accurate information. Also, this post focuses only on the open-source versions of the covered table formats and doesn’t cover experimental or preview features. Extensions or proprietary features available from individual third-party vendors are also out of scope.
How to use this post
We encourage you to use the high-level guidance in this post with the mapping of functional fit and supported integrations for your use cases. Combine both aspects to identify what table format is likely a good fit for a specific use case, and then prioritize your proof of concept efforts accordingly. Most organizations have a variety of workloads that can benefit from an open table format, but today no single table format is a “one size fits all.” You may wish to select a specific open table format on a case-by-case basis to get the best performance and features for your requirements, or you may wish to standardize on a single format and understand the trade-offs that you may encounter as your use cases evolve.
This post doesn’t promote a single table format for any given use case. The functional evaluations are only intended to help speed up your decision-making process by highlighting key features and attention points for each table format with each use case. It is crucial that you perform testing to ensure that a table format meets your specific use case requirements.
This post is not intended to provide detailed technical guidance (for example, best practices) or benchmarking of the specific table formats; detailed guidance is available in the AWS Technical Guides, and benchmarks are available from the open-source community.
Choosing an open table format
When choosing an open table format for your data lake, we believe that there are two critical aspects that should be evaluated:
- Functional fit – Does the table format offer the features required to efficiently implement your use case with the required performance? Although they all offer common features, each table format has a different underlying technical design and may support unique features. Each format can handle a range of use cases, but each also offers specific advantages or trade-offs, and may be more efficient in certain scenarios as a result of its design.
- Supported integrations – Does the table format integrate seamlessly with your data environment? When evaluating a table format, it’s important to consider its supported engine integrations across dimensions such as read/write support, data catalog integration, and the access control tools you have in your organization. This applies to both integration with AWS services and with third-party tools.
General features and considerations
The following table summarizes general features and considerations for each table format that you may want to take into account, regardless of your use case. In addition, it’s also important to take into account other aspects, such as the complexity of the table format and in-house skills.
| . | Apache Hudi | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Primary API∇ | | | |
| Write modes | | | |
| Supported data file formats | | | |
| File layout management | | | |
| Query optimization | | | |
| S3 optimizations | | | |
| Table maintenance | | | |
| Time travel | | | |
| Schema evolution | | | |
| Operations | | | |
| Monitoring | | | |
| Data encryption | | | |
| Configuration options | Extensive configuration options for customizing read/write behavior (such as index type or merge logic) and automatically performed maintenance and optimizations (such as file sizing, compaction, and cleaning) | Configuration options for basic read/write behavior (Merge On Read or Copy On Write operation modes) | Limited configuration options for table properties (for example, indexed columns) |
| Other | | | |
| AWS analytics services support* | | | |
| Amazon EMR | Read and write | Read and write | Read and write |
| AWS Glue for Apache Spark | Read and write | Read and write | Read and write |
| Amazon Athena (SQL) | Read | Read and write | Read |
| Amazon Athena (Spark) | Read and write | Read and write | Read and write |
| Amazon Redshift (Spectrum) | Read | Read | Read† |
| AWS Glue Data Catalog‡ | Yes | Yes | Yes |
∇ Spark API with most extensive feature support for the table format
* For table format support in third-party tools, consult the official documentation for the respective tool.
† Amazon Redshift only supports Delta Symlink tables (see Creating external tables for data managed in Delta Lake for more information).
‡ Refer to Working with other AWS services in the Lake Formation documentation for an overview of table format support when using Lake Formation with other AWS services.
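As an illustration of one of the integrations listed above, the following sketch shows a PySpark session configured to work with Apache Iceberg tables registered in the AWS Glue Data Catalog, for example on Amazon EMR or AWS Glue for Apache Spark. The catalog name and S3 warehouse path are placeholders, and the exact dependencies and setup steps depend on the service and version you use; refer to the respective service documentation for details.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session (for example on Amazon EMR or AWS Glue for
# Apache Spark) configured to read and write Apache Iceberg tables registered
# in the AWS Glue Data Catalog. The catalog name "glue_catalog" and the S3
# warehouse path are placeholders.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://your-bucket/warehouse/")
    .getOrCreate()
)

# Tables in the Glue Data Catalog are then addressed as glue_catalog.<database>.<table>
spark.sql("SELECT * FROM glue_catalog.sales.orders LIMIT 10").show()
```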
Functional fit for common use cases
Now let’s dive deep into specific use cases to understand the capabilities of each open table format.
Getting data into your data lake
In this section, we discuss the capabilities of each open table format for streaming ingestion, batch loads, and change data capture (CDC) use cases.
Streaming ingestion
Streaming ingestion allows you to write changes from a queue, topic, or stream into your data lake. Although your specific requirements may vary based on the type of use case, streaming data ingestion typically requires the following features:
- Low-latency writes – Supporting record-level inserts, updates, and deletes, for example to support late-arriving data
- File size management – Enabling you to create files that are sized for optimal read performance (rather than creating one or more files per streaming batch, which can result in millions of tiny files)
- Support for concurrent readers and writers – Including schema changes and table maintenance
- Automatic table management services – Enabling you to maintain consistent read performance
In this section, we talk about streaming ingestion where records are just inserted into files, and you aren’t trying to update or delete previous records based on changes. A typical example of this is time series data (for example sensor readings), where each event is added as a new record to the dataset. The following table summarizes the features.
| . | Apache Hudi | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Functional fit | | | |
| Considerations | Hudi’s default configurations are tailored for upserts, and need to be tuned for append-only streaming workloads. For example, Hudi’s automatic file sizing in the writer minimizes operational effort/complexity required to maintain read performance over time, but can add a performance overhead at write time. If write speed is of critical importance, it can be beneficial to turn off Hudi’s file sizing, write new data files for each batch (or micro-batch), then run clustering later to create better sized files for read performance (using a similar approach as Iceberg or Delta). | | |
| Supported AWS integrations | | | |
| Conclusion | Good functional fit for all append-only streaming when configuration tuning for append-only workloads is acceptable. | Good fit for append-only streaming with larger micro-batch windows, and when operational overhead of table management is acceptable. | Good fit for append-only streaming with larger micro-batch windows, and when operational overhead of table management is acceptable. |
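For illustration, the following sketch shows what append-only streaming ingestion into an Iceberg table can look like with Spark Structured Streaming. The Kafka source, topic, table name, and checkpoint location are placeholders, and the trigger interval is only one example of using larger micro-batch windows to produce fewer, larger files per commit.

```python
# Minimal sketch: append-only streaming ingestion into an Iceberg table with
# Spark Structured Streaming. The Kafka source, topic, table name, and
# checkpoint location are placeholders; assumes an Iceberg-enabled Spark
# session as in the earlier configuration sketch.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "sensor-readings")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS event_time")
)

query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    # Larger micro-batch windows produce fewer, larger files per commit
    .trigger(processingTime="60 seconds")
    .option("checkpointLocation", "s3://your-bucket/checkpoints/sensor_readings/")
    .toTable("glue_catalog.iot.sensor_readings")
)
```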
When streaming data with updates and deletes into a data lake, a key priority is to have fast upserts and deletes by being able to efficiently identify impacted files to be updated.
| . | Apache Hudi | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Functional fit | | | |
| Considerations | | | |
| Supported AWS integrations | | | |
| Conclusion | Good fit for lower-latency streaming with updates and deletes thanks to native support for streaming upserts, indexes for upserts, and automatic file sizing and compaction. | Good fit for streaming with larger micro-batch windows and when the operational overhead of table management is acceptable. | Can be used for streaming data ingestion with updates/deletes if latency is not a concern, because a Copy-On-Write strategy may not deliver the write performance required by low latency streaming use cases. |
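As a rough sketch of what streaming upserts can look like with Hudi, the following writes a streaming DataFrame of change records into a Merge On Read table, using the record key to locate rows to update and a precombine field to resolve late-arriving data. The source DataFrame, field names, and S3 paths are placeholders; production workloads typically require additional tuning.

```python
# Minimal sketch: low-latency streaming upserts into a Hudi Merge On Read table.
# "changes" is assumed to be a streaming DataFrame keyed by customer_id with an
# updated_at column; the table name, fields, and S3 paths are placeholders.
hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    # The precombine field decides which version wins when late data arrives
    "hoodie.datasource.write.precombine.field": "updated_at",
}

query = (
    changes.writeStream
    .format("hudi")
    .options(**hudi_options)
    .outputMode("append")
    .option("checkpointLocation", "s3://your-bucket/checkpoints/customers/")
    .start("s3://your-bucket/hudi/customers/")
)
```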
Change data capture
Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real time to a downstream process or system—in this case, delivering CDC data from databases into Amazon S3.
In addition to the aforementioned general streaming requirements, the following are key requirements for efficient CDC processing:
- Efficient record-level updates and deletes – With the ability to efficiently identify files to be modified (which is important to support late-arriving data).
- Native support for CDC – With the following options:
  - CDC record support in the table format – The table format understands how to process CDC-generated records and no custom preprocessing is required for writing CDC records to the table.
  - CDC tools natively supporting the table format – CDC tools understand how to process CDC-generated records and apply them to the target tables. In this case, the CDC engine writes to the target table without another engine in between.
Without support for either of these two options, custom code is required to correctly process and apply CDC records to a target table. Each CDC tool likely has its own CDC record format (or payload); for example, Debezium and AWS Database Migration Service (AWS DMS) each have their own specific record formats that need to be transformed differently. This must be considered when you operate CDC at scale across many tables.
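To illustrate the kind of custom code this implies, the following hedged sketch merges a batch of AWS DMS-style CDC records (with an op column of I, U, or D) into a target table using Spark SQL MERGE INTO, which all three formats support through Spark. The view, table, and column names are hypothetical, and the deduplication logic assumes a change_ts column that identifies the latest change per key.

```python
# Minimal sketch: custom merge logic for AWS DMS-style CDC records that carry
# an "op" column (I = insert, U = update, D = delete). "cdc_batch" is assumed
# to be a DataFrame of raw CDC records; the target table and columns are
# hypothetical.
cdc_batch.createOrReplaceTempView("cdc_changes")

spark.sql("""
    MERGE INTO glue_catalog.sales.customers AS t
    USING (
        -- keep only the latest change per key within this batch
        SELECT customer_id, name, email, op FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer_id ORDER BY change_ts DESC) AS rn
            FROM cdc_changes
        ) ranked
        WHERE rn = 1
    ) AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.name = s.name, t.email = s.email
    WHEN NOT MATCHED AND s.op <> 'D' THEN
        INSERT (customer_id, name, email) VALUES (s.customer_id, s.name, s.email)
""")
```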
All three table formats allow you to implement CDC from a source database into a target table. The difference for CDC with each format lies mainly in the ease of implementing CDC pipelines and supported integrations.
| . | Apache Hudi | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Functional fit | | | |
| Considerations | | | |
| Natively supported CDC formats | | | |
| CDC tool integrations | | | |
| Conclusion | All three formats can implement CDC workloads. Apache Hudi offers the best overall technical fit for CDC workloads as well as the most options for efficient CDC pipeline design: no-code/low-code with DeltaStreamer, third-party CDC tools offering native Hudi integration, or a Spark/Flink engine using CDC record payloads offered in Hudi. | | |
Batch loads
If your use case requires only periodic writes but frequent reads, you may want to use batch loads and optimize for read performance.
Batch loading data with updates and deletes is perhaps the simplest use case to implement with any of the three table formats. Batch loads typically don’t require low latency, allowing them to benefit from the operational simplicity of a Copy On Write strategy. With Copy On Write, data files are rewritten to apply updates and add new records, minimizing the complexity of having to run compaction or optimization table services on the table.
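As a simple illustration, the following sketch performs a periodic batch upsert into a Delta Lake table using the DeltaTable merge API; with a Copy On Write layout, the affected data files are rewritten as part of the merge, so no separate compaction step is needed afterwards. The source DataFrame, S3 path, and join key are placeholders.

```python
from delta.tables import DeltaTable

# Minimal sketch: a periodic batch upsert into a Delta Lake table. With a Copy
# On Write layout, affected data files are rewritten as part of the merge, so
# no separate compaction step is required afterwards. "daily_updates" is
# assumed to be a batch DataFrame; the S3 path and join key are placeholders.
target = DeltaTable.forPath(spark, "s3://your-bucket/delta/products/")

(
    target.alias("t")
    .merge(daily_updates.alias("s"), "t.product_id = s.product_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```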
| . | Apache Hudi | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Functional fit | | | |
| Considerations | | | |
| Supported AWS integrations | | | |
| Conclusion | All three formats are well suited for batch loads. Apache Hudi supports the most configuration options and may increase the effort to get started, but provides lower operational effort due to automatic table management. On the other hand, Iceberg and Delta are simpler to get started with, but require some operational overhead for table maintenance. | | |
Working with open table formats
In this section, we discuss the capabilities of each open table format for common use cases: optimizing read performance, incremental data processing, and processing deletes to comply with privacy regulations.
Optimizing read performance
The preceding sections primarily focused on write performance for specific use cases. Now let’s explore how each open table format can support optimal read performance. Although there are some cases where data is optimized purely for writes, read performance is typically a very important dimension on which you should evaluate an open table format.
Open table format features that improve query performance include the following:
- Indexes, (column) statistics, and other metadata – Improves query planning and file pruning, resulting in reduced data scanned
- File layout optimization – Improves query performance through the following:
- File size management – Properly sized files provide better query performance
- Data colocation (through clustering) according to query patterns – Reduces the amount of data scanned by queries
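As an example of such a maintenance process, the following sketch compacts and sorts the data files of an Iceberg table using the rewrite_data_files Spark procedure; the catalog, table, and sort column are placeholders. Delta Lake offers comparable functionality through its OPTIMIZE command (with optional Z-ordering), and Hudi can perform file sizing and clustering automatically as part of its table services.

```python
# Minimal sketch: compacting small files and colocating data by a commonly
# filtered column on an Iceberg table, run as a separate maintenance job.
# The catalog, table, and sort column are placeholders.
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'sales.orders',
        strategy => 'sort',
        sort_order => 'customer_id'
    )
""")
```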
| . | Apache Hudi | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Functional fit | | | |
| Considerations | | | |
| Optimization & Maintenance Processes | | | |
| Conclusion | For achieving good read performance, it’s important that your query engine supports the optimization features offered by the table formats. When using Spark, all three formats provide good read performance when properly configured. When using Trino (and therefore Athena as well), Iceberg will likely provide better query performance because the data skipping feature of Hudi and Delta is not supported in the Trino engine. Make sure to evaluate this feature support for your query engine of choice. | | |
Incremental processing of data on the data lake
At a high level, incremental data processing is the movement of new or fresh data from a source to a destination. To implement incremental extract, transform, and load (ETL) workloads efficiently, you need to be able to retrieve only the data records that have been changed or added since a certain point in time (incrementally) so you don’t need to reprocess unnecessary data (such as entire partitions). When your data source is an open table format table, you can take advantage of incremental queries to facilitate more efficient reads in these table formats.
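For example, the following sketch reads only the records added or changed in a Hudi table after a given commit instant, using Hudi’s incremental query type; the base path and begin instant time are placeholders. Delta Lake exposes similar functionality through its Change Data Feed, and Iceberg through incremental reads between snapshots.

```python
# Minimal sketch: a Hudi incremental query that returns only the records added
# or changed after a given commit instant. The base path and begin instant
# time are placeholders.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230801000000")
    .load("s3://your-bucket/hudi/customers/")
)
incremental.show()
```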
| . | Apache Hudi | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Functional fit | | | |
| Considerations | | | |
| Supported AWS integrations | Incremental queries are supported in: | Incremental queries supported in: CDC view supported in: | CDF supported in: |
| Conclusion | Best functional fit for incremental ETL pipelines using a variety of engines, without any storage overhead. | Good fit for implementing incremental pipelines using Spark if the overhead of creating views is acceptable. | Good fit for implementing incremental pipelines using Spark if the additional storage overhead is acceptable. |
Processing deletes to comply with privacy regulations
Due to privacy regulations like the General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA), companies across many industries need to perform record-level deletes on their data lake for “right to be forgotten” or to correctly store changes to consent on how their customers’ data can be used.
The ability to perform record-level deletes without rewriting entire (or large parts of) datasets is the main requirement for this use case. For compliance regulations, it’s important to perform hard deletes (deleting records from the table and physically removing them from Amazon S3).
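As a rough sketch of what this involves, the following deletes a record from an Iceberg table and then expires older snapshots so the underlying data files can be physically removed from Amazon S3; the table, predicate, and retention timestamp are placeholders and must be aligned with your compliance window.

```python
# Minimal sketch: a record-level hard delete on an Iceberg table. The DELETE
# removes the record from the current snapshot; expiring older snapshots then
# allows the underlying data files to be physically removed from Amazon S3.
# Table, predicate, and retention timestamp are placeholders.
spark.sql("""
    DELETE FROM glue_catalog.sales.customers
    WHERE customer_id = 42
""")

spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'sales.customers',
        older_than => TIMESTAMP '2023-08-15 00:00:00'
    )
""")
```

The equivalent cleanup step is the vacuum operation for Delta Lake and the cleaner service for Hudi, as summarized in the following table.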
| . | Apache Hudi | Apache Iceberg | Delta Lake |
| --- | --- | --- | --- |
| Functional fit | Hard deletes are performed by Hudi’s automatic cleaner service. | Hard deletes can be implemented as a separate process. | Hard deletes can be implemented as a separate process. |
| Considerations | Hudi cleaner needs to be configured according to compliance requirements to automatically remove older file versions in time (within a compliance window), otherwise time travel or rollback operations could recover deleted records. | Previous snapshots need to be (manually) expired after the delete operation, otherwise time travel operations could recover deleted records. | The vacuum operation needs to be run after the delete, otherwise time travel operations could recover deleted records. |
| Conclusion | This use case can be implemented using all three formats, and in each case, you must ensure that your configuration or background pipelines implement the cleanup procedures required to meet your data retention requirements. | | |
Conclusion
Today, no single table format is the best fit for all use cases, and each format has its own unique strengths for specific requirements. It’s important to determine which requirements and use cases are most crucial and select the table format that best meets those needs.
To speed up the selection process of the right table format for your workload, we recommend the following actions:
- Identify what table format is likely a good fit for your workload using the high-level guidance provided in this post
- Perform a proof of concept with the identified table format from the previous step to validate its fit for your specific workload and requirements
Keep in mind that these open table formats are open source and rapidly evolve with new features and enhanced or new integrations, so it can be valuable to also take into consideration product roadmaps when deciding on the format for your workloads.
AWS will continue to innovate on behalf of our customers to support these powerful table formats and to help you be successful with your advanced use cases for analytics in the cloud. For more support on building transactional data lakes on AWS, get in touch with your AWS Account Team, AWS Support, or review the following resources:
About the Authors
Shana Schipers is an Analytics Specialist Solutions Architect at AWS, focusing on big data. She supports customers worldwide in building transactional data lakes using open table formats like Apache Hudi, Apache Iceberg and Delta Lake on AWS.
Ian Meyers is a Director of Product Management for AWS Analytics Services. He works with many of AWS’ largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS, including support for Data Mesh.
Carlos Rodrigues is a Big Data Specialist Solutions Architect at AWS. He helps customers worldwide build transactional data lakes on AWS using open table formats like Apache Hudi and Apache Iceberg.