AWS Storage Blog

How Amazon Ads uses Iceberg optimizations to accelerate their Spark workload on Amazon S3

In today’s data-driven business landscape, organizations are increasingly relying on massive data lakes to store, process, and analyze vast amounts of information. However, as these data repositories grow to petabyte scale, a key challenge for businesses is implementing transactional capabilities on their data lakes efficiently. The sheer volume of data requires immense computational power and time to pre-process, ingest, and access, often resulting in increased costs and longer processing times.

Many organizations are turning to open table formats like Apache Iceberg, which are designed to support database-like features required by transactional workloads. Iceberg has gained popularity for data lakes because it is fast, efficient, and reliable at any scale. At the same time, Amazon S3 is ideal for data lake storage because it is secure, scalable, durable, and highly available. However, optimizing performance for large-scale operations remains a critical concern for businesses.

Amazon Ads faces these challenges head-on as they operate large transactional data lakes on S3. They frequently build and backfill petabyte-scale Iceberg tables into their S3 data lake to meet evolving business needs. In October 2024, Iceberg introduced a default base-2 object store file layout, which helps optimize request scaling performance of large-scale Iceberg workloads running on S3.

In this post, we discuss how Amazon Ads used Iceberg’s new base-2 file layout for S3 to accelerate their Apache Spark workload that runs on Amazon EMR and populates Iceberg tables in S3. In their experiments, they demonstrate how they minimized EMR job failures and manual work, resulting in a 22% decrease in total EMR processing time, and saving up to 20% on compute costs and 32% on storage costs.

How Amazon Ads’ workload is structured

Amazon Ads manages their S3 data lake with Apache Spark jobs running on EMR. To create a brand-new partitioned Iceberg table, they first create a new S3 bucket and prefix. Apache Spark on EMR writes data to S3 in distributed tasks. While backfilling newly created Iceberg tables, each Spark job processes an Iceberg table partition’s worth of data and writes newly processed data to S3 using multipart uploads.

At a high level, each Spark job follows these steps:

  1. Data loading and pre-processing: The input data files, which can be in any structured format (such as CSV, Apache Parquet, or JSON), are loaded into the Spark job, and various ETL tasks such as joins, transformations, and shuffling are performed.
  2. Writes to S3: Each Spark job creates and writes over 100k data files as objects to S3, issuing up to tens of thousands of multipart upload operations to S3 per second. These data files are generally in Parquet or Optimized Row Columnar (ORC) formats.
  3. Commit phase: Spark tries to atomically commit the files to make them part of the Iceberg table in the S3 data lake (a simplified sketch of this write-and-commit flow follows the list). During this phase, Spark also creates and updates metadata files in S3, namely the manifest files and manifest lists, and updates AWS Glue with the latest manifest file. The metadata files keep track of information such as the schema of the Iceberg tables, the partition strategy, and the location of the data files, as well as column-level statistics such as minimum and maximum ranges for the records that are stored in each data file.
  4. Retries: If errors occur during any of the prior phases, the Spark job will attempt to automatically retry the tasks that failed, and escalate to a job-level failure if the maximum retry count is reached. The Spark job must then be manually re-triggered to start from the very beginning.
  5. Orphan files: Any files that were written but not included in the commit due to partial writes, task retries, or job failure before the commit phase is completed will remain in S3 without becoming part of the Iceberg table. This creates ‘orphan files’ in S3 that need to be deleted or managed separately.
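
As a rough illustration of steps 1 through 3, here is a minimal PySpark sketch that loads one hour of input, transforms it, and writes it into an Iceberg table registered in AWS Glue. The catalog name (ads), buckets, and table names are hypothetical placeholders, not Amazon Ads’ actual configuration.

```python
# Minimal PySpark sketch of the load, write, and commit steps described above.
# All bucket, catalog, database, and table names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-backfill-sketch")
    # Register an Iceberg catalog backed by AWS Glue, with data files stored in S3.
    .config("spark.sql.catalog.ads", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ads.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.ads.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.ads.warehouse", "s3://example-ads-datalake/warehouse")
    .getOrCreate()
)

# Step 1: load one hour of input data and apply ETL transforms (joins, shuffles, and so on).
events = spark.read.parquet("s3://example-ads-input/events/hour=2024-10-01-00/")
dimensions = spark.read.parquet("s3://example-ads-input/dimensions/")
enriched = events.join(dimensions, on="campaign_id")

# Steps 2 and 3: Spark tasks upload Parquet data files to S3 with multipart uploads,
# then the driver atomically commits a new Iceberg snapshot, writing the manifest
# files and manifest list and updating the table pointer in AWS Glue.
enriched.writeTo("ads.analytics.impressions").option("write-format", "parquet").append()
```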

The following diagram portrays how the Spark jobs are processed.

Figure 1: High-level depiction of how the Spark jobs are processed

In Amazon Ads’ case, each automatic retry within Spark slows the overall job down and requires the operator to manually monitor and re-trigger Spark jobs that were escalated to a terminal failure. As the write to S3 is the final phase of the ETL process, a failure at that point wastes both time and compute, as they must then manually retry the Spark job from the very beginning, as shown in Figure 2.

Figure 2: Additional time is spent retrying local or job-level failures

In the following section, we focus on how Amazon Ads optimizes S3 request scaling during the write portion of the Spark job. We will also demonstrate how they used the new base-2 entropy in Iceberg to minimize job failures due to writes, reducing EMR processing time and costs, and improving the overall operational efficiency of their workload.

Optimizing request scaling while writing to S3

While populating a new Iceberg table created in a new S3 prefix, Amazon Ads can easily issue tens of thousands of write requests to S3 per second. S3 is highly distributed, and initially allows applications to achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per partitioned S3 prefix in a given bucket. To support higher customer request traffic such as that of Amazon Ads, S3 automatically scales by adding request capacity. However, while S3 is adding request capacity, customers may temporarily receive HTTP 503 (Slow Down) responses until the optimization completes. To take advantage of S3’s request scaling and to accelerate their end-to-end EMR job processing time, Amazon Ads explicitly introduces entropy in object key names by enabling the object store file layout configuration (write.object-storage.enabled) in Iceberg.
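
The object store file layout is controlled by a table property, so it can be enabled when a table is created (or later with an ALTER TABLE ... SET TBLPROPERTIES statement). The following sketch shows one way to do this, reusing the hypothetical ads Glue catalog and SparkSession from the earlier sketch; the write.data.path location is optional and purely illustrative.

```python
# Hedged sketch: creating an Iceberg table with the object store file layout enabled,
# so data files are written under hashed, randomized S3 key prefixes.
# Assumes the SparkSession and hypothetical "ads" Glue catalog from the earlier sketch.
spark.sql("""
    CREATE TABLE ads.analytics.impressions (
        campaign_id   BIGINT,
        impression_ts TIMESTAMP,
        payload       STRING
    )
    USING iceberg
    PARTITIONED BY (hours(impression_ts))
    TBLPROPERTIES (
        -- Spread data files across many hashed S3 prefixes for request scaling.
        'write.object-storage.enabled' = 'true',
        -- Optional and illustrative: keep data files under a dedicated S3 location.
        'write.data.path' = 's3://example-ads-datalake/impressions-data'
    )
""")
```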

Iceberg uses a hash-based file layout to evenly spread traffic across many randomized S3 prefixes. Previously, when the object store file layout configuration was enabled, Iceberg would add a 6-character base-64 hash (e.g., Bn2Hza) to the object key names. With Iceberg v1.7.0, the default file layout has been updated to a 20-character base-2 hash (e.g., 01010110100110110010), which helps customers scale their S3 requests more efficiently by optimizing key distribution for hierarchical scaling. This binary representation allows for a more even distribution of keys across the storage hierarchy, enabling better parallelization and faster data retrieval as systems scale up. As part of this update, Iceberg has also divided the hash entropy into multiple directories. Directories divide the request traffic across workers for faster traversal, improving the efficiency of the orphan file cleanup process for Iceberg. For more details about the hash entropy change in the file layout and example code snippets, refer to the Iceberg AWS Integrations documentation. EMR has also made these changes available early with the bundled Iceberg v1.6.1-amzn-0 in EMR v7.4.
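
As a rough illustration of what this base-2 entropy looks like in practice, the short Python sketch below hashes file names and renders the hash as binary directory components. It is not Iceberg’s actual ObjectStoreLocationProvider, which uses its own hashing; the 4/4/4/8-bit directory grouping simply mimics the example key layout shown in the Iceberg AWS Integrations documentation.

```python
# Illustrative only: shows how rendering a hash of the file name as binary digits,
# split into short directories, spreads objects across many S3 key prefixes.
# This is NOT Iceberg's actual implementation; it only mimics the shape of the keys.
import hashlib

def base2_entropy(file_name: str, bits: int = 20) -> str:
    """Return a binary hash prefix such as '0101/0110/1001/10110010'."""
    digest = int.from_bytes(hashlib.md5(file_name.encode()).digest()[:4], "big")
    binary = format(digest & ((1 << bits) - 1), f"0{bits}b")
    # Group the bits into directories so prefix traversal (for example, during
    # orphan file cleanup) can be divided across workers.
    return "/".join((binary[0:4], binary[4:8], binary[8:12], binary[12:20]))

for name in ("00000-0-data.parquet", "00001-0-data.parquet", "00002-0-data.parquet"):
    print(f"s3://example-ads-datalake/impressions-data/{base2_entropy(name)}/{name}")
```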

It’s important to note that using the object store file layout configuration in Iceberg is not a foolproof mechanism to avoid S3 503 errors. You still need to retry S3 requests that receive HTTP 503 errors so that the workload can make progress after S3 has completed auto-scaling. Starting with Iceberg v1.7.0, Iceberg provides configurable parameters for S3 retries and also introduces better defaults to prevent fast failures. While the default retry value in Iceberg v1.7.0 is set to 5, customers that require high throughput, such as Amazon Ads, can set the retry value higher to fit their use case. For more details about the retry configurations, refer to the Iceberg AWS Integrations documentation.
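
For reference, the retry settings are exposed as catalog properties for Iceberg’s S3FileIO. The sketch below shows one way to raise them on a Spark Iceberg catalog; the property names follow the Iceberg AWS Integrations documentation for v1.7.0 and later, and the specific values are illustrative placeholders, so confirm both against the Iceberg version bundled with your EMR release.

```python
# Hedged sketch: raising S3FileIO retry settings on an Iceberg catalog in Spark.
# Property names follow the Iceberg v1.7.0+ AWS Integrations documentation; the
# specific values shown here (32 retries, wait times) are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.ads", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ads.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.ads.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    # Retry S3 requests that return HTTP 503 up to 32 times instead of the default 5.
    .config("spark.sql.catalog.ads.s3.retry.num-retries", "32")
    # Lower and upper bounds on the wait between retries (placeholder values).
    .config("spark.sql.catalog.ads.s3.retry.min-wait-ms", "500")
    .config("spark.sql.catalog.ads.s3.retry.max-wait-ms", "20000")
    .getOrCreate()
)
```

Amazon Ads landed on a retry value of 32 for their workload, as described in the next section.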

In the next section, we demonstrate how Amazon Ads tested the new base-2 file layout in Iceberg to accelerate their Spark workload on S3.

Testing the Iceberg enhancement for S3

In this real-world demonstration, Amazon Ads created a new Iceberg table and seeded it into a new S3 prefix with 570TB of historical data. The historical data spans 456 Iceberg table partitions and includes 19 days (or 456 hours) worth of data specific to Amazon Ads’ use case. The seeding process is accomplished through 456 individual EMR steps (or Spark jobs), where each step attempts to process and commit one hour of data (up to ~20TB uncompressed, or about 2TB compressed) to the Iceberg table. Amazon Ads found that retrying S3 requests that receive 503 slowdown errors up to 32 times was suitable for their use case. Each EMR step takes around 30 minutes on average to complete and is manually re-triggered in the case of terminal job failures. The objective is to complete all 456 steps as quickly as possible, with as little manual intervention and as few orphan files created in the process as possible.

They ran the experiment twice and compared the results. In the first run, they used the base-64 file layout. In the second run, they updated to use the new base-2 file layout in Iceberg. Compared to the first run, Amazon Ads saw a 22% reduction in total EMR processing time, a 20% reduction in EMR compute costs, zero cases of manual intervention, and a 32% reduction in storage costs due to fewer orphan files created in the second run.

Results from the first run:

Using the base-64 file layout in Iceberg versions prior to the one bundled with the EMR v7.4 release, it took 11.5 hours to perform the end-to-end table seeding process for backfilling 19 days’ worth of data. Amazon Ads sent up to 10,000 write requests per second, which is roughly 3 times S3’s original request capacity of 3,500 PUT/COPY/POST/DELETE requests per second per partitioned prefix. Approximately 50% of EMR jobs were manually retried at some point due to 5xx errors while S3 was adding more request capacity.

Figure 3: Total time spent and EMR failures during the first run of the experiment

The S3 error rates decreased gradually over time.

Figure 4: S3 5xx errors decreased gradually over time during the first run of the experiment

Results from the second run:

Amazon Ads re-ran the same experiment after updating to the Iceberg version bundled with EMR v7.4, which includes the new binary (base-2) file layout. The same EMR seeding process took just 9 hours to run, which is 22% faster than the first run. This was a significant improvement for Amazon Ads, because it meant they could achieve that acceleration without adding compute. They saw 77% fewer 5xx errors from S3 and zero EMR step-level failures. As a result, no manual intervention was required in this run of the experiment.

Figure 5: 22% less time spent and zero EMR failures during the second run of the experiment

Notably, the S3 errors subsided within just an hour of the seeding operation beginning. The additional time saved resulted in compute savings of 20%. Amazon Ads also saw a 74% reduction in excess data uploaded to S3 as orphan files, resulting in a 32% storage cost saving compared to the first run.

Figure 6: S3 5xx errors dropped off within the first hour during the second run of the experiment

Conclusion

In this post, we discussed a new base-2 file layout in Iceberg, which helps optimize request scaling performance of large-scale Iceberg workloads running on Amazon S3. Using the base-2 file layout, we demonstrated how Amazon Ads minimized manual work in their workload and reduced the total compute time of their EMR job by 22% (from 11.5 hours to 9 hours). This helped them achieve a 20% reduction in compute costs and a 32% reduction in S3 storage costs.

Learn more about the new base-2 configuration in Apache Iceberg or in Apache Iceberg bundled with EMR. For general guidance on optimizing Iceberg workloads built on S3 and AWS, refer to the following blog posts:

  1. Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes
  2. Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog
  3. Amazon EMR 7.1 runtime for Apache Spark and Iceberg can run Spark workloads 2.7 times faster than Apache Spark 3.5.1 and Iceberg 1.5.2
  4. Apache Iceberg optimization: Solving the small files problem in Amazon EMR
Zach Dischner

Zach Dischner is a Senior Software Engineer in Amazon Advertising, specializing in all things big data. He works on cutting edge Iceberg-based data lakes and applications and enjoys building highly scalable solutions on AWS. Outside of work he is an avid audiobook consumer and loves to travel, with a bent towards kitesurfing and mountain biking destinations.

Dhanika Sujan

Dhanika Sujan is a Product Manager on the Amazon S3 team at AWS. She enjoys hearing from customers and helping them simplify their storage solutions on AWS. Dhanika is based in Seattle, and outside of work likes to travel, read, explore astronomy, and discover new music.

Mehar Sawhney

Mehar Sawhney is a Software Development Manager at Amazon, working on advanced cloud storage solutions. She leads initiatives to improve data accessibility and processing for artificial intelligence and large-scale analytics. Outside of work, Mehar is an avid runner and winter sports enthusiast. She balances her tech-focused career with creative pursuits like baking and enjoys introducing friends to the invigorating practice of cold plunging.

Ozan Okumusoglu

Ozan Okumusoglu is a Principal Software Engineer on the Amazon S3 team at AWS. With a passion for tackling complex challenges in large-scale distributed systems, Ozan enjoys the dynamic environment of cloud computing. Outside of work, Ozan likes to stay active by going to the gym and playing his favorite sports: soccer and tennis. As time permits, he also enjoys unwinding with computer games and catching up on movies of all genres.

Garry Galinsky

Garry Galinsky is a Solutions Architect at Amazon Web Services. He has over 20 years of progressive high-tech experience in the telecommunications, local search, Internet, and financial industries. During his career, Garry has successfully designed, developed, delivered, and evangelized multiple Internet, Web 2.0, local and social search, and mobile solutions for small, medium, and large corporate customers.