Amazon EMR | AWS Big Data Blog

Deploy Apache YuniKorn batch scheduler for Amazon EMR on EKS

This post explores Kubernetes scheduling fundamentals, examines the limitations of the default kube-scheduler for batch workloads, and demonstrates how YuniKorn addresses these challenges. We discuss how to deploy YuniKorn as a custom scheduler for Amazon EMR on EKS, its integration with job submissions, how to configure queues and placement rules, and how to establish resource quotas. We also show these features in action through practical Spark job examples.

How Ancestry optimizes a 100-billion-row Iceberg table

This is a guest post by Thomas Cardenas, Staff Software Engineer at Ancestry, in partnership with AWS. Ancestry, the global leader in family history and consumer genomics, uses family trees, historical records, and DNA to help people on their journeys of personal discovery. Ancestry has the largest collection of family history records, consisting of 40 […]

Improve Amazon EMR HBase availability and tail latency using generational ZGC

Large-scale HBase deployments on Amazon EMR suffer from unpredictable garbage collection behavior that creates performance bottlenecks for business-critical applications. To solve this problem, Amazon EMR leverages Oracle’s generational ZGC technology from JDK 21 to deliver predictable, sub-millisecond pause times. This post shows you how to configure generational ZGC in Amazon EMR 7.10.0, apply performance tuning methods, and optimize HBase RegionServer garbage collection settings.

Achieve low-latency data processing with Amazon EMR on AWS Local Zones

By deploying Amazon EMR on AWS Local Zones, organizations can process data with single-digit millisecond latency while maintaining data residency compliance. This post demonstrates how to use AWS Local Zones to deploy EMR clusters closer to your users.

Enhance Amazon EMR observability with automated incident mitigation using Amazon Bedrock and Amazon Managed Grafana

In this post, we demonstrate how to integrate real-time monitoring with AI-powered remediation suggestions, combining Amazon Managed Grafana for visualization, Amazon Bedrock for intelligent response recommendations, and AWS Systems Manager for automated remediation actions on Amazon Web Services (AWS).

Use Databricks Unity Catalog Open APIs for Spark workloads on Amazon EMR

In this post, we demonstrate the powerful interoperability between Amazon EMR and Databricks Unity Catalog by walking through how to enable external access to Unity Catalog, configure EMR Spark to connect seamlessly with Unity Catalog, and perform DML and DDL operations on Unity Catalog tables using EMR Serverless.

RocksDB 101: Optimizing stateful streaming in Apache Spark with Amazon EMR and AWS Glue

This post explores RocksDB’s key features and demonstrates its implementation using Spark on Amazon EMR and AWS Glue, providing you with the knowledge you need to scale your real-time data processing capabilities.

Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0

In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the SQL dialect to the view for Athena, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.

Centralize Apache Spark observability on Amazon EMR on EKS with external Spark History Server

This post demonstrates how to centralize Apache Spark observability using Spark History Server (SHS) on EMR on EKS. We show how to enhance SHS with performance monitoring tools, using a pattern that applies to many monitoring solutions, such as SparkMeasure and DataFlint.

Build a secure serverless streaming pipeline with Amazon MSK Serverless, Amazon EMR Serverless and IAM

This post demonstrates an end-to-end solution for processing data from MSK Serverless using an EMR Serverless Spark Streaming job, secured with IAM authentication. It also shows how to query the processed data using Amazon Athena, which enables near real-time querying of the latest data processed by the pipeline.
