Amazon EMR | AWS Big Data Blog

How to migrate a Hue database from an existing Amazon EMR cluster

This post describes the step-by-step process for migrating the Hue database from an existing EMR cluster.

Easily manage table metadata for Presto running on Amazon EMR using the AWS Glue Data Catalog

In this post, we will explore how the AWS Glue Data Catalog addresses the discoverability and manageability of table metadata for Presto on Amazon EMR.

Build a Multi-Tenant Amazon EMR Cluster with Kerberos, Microsoft Active Directory Integration and IAM Roles for EMRFS

In this post, we will discuss what EMRFS authorization is (Amazon S3 storage-level access control) and show how to configure the role mappings with detailed examples.

Dynamically Create Friendly URLs for Your Amazon EMR Web Interfaces

This solution provides a serverless approach to automatically assigning a friendly name for your EMR cluster for easy access to popular notebooks and other web interfaces.

Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory

This post walks you through the process of using AWS CloudFormation to set up a cross-realm trust and extend authentication from an Active Directory network into an Amazon EMR cluster with Kerberos enabled. By establishing a cross-realm trust, Active Directory users can use their Active Directory credentials to access an Amazon EMR cluster and run jobs as themselves.

Custom Log Presto Query Events on Amazon EMR for Auditing and Performance Insights

In this blog post, we will demonstrate how to implement and install a Presto event listener for custom logging, debugging, and performance analysis of queries executed on an EMR cluster.

Genomic Analysis with Hail on Amazon EMR and Amazon Athena

For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.

Create Custom AMIs and Push Updates to a Running Amazon EMR Cluster Using Amazon EC2 Systems Manager

In this post, I show how Systems Manager Automation can be used to automate the creation and patching of custom Amazon Linux AMIs for EMR. I also show how you can use Run Command to send commands to all nodes of a running EMR cluster.

Building a Real World Evidence Platform on AWS

Deriving insights from large datasets is central to nearly every industry, and life sciences is no exception. To combat the rising cost of bringing drugs to market, pharmaceutical companies are looking for ways to optimize their drug development processes. They are turning to big data analytics to better quantify the effect that their drug compounds […]

Turbocharge your Apache Hive Queries on Amazon EMR using LLAP

NOTE: Starting with release emr-6.0.0, Hive LLAP is officially supported as a YARN service, so setting up LLAP with the instructions from this blog post (using a bootstrap action script) is not needed for emr-6.0.0 and later releases.

Apache Hive is one of the most popular tools for analyzing large datasets stored in a Hadoop […]
