Apache Hadoop on Amazon EMR

Why Apache Hadoop on EMR?

Apache™ Hadoop® is an open source software project that can be used to efficiently process large datasets. Instead of using one large computer to process and store the data, Hadoop allows clustering commodity hardware together to analyze massive data sets in parallel.

There are many applications and execution engines in the Hadoop ecosystem, providing a variety of tools to match the needs of your analytics workloads. Amazon EMR makes it easy to create and manage fully configured, elastic clusters of Amazon EC2 instances running Hadoop and other applications in the Hadoop ecosystem.

Applications and frameworks in the Hadoop ecosystem

Hadoop commonly refers to the actual Apache Hadoop project, which includes MapReduce (execution framework), YARN (resource manager), and HDFS (distributed storage). You can also install Apache Tez, a next-generation framework which can be used instead of Hadoop MapReduce as an execution engine. Amazon EMR also includes EMRFS, a connector allowing Hadoop to use Amazon S3 as a storage layer.

However, there are also other applications and frameworks in the Hadoop ecosystem, including tools that enable low-latency queries, GUIs for interactive querying, a variety of interfaces like SQL, and distributed NoSQL databases. The Hadoop ecosystem includes many open source tools designed to build additional functionality on Hadoop core components, and you can use Amazon EMR to easily install and configure tools such as Hive, Pig, Hue, Ganglia, Oozie, and HBase on your cluster. You can also run other frameworks, like Apache Spark for in-memory processing, or Presto for interactive SQL, in addition to Hadoop on Amazon EMR.

Hadoop: the basic components

Amazon EMR programmatically installs and configures applications in the Hadoop project, including Hadoop MapReduce, YARN, HDFS, and Apache Tez across the nodes in your cluster.

Hadoop MapReduce and Tez, execution engines in the Hadoop ecosystem, process workloads using frameworks that break down jobs into smaller pieces of work that can be distributed across nodes in your Amazon EMR cluster. They are built with the expectation that any given machine in your cluster could fail at any time and are designed for fault tolerance. If a server running a task fails, Hadoop reruns that task on another machine until completion.

You can write MapReduce and Tez programs in Java, use Hadoop Streaming to execute custom scripts in a parallel fashion, utilize Hive and Pig for higher level abstractions over MapReduce and Tez, or other tools to interact with Hadoop.

Starting with Hadoop 2, resource management is managed by Yet Another Resource Negotiator (YARN). YARN keeps track of all the resources across your cluster, and it ensures that these resources are dynamically allocated to accomplish the tasks in your processing job. YARN is able to manage Hadoop MapReduce and Tez workloads as well as other distributed frameworks such as Apache Spark.

By using the EMR File System (EMRFS) on your Amazon EMR cluster, you can leverage Amazon S3 as your data layer for Hadoop. Amazon S3 is highly scalable, low cost, and designed for durability, making it a great data store for big data processing. By storing your data in Amazon S3, you can decouple your compute layer from your storage layer, allowing you to size your Amazon EMR cluster for the amount of CPU and memory required for your workloads instead of having extra nodes in your cluster to maximize on-cluster storage. Additionally, you can terminate your Amazon EMR cluster when it is idle to save costs, while your data remains in Amazon S3.

EMRFS is optimized for Hadoop to directly read and write in parallel to Amazon S3 performantly, and can process objects encrypted with Amazon S3 server-side and client-side encryption. EMRFS allows you to use Amazon S3 as your data lake, and Hadoop in Amazon EMR can be used as an elastic query layer.

Hadoop also includes a distributed storage system, the Hadoop Distributed File System (HDFS), which stores data across local disks of your cluster in large blocks. HDFS has a configurable replication factor (with a default of 3x), giving increased availability and durability. HDFS monitors replication and balances your data across your nodes as nodes fail and new nodes are added.

HDFS is automatically installed with Hadoop on your Amazon EMR cluster, and you can use HDFS along with Amazon S3 to store your input and output data. You can easily encrypt HDFS using an Amazon EMR security configuration. Also, Amazon EMR configures Hadoop to uses HDFS and local disk for intermediate data created during your Hadoop MapReduce jobs, even if your input data is located in Amazon S3.

Advantages of Hadoop on Amazon EMR

You can initialize a new Hadoop cluster dynamically and quickly, or add servers to your existing Amazon EMR cluster, significantly reducing the time it takes to make resources available to your users and data scientists. Using Hadoop on the AWS platform can dramatically increase your organizational agility by lowering the cost and time it takes to allocate resources for experimentation and development.

Hadoop configuration, networking, server installation, security configuration, and ongoing administrative maintenance can be a complicated and challenging activity. As a managed service, Amazon EMR addresses your Hadoop infrastructure requirements so you can focus on your core business.

You can easily integrate your Hadoop environment with other services such as Amazon S3Amazon KinesisAmazon Redshift, and Amazon DynamoDB to enable data movement, workflows, and analytics across the many diverse services on the AWS platform. Additionally, you can use the AWS Glue Data Catalog as a managed metadata repository for Apache Hive and Apache Spark.

Many Hadoop jobs are spiky in nature. For instance, an ETL job can run hourly, daily, or monthly, while modeling jobs for financial firms or genetic sequencing may occur only a few times a year. Using Hadoop on Amazon EMR allows you to spin up these workload clusters easily, save the results, and shut down your Hadoop resources when they’re no longer needed, to avoid unnecessary infrastructure costs. EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the EMR cluster host or inside a Docker container. Please see our documentation to learn more.

By using Hadoop on Amazon EMR, you have the flexibility to launch your clusters in any number of Availability Zones in any AWS region. A potential problem or threat in one region or zone can be easily circumvented by launching a cluster in another zone in minutes.

Capacity planning prior to deploying a Hadoop environment can often result in expensive idle resources or resource limitations. With Amazon EMR, you can create clusters with the required capacity within minutes and use EMR Managed Scaling to dynamically scale out and scale in nodes.

How are Hadoop and big data related?

Hadoop is commonly used to process big data workloads because it is massively scalable. To increase the processing power of your Hadoop cluster, add more servers with the required CPU and memory resources to meet your needs.

Hadoop provides a high level of durability and availability while still being able to process computational analytical workloads in parallel. The combination of availability, durability, and scalability of processing makes Hadoop a natural fit for big data workloads. You can use Amazon EMR to create and configure a cluster of Amazon EC2 instances running Hadoop within minutes, and begin deriving value from your data.

Use cases

Apache and Hadoop are trademarks of the Apache Software Foundation.

Hadoop can be used to analyze clickstream data in order to segment users and understand user preferences. Advertisers can also analyze clickstreams and advertising impression logs to deliver more effective ads.

Learn how Razorfish uses Hadoop on Amazon EMR for clickstream analysis

Hadoop can be used to process logs generated by web and mobile applications. Hadoop helps you turn petabytes of un-structured or semi-structured data into useful insights about your applications or users.

Learn how Yelp uses Hadoop on Amazon EMR to drive key website features

 

Hadoop ecosystem applications like Hive allow users to leverage Hadoop MapReduce using a SQL interface, enabling analytics at a massive scale, distributed, and fault-tolerant data warehousing. Use Hadoop to store your data and allow your users to send queries at data of any size.

Watch how Netflix uses Hadoop on Amazon EMR to run a petabyte scale data warehouse

Hadoop can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. AWS has made the 1000 Genomes Project data publicly available to the community free of charge.

Read more about Genomics on AWS

 

Given its massive scalability and lower costs, Hadoop is ideally suited for common ETL workloads such as collecting, sorting, joining, and aggregating big datasets for easier consumption by downstream systems.

Read how Euclid uses Hadoop on Amazon EMR for ETL and data aggregation