Apache Hadoop on Amazon EMR
Why Apache Hadoop on EMR?
Apache™ Hadoop® is an open source software project that can be used to efficiently process large datasets. Instead of using one large computer to process and store the data, Hadoop allows clustering commodity hardware together to analyze massive data sets in parallel.
There are many applications and execution engines in the Hadoop ecosystem, providing a variety of tools to match the needs of your analytics workloads. Amazon EMR makes it easy to create and manage fully configured, elastic clusters of Amazon EC2 instances running Hadoop and other applications in the Hadoop ecosystem.
How are Hadoop and big data related?
Hadoop is commonly used to process big data workloads because it is massively scalable. To increase the processing power of your Hadoop cluster, add more servers with the required CPU and memory resources to meet your needs.
Hadoop provides a high level of durability and availability while still being able to process computational analytical workloads in parallel. The combination of availability, durability, and scalability of processing makes Hadoop a natural fit for big data workloads. You can use Amazon EMR to create and configure a cluster of Amazon EC2 instances running Hadoop within minutes, and begin deriving value from your data.