AWS Big Data Blog
Category: Amazon EMR
Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS
September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created […]
Secure Amazon EMR with Encryption
In the last few years, there has been a rapid rise in enterprises adopting the Apache Hadoop ecosystem for critical workloads that process sensitive or highly confidential data. Because these workloads are so critical, enterprises implement organization- or industry-wide policies as well as regulatory or compliance policies. Such policy requirements are designed […]
Create a Healthcare Data Hub with AWS and Mirth Connect
As anyone visiting their doctor may have noticed, gone are the days of physicians recording their notes on paper. Physicians are more likely to enter the exam room with a laptop than with paper and pen. This change is the byproduct of efforts to improve patient outcomes, increase efficiency, and drive population health. Pushing for […]
Serving Real-Time Machine Learning Predictions on Amazon EMR
The typical progression for creating and using a trained model for recommendations falls into two general areas: training the model and hosting the model. Model training has become a well-known standard practice. We want to highlight one of many ways to host those recommendations (for example, see the Analyzing Genomics Data at Scale using R, […]
Run Jupyter Notebook and JupyterHub on Amazon EMR
NOTE: As of EMR 5.14.0, JupyterHub is an officially supported application. We recommend you use the most recent version of EMR if you would like to run JupyterHub on EMR. In addition, EMR Notebooks allow you to create and open Jupyter notebooks with the Amazon EMR console. We will not provide any […]
Respond to State Changes on Amazon EMR Clusters with Amazon CloudWatch Events
Jonathan Fritz is a Senior Product Manager for Amazon EMR. Customers can take advantage of the Amazon EMR API to create and terminate EMR clusters, scale clusters using Auto Scaling or manual resizing, and submit and run Apache Spark, Apache Hive, or Apache Pig workloads. These decisions are often triggered from cluster state-related information. Previously, […]
Using SaltStack to Run Commands in Parallel on Amazon EMR
Miguel Tormo is a Big Data Support Engineer in AWS Premium Support. Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. Amazon EMR defines three types of nodes: master node, core nodes, and task nodes. It’s common to […]
Implementing Authorization and Auditing using Apache Ranger on Amazon EMR
Updated 3/30/2022: Amazon EMR has announced official support of Apache Ranger. Open-source plugin support will not be maintained moving forward, and compatibility with the latest versions will not be tested. We recommend that customers move to the Amazon EMR support for Apache Ranger. Ranger Presto plugin support on EMR has been deprecated. Updated 12/03/2020: Support for […]
Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3
John Hitchingham is Director of Performance Engineering at FINRA. The Financial Industry Regulatory Authority (FINRA) is a private sector regulator responsible for analyzing 99% of the equities and 65% of the option activity in the US. In order to look for fraud, market manipulation, insider trading, and abuse, FINRA’s technology group has developed a robust […]
Dynamically Scale Applications on Amazon EMR with Auto Scaling
Jonathan Fritz is a Senior Product Manager for Amazon EMR. Customers running Apache Spark, Presto, and the Apache Hadoop ecosystem take advantage of Amazon EMR’s elasticity to save costs by terminating clusters after workflows are complete and resizing clusters with low-cost Amazon EC2 Spot Instances. For instance, customers can create clusters for daily ETL or machine learning […]