AWS Big Data Blog

Tag: Amazon EMR

Exploring Geospatial Intelligence using SparkR on Amazon EMR

Gopal Wunnava is a Senior Consultant with AWS Professional Services The number of data sources that use location, such as smartphones and sensory devices used in IoT (Internet of things), is expanding rapidly. This explosion has increased demand for analyzing spatial data. Geospatial intelligence (GEOINT) allows you to analyze data that has geographical or spatial […]

Will Spark Power the Data behind Precision Medicine?

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services. This post was co-authored by Ujjwal Ratan, a Solutions Architect with Amazon Web Services. ——————————— “And that’s the promise of precision medicine — delivering the right treatments, at the right time, every time to the right person.“ (President Obama, 2015 State […]

Crunching Statistics at Scale with SparkR on Amazon EMR

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services. This post is co-authored by Gopal Wunnava, a Senior Consultant with AWS Professional Services. SparkR is an R package that allows you to integrate complex statistical analysis with large datasets. In this blog post, we introduce you running R with the […]

Anomaly Detection Using PySpark, Hive, and Hue on Amazon EMR

Veronika Megler, Ph.D., is a Senior Consultant with AWS Professional Services We are surrounded by more and more sensors – some of which we’re not even consciously aware. As sensors become cheaper and easier to connect, they create an increasing flood of data that’s getting cheaper and easier to store and process. However, sensor readings […]

Import Zeppelin notes from GitHub or JSON in Zeppelin 0.5.6 on Amazon EMR

Jonathan Fritz is a Senior Product Manager for Amazon EMR Many Amazon EMR customers use Zeppelin to create interactive notebooks to run workloads with Spark using Scala, Python, and SQL. These customers have found Amazon EMR to be a great platform for running Zeppelin because of strong integration with other AWS services and the ability […]

Analyze Your Data on Amazon DynamoDB with Apache Spark

Manjeet Chayel is a Solutions Architect with AWS Every day, tons of customer data is generated, such as website logs, gaming data, advertising data, and streaming videos. Many companies capture this information as it’s generated and process it in real time to understand their customers. Amazon DynamoDB is a fast and flexible NoSQL database service […]

Submitting User Applications with spark-submit

Francisco Oliveira is a consultant with AWS Professional Services Customers starting their big data journey often ask for guidelines on how to submit user applications to Spark running on Amazon EMR. For example, customers ask for guidelines on how to size memory and compute resources available to their applications and the best resource allocation model […]

Turning Amazon EMR into a Massive Amazon S3 Processing Engine with Campanile

Michael Wallman is a senior consultant with AWS ProServ Have you ever had to copy a huge Amazon S3 bucket to another account or region? Or create a list based on object name or size? How about mapping a function over millions of objects? Amazon EMR to the rescue! EMR allows you to deploy large […]

Running an External Zeppelin Instance using S3 Backed Notebooks with Spark on Amazon EMR

Dominic Murphy is an Enterprise Solution Architect with Amazon Web Services Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results. Zeppelin notebooks can be shared among several users, […]