AWS Big Data Blog
Using IPython Notebook to Analyze Data with Amazon EMR
Manjeet Chayel is a Solutions Architect with AWS
IPython Notebook is a web-based interactive environment that lets you combine code, code execution, mathematical functions, rich documentation, plots, and other elements into a single document. In the background, IPython Notebook stores this information as a JSON document.
The main advantage of a notebook over a traditional REPL or a traditional write/upload/test workflow is that you can mix interactive content with images and plots. In data analysis, this is useful for exploratory plotting, preparing quick prototypes, and sharing them with colleagues. This is where IPython shines: when you share a notebook, you share the code in an organized manner that provides the context and enables your colleagues to experiment. The IPython website has good documentation and examples to help you get started with IPython. By running a notebook on Amazon EMR, you can quickly do analytics on your dataset by running Hadoop jobs and then plotting the results.
This blog post shows you how to launch an EMR cluster running IPython Notebook, connect to a notebook from your browser, use Hadoop Streaming for analysis, and display the results on a graph. The IPython bootstrap action installs IPython Notebook and its dependencies on the master node, along with packages for basic scientific computation. (See the AWS documentation for more information about bootstrap actions.)
You can launch an EMR cluster using the following command:
aws emr create-cluster --name iPythonNotebookEMR --ami-version 3.2.3 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=<<MYKEY>> --use-default-roles --bootstrap-actions Path=s3://elasticmapreduce.bootstrapactions/ipython-notebook/install-ipython-notebook,Name=Install_iPython_NB --termination-protected
Note: If you have not created default IAM roles for EMR, you can do so using the aws emr create-default-roles command. On AWS CLI version 1.7.17 or later, this command also adds values to the AWS CLI config file that specify the default IAM roles (service role and instance profile) for use in the create-cluster command. If these values are set in the AWS CLI config file, you don’t need to include the --use-default-roles shortcut in your create-cluster command as shown in the example above.
You can also use the EMR console and select the following bootstrap action:
s3://elasticmapreduce.bootstrapactions/ipython-notebook/install-ipython-notebook
Once the cluster is running, the notebook server listens on port 8192 on the master node. You can connect to it by opening an SSH tunnel from your local machine to your EMR master node. The following example shows how to open a tunnel to your master node:
ssh -o ServerAliveInterval=10 -i <<credentials.pem>> -N -L 8192:<<master-public-dns-name>>:8192 hadoop@<<master-public-dns-name>>
After you open the tunnel, open your browser and point to the following URL to access the notebook:
http://localhost:8192
After the page opens, choose New Notebook.
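If the page does not open, one quick thing to check is whether the local end of the tunnel is listening at all. The following is a minimal sketch, not part of the original walkthrough; the helper name port_is_open is hypothetical, and it is meant to run on your local machine.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Connection refused, host unreachable, or timed out.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # 8192 is the local port forwarded by the ssh -L command above.
    print("tunnel reachable: {0}".format(port_is_open("localhost", 8192)))
```

A True result only shows that something is listening on the local port. It does not prove that the notebook server is answering on the master node behind the tunnel.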
Download the Word Count Code
First, download the word count code to the master node. IPython lets you run shell commands from a notebook cell by prefixing them with an exclamation mark (!), so you can call wget directly from the page:
!wget https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/wordSplitter.py
After downloading the word count code, you can execute it like any other Hadoop job. The input for this program is read from Amazon S3 and the output is written to HDFS.
Run the MapReduce Program
!hadoop jar /home/hadoop/contrib/streaming/hadoop-*streaming*.jar -files wordSplitter.py -mapper wordSplitter.py -reducer aggregate -input s3://elasticmapreduce/samples/wordcount/input -output /output
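To make the pairing of -mapper wordSplitter.py with -reducer aggregate more concrete, here is a sketch of what a word-splitting mapper for the aggregate reducer might look like. It is an illustration and not necessarily identical to the downloaded wordSplitter.py; the token pattern and the function names map_line and run_mapper are assumptions. The aggregate reducer sums the values of records whose key carries the LongValueSum: prefix.

```python
import re
import sys

# Assumed token rule: a letter followed by letters or digits.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_line(line):
    """Return the records a word-count mapper would emit for one input line."""
    # The "LongValueSum:" prefix asks the aggregate reducer to sum the
    # values ("1" per occurrence) for each distinct key.
    return ["LongValueSum:{0}\t1".format(word.lower())
            for word in WORD.findall(line)]


def run_mapper(lines, out):
    """Write one tab-separated record per word to out."""
    for line in lines:
        for record in map_line(line):
            out.write(record + "\n")


# A streaming mapper script would call run_mapper(sys.stdin, sys.stdout).
# Here a fixed sample line is used so the sketch can run anywhere.
run_mapper(["Hello hello EMR"], sys.stdout)
```

With this record format, the reducer output holds one line per distinct word, with the word and its total count separated by a tab.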
After the Hadoop job completes, you can check the output files in HDFS on your EMR cluster. The next step is to plot the results on a bar graph. The sample code can be downloaded from the AWS Big Data Blog GitHub repository. (Remember to preserve spaces if you are copying and pasting the code!)
The screen shot below shows the results after running the sample code.
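For readers who do not have the repository at hand, a plot of this kind could be sketched roughly as follows. This is not the sample code from the repository. It assumes the job output is tab-separated word and count pairs under /output, that the hadoop command is on the path of the notebook process, and that matplotlib is installed on the master node; all function names are hypothetical.

```python
import subprocess


def parse_counts(text):
    """Parse 'word<TAB>count' lines into a list of (word, count) tuples."""
    counts = []
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, value = parts
        try:
            counts.append((word, int(value)))
        except ValueError:
            continue  # skip lines whose count is not an integer
    return counts


def top_n(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return sorted(counts, key=lambda wc: wc[1], reverse=True)[:n]


def read_hdfs_output(path="/output/part-*"):
    """Read the job output from HDFS (assumes the hadoop CLI is available)."""
    raw = subprocess.check_output(["hadoop", "fs", "-cat", path])
    return raw.decode("utf-8")


def plot_top_words(counts, n=10):
    """Draw a bar graph of the n most frequent words."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    top = top_n(counts, n)
    positions = list(range(len(top)))
    plt.bar(positions, [count for _, count in top])
    plt.xticks(positions, [word for word, _ in top], rotation=45)
    plt.ylabel("Occurrences")
    plt.title("Top {0} words".format(len(top)))
    plt.show()


# In a notebook cell on the master node, after the job has finished:
# plot_top_words(parse_counts(read_hdfs_output()))
```

Depending on how the notebook server is configured, you may need to run %matplotlib inline in a cell first so that the graph is rendered in the page.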
Clean Up
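After you finish working with IPython Notebook, terminate your cluster using the console or the AWS CLI to avoid incurring additional costs. Because the example cluster was launched with --termination-protected, you must disable termination protection before the cluster can be terminated.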
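The launch command shown earlier passes --termination-protected, so cleanup from the AWS CLI takes two steps: turn protection off, then terminate. The sketch below builds those two commands and, by default, only prints them. The function names and the cluster ID are hypothetical placeholders; substitute the ID returned by create-cluster.

```python
import subprocess


def cleanup_commands(cluster_id):
    """Return the AWS CLI commands that unprotect and terminate a cluster."""
    return [
        ["aws", "emr", "modify-cluster-attributes",
         "--cluster-id", cluster_id, "--no-termination-protected"],
        ["aws", "emr", "terminate-clusters", "--cluster-ids", cluster_id],
    ]


def terminate_cluster(cluster_id, dry_run=True):
    """Print the cleanup commands, or run them when dry_run is False."""
    for command in cleanup_commands(cluster_id):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.check_call(command)


# Hypothetical cluster ID; with dry_run=True nothing is sent to AWS.
terminate_cluster("j-EXAMPLE12345")
```

Keeping the dry run as the default makes it harder to terminate the wrong cluster by accident; pass dry_run=False once the printed commands look right.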
Conclusion
In this post, you learned how to launch an EMR cluster running IPython Notebook, connect to a notebook from your browser, use Hadoop Streaming for analysis, and display the results on a graph. You can now use the power of IPython to organize your code and quickly share it so that others can easily understand the context and experiment.
If you have questions or suggestions, please leave a comment below.