AWS Big Data Blog
Using Amazon EMR with SQL Workbench and other BI Tools
This is a guest post by Kyle Porter, a Sales Engineer at Simba Technologies.
Jon Einkauf, a Senior Product Manager for Amazon Elastic MapReduce, and Jeff Slone, an AWS Senior Technical Writer, also contributed to this post.
—————-
Note: Ports have changed on EMR 4.x. Before walking through this post, please consult the EMR documentation to make sure you are connecting to the correct port.
—————-
Many customers use business intelligence tools like SQL Workbench to query data with Amazon Elastic MapReduce (Amazon EMR). These tools are simple to set up and make it easier to develop and run queries. This post shows you how to use SQL Workbench to query sample Amazon CloudFront access logs stored in Amazon Simple Storage Service (Amazon S3) using Hive 0.13 on Amazon EMR.
Amazon EMR employs drivers from Simba Technologies to connect to client JDBC and ODBC applications. The example in this blog post uses the Hive JDBC driver, but Amazon EMR also provides drivers for Hive ODBC, Impala JDBC, Impala ODBC, and HBase ODBC, so you can use a variety of other tools and applications. All of these drivers are available for download from AWS. This tutorial assumes that you already have an Amazon EMR cluster with Hive running. If you aren’t sure how to do this, the Amazon EMR documentation provides instructions.
There are four easy steps to set everything up and start running your queries:
- Download and extract the Hive JDBC drivers
- Create an SSH tunnel to the Amazon EMR master node
- Use SQL Workbench to connect to the master node via the JDBC driver
- Create a Hive table and query the data
Download and Extract the Hive JDBC Drivers
- Download the Hive JDBC driver.
- Extract HiveJDBC.zip into a suitable directory (for example, HiveJDBC).
- Navigate to HiveJDBC\AmazonHiveJDBC\1.0.0.1000 and extract Amazon_HiveJDBC3_1.0.0.1000.zip.
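If you prefer working from a terminal, the extraction steps look roughly like the following. This is a minimal sketch; the paths are illustrative reconstructions based on the file names above, so adjust them to wherever you saved the download.

# Unpack the outer driver package into a working directory
unzip HiveJDBC.zip -d HiveJDBC
# Unpack the inner package that contains the actual driver jars
unzip "HiveJDBC/AmazonHiveJDBC/1.0.0.1000/Amazon_HiveJDBC3_1.0.0.1000.zip" -d HiveJDBC/jars
# These jars are what SQL Workbench loads in a later step
ls HiveJDBC/jars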
Create an SSH Tunnel to the Amazon EMR Master Node
You can create a secure connection to an Amazon EMR master node using an SSH client such as PuTTY. Using an SSH client, you can also create a tunnel to connect to a specific port on the master node. To establish a connection with the Hive 0.13 JDBC port on the master node, you must create an SSH tunnel to port 10000. For Hive 0.11, you must create an SSH tunnel to port 10004.
Step 1: Retrieve the master public DNS name
Before establishing your SSH connection, you must retrieve the public DNS name of your cluster’s master node. The master public DNS name is located at the top of the Cluster Detail page in the console.
See AWS documentation for detailed instructions on retrieving the master public DNS name.
Note: You can also retrieve the master public DNS using the AWS CLI by typing the following command:
aws emr describe-cluster --cluster-id xxxxxxxx
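Because describe-cluster returns a large JSON document, you can use the CLI’s global --query option to pull out just the DNS name. The cluster ID below is a placeholder; substitute your own:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX \
    --query Cluster.MasterPublicDnsName --output text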
Step 2: Create an SSH tunnel to the master node
There are a number of ways to create an SSH tunnel. The instructions in this post show you how to use PuTTY (on Windows) to create a tunnel using local port forwarding. Local port forwarding lets you specify a local port whose traffic is forwarded to the identified remote port on the master node. The Amazon EMR documentation describes how to create an SSH tunnel to the master node on other platforms.
To create an SSH tunnel in PuTTY:
- Double-click putty.exe to start PuTTY. You can also launch PuTTY from the Windows programs list.
- If necessary, click Session in the Category list.
- In the Host Name or IP address field, type hadoop@MasterPublicDNS. For example: hadoop@ec2-###-##-##-###.compute-1.amazonaws.com.
- In the Category list, expand Connection > SSH and click Auth.
- For Private key file for authentication, click Browse and select your EC2 key pair private key file (ending in .ppk for PuTTY or .pem for OpenSSH).
- In the Category list, expand Connection > SSH and click Tunnels.
- In the Source port field:
- For Hive 0.13 (AMI 3.2 and higher), type 10000.
- For Hive 0.11 (AMI 3.1.1 and lower), type 10004.
- In the Destination field:
- For Hive 0.13 (AMI 3.2 and higher), type MasterPublicDNS:10000. For example: ec2-###-##-##-###.compute-1.amazonaws.com:10000 (this identifies the Hive JDBC port 10000 as the tunnel destination on the master node).
- For Hive 0.11 (AMI 3.1.1 and lower), type MasterPublicDNS:10004 (this identifies the Hive JDBC port 10004 as the tunnel destination on the master node).
- Leave the Local and Auto options selected.
- Click Add. You should see an entry in the Forwarded ports box similar to: L10000 ec2-###-##-##-###.compute-1.amazonaws.com:10000.
- Click Open and then click Yes to dismiss the PuTTY security alert. The tunnel remains active until you close PuTTY.
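If you are working on Linux or Mac OS X instead, the equivalent tunnel can be opened with a single OpenSSH command. The key file and host name below are placeholders; for Hive 0.11, use port 10004 on both sides of the -L mapping.

# Forward local port 10000 to the Hive JDBC port on the master node
ssh -i ~/mykeypair.pem -N \
    -L 10000:ec2-###-##-##-###.compute-1.amazonaws.com:10000 \
    hadoop@ec2-###-##-##-###.compute-1.amazonaws.com

The -L option mirrors the PuTTY Source port and Destination settings, and -N keeps the session open for port forwarding only; like the PuTTY tunnel, it remains active until you end the command.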
Use SQL Workbench to Connect to the Master Node Using the JDBC Driver
- Launch SQL Workbench from the directory where you installed it.
- In the Select Connection Profile dialog box, click Manage Drivers.
- In the upper-left corner of the window, click the Create a new entry (blank page) icon.
- In the Manage drivers dialog box, type EMR in the Name box.
- Click the Select the jar files… button to choose the driver files.
- Navigate to the directory you created earlier that contains the JDBC driver (for example, HiveJDBC\AmazonHiveJDBC\1.0.0.1000\Amazon_HiveJDBC3_1.0.0.1000).
- Select all of the files and click Open.
- In the Please select one driver dialog box, do one of the following:
- For Hive 0.13, choose com.amazon.hive.jdbc3.HS2Driver and click OK.
- For Hive 0.11, choose com.amazon.hive.jdbc3.HS1Driver and click OK.
- Click OK. When you are returned to the Select Connection Profile dialog box, do one of the following:
- For Hive 0.13, type the following in the URL field: jdbc:hive2://localhost:10000/default;AuthMech=2;UID=hadoop
- For Hive 0.11, type the following in the URL field: jdbc:hive://localhost:10004/default
- Verify the Driver field is set to EMR and click OK.
When you have successfully connected to the master node, you will see the connection details at the top of the window.
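If the connection does not succeed, it can help to verify the tunnel independently of SQL Workbench. The sketch below assumes a local Hive 0.13 installation, which includes the Beeline client; because Beeline ships with the Apache Hive JDBC driver rather than the Simba driver, the URL omits the AuthMech and UID properties and passes the user name with -n instead.

# Connect through the local end of the tunnel and run a trivial query
beeline -u jdbc:hive2://localhost:10000/default -n hadoop -e "SHOW TABLES;"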
Create a Hive Table and Query the Data
Next, we will create a Hive table using sample Amazon CloudFront access logs stored in Amazon S3.
Note: These sample logs are a few years old and Amazon CloudFront logs now include more fields than are included in the sample. If you are trying to query more recent Amazon CloudFront logs, you must use a different Create Table statement.
Create the CloudFront table using the following command (the backslashes in the regular expression are doubled because Hive unescapes string literals once before handing the pattern to the RegexSerDe):

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
    Date STRING,
    Time STRING,
    Location STRING,
    Bytes INT,
    RequestIP STRING,
    Method STRING,
    Host STRING,
    Uri STRING,
    Status INT,
    Referrer STRING,
    OS STRING,
    Browser STRING,
    BrowserVersion STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$"
)
LOCATION 's3://us-east-1.elasticmapreduce.samples/cloudfront/data/';
Test that the data has been successfully made available by running the following command:
SELECT * FROM cloudfront_logs LIMIT 10;
Now that the table is defined, you can run some useful queries, such as:
SELECT requestip, COUNT(*) AS count FROM cloudfront_logs WHERE date = '2014-08-05' GROUP BY requestip;
Run the query. The output lists each request IP address together with the number of requests it made on August 5, 2014.
Congratulations! You’re now able to use SQL Workbench to develop and run your Amazon EMR queries.
If you have questions or suggestions, please leave a comment below.