Visualize Big Data with Amazon QuickSight, Presto, and Apache Spark on Amazon EMR

February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more.

Last December, we introduced the Amazon Athena connector in Amazon QuickSight, in the Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight post.

The connector allows you to visualize your big data easily in Amazon S3 using Athena’s interactive query engine in a serverless fashion. This turned out to be a very popular combination, as customers benefit from the speed, agility, and cost savings that a serverless business intelligence (BI) and analytics architecture brings.

Today, we’re excited to announce two new native connectors in QuickSight for big data analytics: Presto and Spark. With the Presto and SparkSQL connectors in QuickSight, you can easily create interactive visualizations over large datasets using Amazon EMR.

EMR provides a simple and cost-effective way to run highly distributed processing frameworks such as Presto and Spark when compared to on-premises deployments. EMR also gives you the flexibility to define specific compute, memory, storage, and application parameters and to optimize them for your analytic requirements.

In this post, I walk you through connecting QuickSight to an EMR cluster running Presto. If you’d like a walkthrough with Spark, let us know in the comments section!

Presto overview

Presto is an open source, distributed SQL query engine for running interactive analytic queries against data sources ranging from gigabytes to petabytes. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can run on multiple data sources, including Amazon S3.

Presto’s execution framework is fundamentally different from that of Hive/MapReduce. Presto has a custom query and execution engine where the stages of execution are pipelined, similar to a directed acyclic graph (DAG), and all processing occurs in memory to reduce disk I/O. This pipelined execution model can run multiple stages in parallel and streams data from one stage to another as the data becomes available. This reduces end-to-end latency and makes Presto a great tool for ad hoc data exploration over large data sets.

Walkthrough

Use the following steps to connect QuickSight to an EMR cluster running Presto:

1. Create an EMR cluster with the latest 5.5.0 release.
2. Configure LDAP for user authentication in QuickSight.
3. Configure SSL using a QuickSight supported certificate authority (CA).
4. Create tables for Presto in the Hive metastore.
5. Whitelist the QuickSight IP address range in your EMR master security group rules.
6. Connect QuickSight to Presto and create some visualizations.

Prerequisites

You need to run Presto version 0.167 at a minimum, which is the first release that supports LDAP authentication. LDAP authentication is a requirement for the Presto and Spark connectors, and QuickSight refuses to connect if LDAP is not configured on your cluster.
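To check the version requirement before going further, a small comparison helper can be useful. The following is a minimal sketch and not part of the original setup: the function name version_at_least is hypothetical, it assumes a sort that supports the -V (version sort) option as GNU coreutils does, and the installed version shown (0.170) is an illustrative value that you would normally read from your cluster, for example from presto-cli --version if your CLI build supports that flag.

```shell
#!/bin/sh
# version_at_least HAVE NEED
# Succeeds (exit status 0) when version HAVE is greater than or equal to NEED.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
  have="$1"
  need="$2"
  # The lower of the two versions sorts first; HAVE is new enough
  # exactly when that lowest value is NEED.
  lowest="$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)"
  [ "$lowest" = "$need" ]
}

# 0.170 is a hypothetical installed version; 0.167 is the minimum stated above.
if version_at_least "0.170" "0.167"; then
  echo "Presto version is new enough for LDAP authentication"
else
  echo "Presto version is too old; 0.167 or later is required"
fi
```

Comparing with a version sort rather than a plain string comparison keeps the check correct if a version component gains a digit.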

Create an EMR cluster with release version 5.5.0

In the EMR console, use the Quick Create option to create a cluster. For this post, use most of the default settings with a few exceptions. To install both Presto and Spark on your cluster (and customize other settings), create your cluster from the Advanced Options wizard instead.

Make sure that EMR release 5.5.0 is selected and, under Applications, choose Presto. If you have an EC2 key pair, you can use it. Otherwise, create a key pair (.PEM file) and then return to this page to create the cluster.

Make sure that you configure your cluster’s security group inbound rules to allow SSH from your machine’s IP address range.

Configure LDAP for user authentication in QuickSight

After your cluster is in a running state, connect using SSH to your cluster to configure LDAP authentication.

To SSH into your EMR cluster, use the following commands in the terminal:

chmod 600 ~/YOUR_PEM_FILE.pem
ssh -i ~/YOUR_PEM_FILE.pem hadoop@YOUR_MASTER_PUBLIC_DNS_FROM_EMR_CLUSTER

After you log in, install OpenLDAP, configure it, and create users in the directory. For more about configuring LDAP, see Editing /etc/openldap/slapd.conf in the OpenLDAP documentation.

# Install LDAP Server
sudo yum install openldap openldap-servers openldap-clients
# Create the config files
sudo cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG
sudo cp /usr/share/openldap-servers/slapd.conf.obsolete /etc/openldap/slapd.conf
# Bounce LDAP
sudo service slapd restart

After LDAP is installed and restarted, set the LDAP root password. First, issue the following command and enter a root password for LDAP when prompted:

slappasswd

The command prints a hash of the password. Save the output, which looks like this:

{SSHA}DmD616c3yZyKndsccebZK/vmWiaQde83

Now, prepare the commands to set the password for the LDAP root. Make sure to replace the hash below with the one that you generated in the previous step:

cat > /tmp/config.ldif <<EOF
dn: olcDatabase={0}config,cn=config
changetype: modify
add: olcRootPW
olcRootPW: {SSHA}DmD616c3yZyKndsccebZK/vmWiaQde83

dn: olcDatabase={2}bdb,cn=config
changetype: modify
add: olcRootPW
olcRootPW: {SSHA}DmD616c3yZyKndsccebZK/vmWiaQde83
-
replace: olcRootDN
olcRootDN: cn=dev,dc=example,dc=com
-
replace: olcSuffix
olcSuffix: dc=example,dc=com
EOF

Run the following command to execute the above commands against LDAP:

sudo ldapadd -Y EXTERNAL -H ldapi:/// -f /tmp/config.ldif
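Because the same hash appears twice in /tmp/config.ldif, pasting it by hand is easy to get wrong. As a hedged alternative for producing that file, the sketch below renders the LDIF from a single argument. The function name render_root_ldif is hypothetical, and the commented usage assumes that slappasswd writes only the hash to standard output while prompting on the terminal.

```shell
#!/bin/sh
# render_root_ldif HASH
# Prints the LDIF from this walkthrough that sets the root password,
# root DN, and suffix, substituting HASH in both olcRootPW lines.
render_root_ldif() {
  hash="$1"
  cat <<EOF
dn: olcDatabase={0}config,cn=config
changetype: modify
add: olcRootPW
olcRootPW: ${hash}

dn: olcDatabase={2}bdb,cn=config
changetype: modify
add: olcRootPW
olcRootPW: ${hash}
-
replace: olcRootDN
olcRootDN: cn=dev,dc=example,dc=com
-
replace: olcSuffix
olcSuffix: dc=example,dc=com
EOF
}

# Usage (commented out because it needs slappasswd and a running slapd):
#   HASH="$(slappasswd)"                       # prompts for the new root password
#   render_root_ldif "$HASH" > /tmp/config.ldif
```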

Next, create a user account with password in the LDAP directory with the following commands. When prompted for a password, use the LDAP root password that you created in the previous step.

cat > /tmp/accounts.ldif <<EOF
dn: dc=example,dc=com
objectclass: domain
objectclass: top
dc: example

dn: ou=dev,dc=example,dc=com
objectclass: organizationalUnit
ou: dev
description: Container for developer entries

dn: uid=<REPLACE_WITH_YOUR_USER_NAME>,ou=dev,dc=example,dc=com
uid: <REPLACE_WITH_YOUR_USER_NAME>
objectClass: inetOrgPerson
userPassword: <REPLACE_WITH_STRONG_PASSWORD>
sn: <REPLACE_WITH_SURNAME>
cn: dev
EOF

ldapadd -D "cn=dev,dc=example,dc=com" -W -f /tmp/accounts.ldif
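Before moving on, it can help to confirm that the new user can actually bind, because Presto later builds the same distinguished name from its user bind pattern. The following is a minimal sketch: the helper name user_dn and the user name alice are hypothetical, and the commented ldapwhoami call assumes the OpenLDAP client tools installed earlier and a server listening on localhost.

```shell
#!/bin/sh
# user_dn USERNAME
# Expands the pattern uid=<user>,ou=dev,dc=example,dc=com for one user,
# matching the directory layout created in this walkthrough.
user_dn() {
  printf 'uid=%s,ou=dev,dc=example,dc=com\n' "$1"
}

# "alice" is a hypothetical user name used only for illustration.
user_dn "alice"

# To test a simple bind as that user (prompts for the user's password):
#   ldapwhoami -x -H ldap://localhost -D "$(user_dn alice)" -W
```

If the bind succeeds, ldapwhoami reports the authenticated identity; a failure here usually points to a typo in the LDIF or the password.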

You now have OpenLDAP configured on your EMR cluster running Presto, along with a user that you use later to authenticate when connecting to Presto.

Configure SSL using a QuickSight supported certificate authority

To ensure that any communication between QuickSight and Presto is secured, QuickSight requires that the connection be established with SSL enabled. You need to obtain a certificate from a certificate authority (CA) that QuickSight trusts. You can find the full list of public CAs accepted by QuickSight in the Network and Database Configuration Requirements topic.

To set up SSL on LDAP and Presto, obtain the following three SSL certificate files from your CA and store them in the /home/hadoop/ directory.

Certificate key file
Certificate file
CA certificate

Configure the keys in LDAP with the following commands:

cat > /tmp/ca.ldif <<EOF
dn: cn=config
changetype: modify
replace: olcTLSCertificateKeyFile
olcTLSCertificateKeyFile: /home/hadoop/certificateKey.pem
-
replace: olcTLSCertificateFile
olcTLSCertificateFile: /home/hadoop/certificate.pem
-
replace: olcTLSCACertificateFile
olcTLSCACertificateFile: /home/hadoop/cacertificate.pem
EOF

sudo ldapmodify -Y EXTERNAL -H ldapi:/// -f /tmp/ca.ldif

Now, enable SSL in LDAP by editing the /etc/sysconfig/ldap file to set SLAPD_LDAPS=yes, and then restart LDAP:

sudo vi /etc/sysconfig/ldap

SLAPD_LDAPS=yes

sudo service slapd restart

Use the following commands, run from the /home/hadoop directory, to generate a keystore. You are prompted to provide a password for the keystore.

openssl pkcs12 -inkey certificateKey.pem -in certificate.pem -export -out server-key.p12

keytool -importkeystore -srckeystore server-key.p12 -srcstoretype PKCS12 -destkeystore server.keystore
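A keystore built from a certificate and a private key that do not belong together fails only later, when a client tries to connect. As a hedged sketch of a check to run on the certificate files first, the helper below compares the public key embedded in the certificate with the one derived from the private key. The function name cert_matches_key is hypothetical, and the check assumes an unencrypted PEM key and an openssl binary that provides the x509 and pkey subcommands.

```shell
#!/bin/sh
# cert_matches_key CERT_FILE KEY_FILE
# Succeeds when the public key in the certificate equals the public key
# derived from the private key. Works for any key type openssl can read.
cert_matches_key() {
  cert_pub="$(openssl x509 -in "$1" -noout -pubkey)" || return 1
  key_pub="$(openssl pkey -in "$2" -pubout)" || return 1
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# Usage with the file names from this walkthrough (run in /home/hadoop):
#   cert_matches_key certificate.pem certificateKey.pem && echo "certificate and key match"
```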

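Next, update the Presto configuration on the EMR master node. Replace the placeholders in angle brackets with your own values before running the commands.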

SERVERNAME=<PUBLIC_DNS_NAME_OF_EMR_CLUSTER>
cd /etc/presto/conf

# Enable LDAPS auth for Presto
echo http-server.authentication.type=LDAP | sudo tee -a config.properties
echo authentication.ldap.url=ldaps://${SERVERNAME}:636 | sudo tee -a config.properties
echo authentication.ldap.user-bind-pattern=uid=\${USER},OU=dev,DC=example,DC=com | sudo tee -a config.properties

# Enable SSL for the Presto server
echo http-server.https.enabled=true | sudo tee -a config.properties
echo http-server.https.port=<PORT_NUMBER> | sudo tee -a config.properties
echo http-server.https.keystore.path=/home/hadoop/server.keystore | sudo tee -a config.properties
echo http-server.https.keystore.key=<KEYSTORE_PASSWORD> | sudo tee -a config.properties

# Bounce Presto to pick up the new config
sudo pkill presto
# Wait until Presto is back up
until pgrep presto; do echo -n .; sleep 1; done
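One caveat with appending through tee -a is that rerunning the commands leaves duplicate keys in config.properties. The following is a minimal sketch of an idempotent alternative, not the method used above: the helper name set_property is hypothetical, and it is demonstrated on a scratch file. To apply it to /etc/presto/conf/config.properties you would need to run it with sufficient privileges, for example from a root shell.

```shell
#!/bin/sh
# set_property FILE KEY VALUE
# Replaces any existing "KEY=..." line in FILE, or appends one if absent,
# so that running it twice never leaves duplicate keys.
set_property() {
  file="$1"
  key="$2"
  value="$3"
  tmp="$(mktemp)"
  if [ -f "$file" ]; then
    # Keep every line that does not start with "KEY=". index() does a plain
    # string match, so the dots in property names are not treated as regex.
    awk -v prefix="${key}=" 'index($0, prefix) != 1' "$file" > "$tmp"
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demonstration on a scratch file rather than the real Presto config.
demo="$(mktemp)"
set_property "$demo" http-server.https.enabled false
set_property "$demo" http-server.https.enabled true
cat "$demo"    # prints a single line: http-server.https.enabled=true
rm -f "$demo"
```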

Create tables for Presto in the Hive metastore

Now that you have a running EMR cluster with Presto and LDAP set up, you can make some sample data available to the cluster for analysis. Use the same CloudFront log sample data set that is available for Athena. The following Hive statement creates an external table in the Hive metastore that points to the sample data set in Amazon S3:

# Run hive
hive

# Create table and load data
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  OS STRING,
  Browser STRING,
  BrowserVersion STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://athena-examples/cloudfront/plaintext/';

# Quit hive
quit;

Try to query the data using the Presto CLI with the following commands:

# Run the Presto CLI
presto-cli --server https://<PUBLIC_DNS_NAME_OF_EMR_CLUSTER>:<PORT_NUMBER> --user <USERNAME> --password --catalog hive

# Issue a query to Presto
SELECT * FROM cloudfront_logs LIMIT 10;

Presto should return up to 10 rows from the cloudfront_logs table.

Whitelist the QuickSight IP address range in your EMR master security group rules

Now you’re ready to connect QuickSight to Presto. For QuickSight to connect to Presto, you need to make sure that Presto is reachable by QuickSight’s public endpoints by adding QuickSight’s IP address ranges to your EMR master node security group.
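If you prefer the AWS CLI to the console for this step, the rule can be added with the ec2 authorize-security-group-ingress command. The sketch below only prints the command so that it can be reviewed first. The helper name ingress_command is hypothetical, and every value in the example call is a placeholder: the security group ID and port are made up, and 203.0.113.0/24 is a documentation-only range, not a QuickSight range. Look up the actual QuickSight IP address range for your Region in the QuickSight documentation.

```shell
#!/bin/sh
# ingress_command GROUP_ID PORT CIDR
# Prints (does not run) the AWS CLI call that opens one TCP port on a
# security group to one CIDR block.
ingress_command() {
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --cidr %s\n' "$1" "$2" "$3"
}

# Placeholder values only: replace the group ID with your EMR master security
# group, the port with your Presto HTTPS port, and the CIDR with the
# QuickSight range for your Region.
ingress_command "sg-0123456789abcdef0" "8446" "203.0.113.0/24"
```

Opening only the Presto HTTPS port, and only to the QuickSight range, keeps the master node's exposure as small as the connection requires.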

Connect QuickSight to Presto and create some visualizations

If you have not already signed up for QuickSight, you can do so at https://quicksight.aws. QuickSight offers a 1 user and 1 GB perpetual free tier.

After you’re signed up for QuickSight, navigate to the New Analysis page and then the New Data Set page. The new Presto and Spark connectors appear among the available data sources.

Open the Presto connector, provide the connection details in the modal window, and choose Create data source.

Select the default schema and choose the cloudfront_logs table that you just created.

In QuickSight, you can choose between importing the data into SPICE for analysis or directly querying your data in Presto. SPICE is an in-memory optimized columnar engine in QuickSight that enables fast, interactive visualization as you explore your data. For this post, choose to import the data into SPICE and choose Visualize.

In the analysis view, you can see the notification that shows import is complete with 4996 rows imported. On the left, you see the list of fields available in the data set and below, the various types of visualizations from which you can choose.

QuickSight makes it easy for you to create visualizations and analyze data with AutoGraph, a feature that automatically selects the best visualization for you based on selected fields.

To create a visualization, select the fields on the left panel. In this case, look at the number of connections to CloudFront by OS type by selecting the OS field. Additionally, you can select the Bytes field to look at total bytes transferred by OS instead of the count.
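The same breakdown can be cross-checked directly in Presto, which is a useful sanity check on what the visualization shows. The following is a hedged sketch: the helper name write_os_query is hypothetical, the query assumes the cloudfront_logs table created earlier, and the commented invocation assumes that your presto-cli build supports the --file option.

```shell
#!/bin/sh
# write_os_query FILE
# Writes a query to FILE that counts connections and sums bytes per OS,
# mirroring the OS and Bytes fields selected in QuickSight.
write_os_query() {
  cat > "$1" <<'EOF'
SELECT os,
       COUNT(*)   AS connections,
       SUM(bytes) AS total_bytes
FROM cloudfront_logs
GROUP BY os
ORDER BY connections DESC;
EOF
}

sql_file="$(mktemp)"
write_os_query "$sql_file"
cat "$sql_file"

# To run it against the cluster (prompts for the LDAP password):
#   presto-cli --server https://<PUBLIC_DNS_NAME_OF_EMR_CLUSTER>:<PORT_NUMBER> \
#     --user <USERNAME> --password --catalog hive --file "$sql_file"
```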

Summary

You just finished creating an EMR cluster, setting up Presto and LDAP with SSL, and using QuickSight to visualize your data. I hope this post was helpful. Feel free to reach out if you have any questions or suggestions.

Learn more

To learn more about these capabilities and start using them in your dashboards, check out the QuickSight User Guide.

Stay engaged

If you have questions and suggestions, you can post them on the QuickSight forum.

Not a QuickSight user

Go to the QuickSight website to get started for FREE.

AWS Big Data Blog