AWS Big Data Blog
Simplify authentication with native LDAP integration on Amazon EMR
Many companies have corporate identities stored inside identity providers (IdPs) like Active Directory (AD) or OpenLDAP. Previously, customers using Amazon EMR could integrate their clusters with Active Directory by configuring a one-way realm trust between their AD domain and the EMR cluster Kerberos realm. For more details, refer to Tutorial: Configure a cross-realm trust with an Active Directory domain.
This setup has been a key enabler to make corporate users and groups available inside EMR clusters and define access control policies to control their data access (for example, through the Amazon EMR native Apache Ranger integration).
Although this option is still available, Amazon EMR has released support for native LDAP authentication, a new security feature that simplifies the integration with OpenLDAP and Active Directory.
This feature enables the following:
- automatic configuration of security for the supported applications (HiveServer2, Trino, Presto and Livy) to use the Kerberos protocol under the hood and LDAP as external authentication. This simplifies integration with external tools: to connect to cluster endpoints, they no longer need to set up Kerberos authentication and can instead be configured with an LDAP user name and password
- fine-grained access control (FGAC) over who can access your EMR clusters through SSH
- fine-grained authorization policies on top of Hive Metastore database and tables if used in combination with the native Amazon EMR Apache Ranger integration.
In this post, we dive deep into the Amazon EMR LDAP authentication, showing how the authentication flow works, how to retrieve and test the needed LDAP configurations, and how to confirm an EMR cluster is properly LDAP integrated.
Using the information in this post:
- Teams managing EMR clusters can coordinate more effectively with their LDAP IdP administrators to request the right information and run pre-configuration tests
- EMR cluster end-users can understand how straightforward it is to connect from external tools to LDAP-enabled EMR clusters compared to the previous Kerberos-based authentication
How Amazon EMR LDAP integration works
When talking about authentication in the context of EMR frameworks, we can distinguish between two levels:
- External authentication – Used by users and external components to interact with the installed frameworks
- Internal authentication – Used within the frameworks to authenticate the communications of internal components
With this new feature, internal framework authentication is still managed through Kerberos, but this is transparent to end-users and external services, which instead authenticate with a user name and password.
The supported EMR installed frameworks implement an LDAP-based authentication method that, given a set of user name and password credentials, validates them against the LDAP endpoint and, in the case of success, enables the use of the framework.
The following diagram summarizes how the authentication flow works.
The workflow includes the following steps:
- A user connects with one of the supported endpoints (such as HiveServer2, Trino/Presto Coordinator, or Hue WebUI) and provides their corporate credentials (user name and password).
- The contacted framework uses a custom authenticator that performs the authentication using the EMR Secret Agent service running inside the cluster instances.
- The EMR Secret Agent service validates the provided credentials against the LDAP endpoint.
- In the case of success, the following occurs:
- A Kerberos principal is created for the specific user on the cluster MIT key distribution center (MIT KDC) running inside the primary node.
- The Kerberos principal keytab is created inside the home directory of the user on the primary node.
After the authentication is complete, the user can start using the framework.
Inside all the cluster instances, the SSSD service is configured to retrieve users and groups from the LDAP endpoint and make them available as system users.
The authentication flow when connecting with SSH is a bit different, and is summarized in the following diagram.
The workflow includes the following steps:
- A user connects with SSH to the EMR primary instance, providing the corporate credentials (user name and password).
- The contacted SSHD service uses the SSSD service to validate the provided credentials.
- The SSSD service validates the provided credentials against the LDAP endpoint. In the case of success, the user lands on the related home directory. At this point, the user can use the different CLIs (beeline, trino-cli, presto-cli, curl) to access Hive, Trino/Presto, or Livy.
- To use the Spark CLIs (spark-submit, pyspark, spark-shell), the user has to invoke the ldap-kinit script and provide the requested user name and password.
- The authentication is performed using the EMR Secret Agent service running inside the cluster instances.
- The EMR Secret Agent service validates the provided credentials against the LDAP endpoint.
- In the case of success, the following occurs:
- A Kerberos principal is created for the specific user on the cluster MIT KDC running inside the primary node.
- The Kerberos principal keytab is created inside the home directory of the user on the primary node.
- A Kerberos ticket is obtained and stored in the user's Kerberos ticket cache on the primary node.
After the ldap-kinit script completes, the user can start using the Spark CLIs.
In the following sections, we show how to retrieve the required LDAP setting values and investigate how to launch a cluster with EMR LDAP authentication and test it.
Find the proper LDAP parameters
To configure LDAP authentication for Amazon EMR, the first step is to retrieve the LDAP properties to be used to set up your cluster. You need the following information:
- The LDAP server DNS name
- A certificate in PEM format to be used to interact over Secure LDAP (LDAPS) with the LDAP endpoint
- The LDAP user search base, which is a path (or branch) on the LDAP tree from where to search users (only users belonging to this branch will be retrieved)
- The LDAP groups search base, which is a path (or branch) on the LDAP tree from where to search groups (only groups belonging to this branch will be retrieved)
- The LDAP server bind user credentials, which are the user name and password for a service user (usually called a bind user) to be used to trigger LDAP queries and retrieve user information such as user name and group membership.
With Active Directory, an AD admin can retrieve this information directly from the Active Directory Users and Computers tool. When you choose a user in this tool, you can see the related attributes (for example, distinguishedName). The following screenshot shows an example.
From the screenshot, we can see that the distinguishedName for the user john is CN=john,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com, which means that john belongs to the following search bases, ordered from the narrowest to the widest:
OU=users,OU=italy,OU=emr,DC=awsemr,DC=com
OU=italy,OU=emr,DC=awsemr,DC=com
OU=emr,DC=awsemr,DC=com
DC=awsemr,DC=com
Depending on the number of entries inside a company LDAP directory, using a wide search base may lead to long retrieval times and timeouts. It’s a good practice to configure the search base to be as narrow as possible while still including all the needed users. In the preceding example, OU=users,OU=italy,OU=emr,DC=awsemr,DC=com may be a good search base if all the users you want to provide access to the EMR cluster are part of that Organizational Unit.
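To make the relationship between a distinguished name and its search bases concrete, the following is a minimal Python sketch that derives the candidate bases for an entry, narrowest first. The function name candidate_search_bases is hypothetical, and the sketch assumes that components are separated by commas and that a literal comma inside a value is escaped with a backslash; other escape forms are not handled.

```python
import re
from typing import List


def candidate_search_bases(distinguished_name: str) -> List[str]:
    """Return the search bases that contain an entry, narrowest first.

    Assumption: components are comma-separated and a literal comma inside
    a value is escaped with a backslash. Other escape forms are ignored.
    """
    # Split on commas that are not preceded by a backslash.
    rdns = re.split(r"(?<!\\),", distinguished_name)
    bases = []
    # Skip the entry's own component (index 0), then peel off one
    # component at a time until the domain components (DC=...) begin.
    for i in range(1, len(rdns)):
        bases.append(",".join(rdns[i:]))
        if rdns[i].strip().upper().startswith("DC="):
            break  # the widest base is the directory root
    return bases


for base in candidate_search_bases(
    "CN=john,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com"
):
    print(base)
```

For the user john, this prints the same four search bases listed above, from OU=users,OU=italy,OU=emr,DC=awsemr,DC=com down to DC=awsemr,DC=com.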
Another way to retrieve user attributes is by using the ldapsearch tool. You can use this method for Active Directory as well as OpenLDAP, and it’s extremely useful to test the connectivity with the LDAP endpoint.
The following is an example with Active Directory (OpenLDAP is similar).
The LDAP endpoint should be resolvable and reachable by Amazon Elastic Compute Cloud (Amazon EC2) EMR cluster instances via TCP on port 636. It’s suggested to run the test from an Amazon Linux 2 EC2 instance belonging to the same subnet as the EMR cluster and having the same EMR security group associated as the EMR cluster instances.
After you launch an EC2 instance, install the nc tool and test the DNS resolution and connectivity. Assuming that DC1.awsemr.com is the DNS name for the LDAP endpoint, run the following commands:
If the DNS resolution isn’t working properly, you should receive an error like the following:
If the endpoint is not reachable, you should receive an error like the following:
In either of these cases, the networking and DNS teams should be involved in order to troubleshoot and solve the issues.
In case of success, the output should look like the following:
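The DNS resolution and reachability checks described above can also be sketched in Python with only the standard library. This is an illustrative alternative to the nc tool, not the commands used in this post. The function name check_ldaps_endpoint and its return values are hypothetical.

```python
import socket


def check_ldaps_endpoint(host: str, port: int = 636, timeout: float = 5.0) -> str:
    """Return "ok", "dns_error", or "unreachable" for the given endpoint."""
    # Step 1: DNS resolution. A failure here points to a DNS issue.
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_error"
    # Step 2: TCP connectivity. A refusal or timeout here points to a
    # routing, security group, or firewall issue.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "unreachable"
```

Calling check_ldaps_endpoint("DC1.awsemr.com") from an instance in the same subnet and security group as the EMR cluster distinguishes the two failure modes described above. Note that this only verifies that the TCP port accepts connections; it does not validate the LDAPS certificate.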
If everything works, proceed with the testing and install the openldap clients as follows:
Then run ldapsearch commands to retrieve information about users and groups from the LDAP endpoint. The following are sample ldapsearch commands:
We use the following parameters:
- -x – This enables simple authentication.
- -D – This indicates the user to perform the search.
- -w – This indicates the user password.
- -H – This indicates the URL of the LDAP server.
- -b – This is the search base.
- LDAPTLS_CACERT – This environment variable indicates the LDAPS endpoint SSL PEM public certificate or the LDAPS endpoint root certificate authority SSL PEM public certificate. This can be obtained from an AD or OpenLDAP admin user.
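As an illustration of how these parameters fit together, the following Python sketch assembles an ldapsearch invocation from them without running it. The helper name build_ldapsearch is hypothetical, and so are the bind user DN, password, filter, and certificate path in the usage note; replace them with the values provided by your LDAP administrator.

```python
import os
from typing import Dict, List, Tuple


def build_ldapsearch(
    bind_dn: str,
    bind_password: str,
    server_url: str,
    search_base: str,
    ldap_filter: str,
    ca_cert_path: str,
) -> Tuple[List[str], Dict[str, str]]:
    """Return the argument list and environment for an ldapsearch call."""
    argv = [
        "ldapsearch",
        "-x",                 # simple authentication
        "-D", bind_dn,        # user performing the search
        "-w", bind_password,  # password of that user
        "-H", server_url,     # URL of the LDAP server
        "-b", search_base,    # search base
        ldap_filter,          # search filter (positional argument)
    ]
    # The CA certificate is passed through the environment, not a flag.
    env = dict(os.environ, LDAPTLS_CACERT=ca_cert_path)
    return argv, env
```

With hypothetical values, a call could look like build_ldapsearch("CN=binduser,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com", "example-password", "ldaps://DC1.awsemr.com", "DC=awsemr,DC=com", "(sAMAccountName=john)", "/path/to/ldaps_cert.pem"), and the result can be passed to subprocess.run(argv, env=env). Passing the password with -w makes it visible in the process list, so this form is best kept to short-lived tests.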
The following is a sample output of the preceding command:
As we can see from the sample output, the user john is identified by the distinguished name CN=john,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com, and the data-engineers group to which the user belongs (memberOf value) is identified by the distinguished name CN=data-engineers,OU=groups,OU=italy,OU=emr,DC=awsemr,DC=com.
We can run our ldapsearch queries to retrieve the user and group information using a narrowed search base:
You can also apply other filters while searching. For more information about how to create LDAP filters, refer to LDAP Filters.
By running ldapsearch commands, you can test the LDAP connectivity and LDAP properties, and determine the needed setup.
Test the solution
After you have verified that the connectivity to the LDAP endpoint is open and the LDAP configurations are correct, proceed with setting up the environment to launch an EMR LDAP-enabled cluster.
Create AWS Secrets Manager secrets
Before you create the EMR security configuration, you need to create two AWS Secrets Manager secrets. You use these credentials to interact with the LDAP endpoint and retrieve user details such as user name and group membership.
- On the Secrets Manager console, choose Secrets in the navigation pane.
- Choose Store a new secret.
- For Secret type, select Other type of secret.
- Create a new secret specifying the binduser distinguished name as the key and the binduser password as the value.
- Create a second secret specifying in plaintext the LDAPS endpoint SSL public certificate or the LDAPS root certificate authority public certificate.
This certificate is trusted, allowing a secure communication between the EMR cluster and the LDAPS endpoint.
Create the EMR security configuration
Complete the following steps to create the EMR security configuration:
- On the Amazon EMR console, choose Security configurations under EMR on EC2 in the navigation pane.
- Choose Create.
- For Security configuration name, enter a name.
- For Security configuration setup options, select Choose custom settings.
- For Encryption, select Turn on in-transit encryption.
- For Certificate provider type, select PEM.
- For Choose PEM certificate location, enter either a PEM bundle located in Amazon Simple Storage Service (Amazon S3) or a Java custom certificate provider.
Note that in-transit encryption is mandatory in order to use the LDAP authentication feature. For more information about in-transit encryption, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.
- Choose Next.
- Select LDAP for Authentication protocol.
- For LDAP server location, enter the LDAPS endpoint (ldaps://<ldap_endpoint_DNS_name>).
- For LDAP SSL certificate, enter the second secret you created in Secrets Manager.
- For LDAP access filter, enter an LDAP filter that is applied in order to restrict access to a subset of users retrieved from the LDAP user search base. If the field is left empty, no filters are applied and all users belonging to the LDAP user search base can access the EMR LDAP-protected endpoints with their corporate credentials. The following are example filters and their functions:
- (objectClass=person) – Filter users with the attribute objectClass set as person
- (memberOf=CN=admins,OU=groups,OU=italy,OU=emr,DC=awsemr,DC=com) – Filter users belonging to the admins group
- (|(memberof=CN=data-engineers,OU=groups,OU=italy,OU=emr,DC=awsemr,DC=com)(memberof=CN=admins,OU=groups,OU=italy,OU=emr,DC=awsemr,DC=com)) – Filter users belonging either to the data-engineers or the admins group (which we use for this post)
- Enter values for LDAP user search base and LDAP group search base. Note that the two search bases do not support inline filters (for example, the following is not supported: OU=users,OU=italy,OU=emr,DC=awsemr,DC=com?subtree?(|(memberof=CN=data-engineers,OU=groups,OU=italy,OU=emr,DC=awsemr,DC=com)(memberof=CN=admins,OU=groups,OU=italy,OU=emr,DC=awsemr,DC=com))).
- Select Turn on SSH login. This is needed only if you want your LDAP users to be able to SSH inside cluster instances with their corporate credentials. If SSH login is enabled, the LDAP access filter is needed; otherwise, SSH authentication will fail.
- For LDAP server bind credentials, enter the first secret you created in Secrets Manager.
- In the Authorization section, keep the defaults selected:
- For IAM role for applications, select Instance profile.
- For Fine-grained access control method, select None.
- Choose Next.
- Review the configuration summary and choose Create.
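For the LDAP access filter field described in the steps above, group-based filters like the ones shown can be assembled programmatically. The following Python sketch is one way to do it; the function names escape_filter_value and member_of_any are hypothetical. It escapes the characters that RFC 4515 reserves inside filter values and combines one memberOf clause per group with a logical OR.

```python
from typing import Iterable

# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\x00": r"\00",
}


def escape_filter_value(value: str) -> str:
    """Escape a string so it can be used as a value in an LDAP filter."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def member_of_any(group_dns: Iterable[str]) -> str:
    """Build a filter matching users that belong to any of the groups."""
    clauses = [
        "(memberOf={})".format(escape_filter_value(dn)) for dn in group_dns
    ]
    if not clauses:
        raise ValueError("at least one group DN is required")
    if len(clauses) == 1:
        return clauses[0]
    # (|(a)(b)...) is the LDAP syntax for a logical OR of clauses.
    return "(|{})".format("".join(clauses))


print(member_of_any([
    "CN=data-engineers,OU=groups,OU=italy,OU=emr,DC=awsemr,DC=com",
    "CN=admins,OU=groups,OU=italy,OU=emr,DC=awsemr,DC=com",
]))
```

The printed filter is equivalent to the third example filter above; LDAP attribute names are case-insensitive, so memberOf and memberof behave the same.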
Launch the EMR cluster
You can launch the EMR cluster using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or any AWS SDK.
When you’re creating the EMR on EC2 cluster, be sure to specify the following configurations:
- EMR version – Use Amazon EMR 6.12.0 or above.
- Applications – Select Hadoop, Spark, Hive, Hue, Livy and Presto/Trino.
- Security configuration – Specify the security configuration you created in the previous step.
- EC2 key pair – Use an existing key pair.
- Network and security groups – Use a configuration that allows the EMR EC2 instances to interact with the LDAPS endpoint. In the Find the proper LDAP parameters section, you should have confirmed a valid setup.
Confirm the LDAP authentication is working
When the cluster is up and running, you can check that the LDAP authentication is working properly.
If SSH login was enabled as part of LDAP authentication inside the EMR security configuration, you can SSH into your cluster by specifying an LDAP user and entering the related password when prompted:
If SSH login was disabled, you can SSH inside the cluster by using the EC2 key pair specified during cluster creation:
An alternative way to access the primary instance, if you prefer, is to use Session Manager, a capability of AWS Systems Manager. For more information, refer to Connect to your Linux instance with AWS Systems Manager Session Manager.
When you’re inside the primary instance, you can test that the LDAP users and groups are properly retrieved by using the id command. The following is a sample command to check if the user john is properly retrieved with the related groups:
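The same lookup can be sketched in Python with the standard pwd and grp modules, which resolve users through the operating system's name service and therefore also see users exposed as system users by SSSD. The function name lookup_user is hypothetical, and the sketch runs only on Unix-like systems.

```python
import grp
import os
import pwd
from typing import List, Optional, Tuple


def lookup_user(username: str) -> Optional[Tuple[int, List[str]]]:
    """Return (uid, group names) for a system user, or None if unknown."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None  # the user is not visible to the operating system
    names = []
    # getgrouplist returns the primary and supplementary group IDs.
    for gid in os.getgrouplist(username, entry.pw_gid):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            names.append(str(gid))  # group ID without a name entry
    return entry.pw_uid, names
```

On the primary instance, lookup_user("john") returning None would suggest that the user falls outside the configured LDAP user search base or that SSSD cannot reach the LDAP endpoint.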
You can then test authentication on the different installed frameworks.
First, let’s retrieve the frameworks’ public certificate and store it inside a truststore. All the frameworks share the same public certificate (the one we used to set up in-transit encryption), so you can use any of the SSL protected endpoints (Hive port 10000, Presto/Trino port 8446, Livy port 8998) to retrieve it. Take the certificate from the HiveServer2 endpoint (port 10000):
Then use this truststore to securely communicate with the different frameworks.
Use the following code to test HiveServer2 authentication with beeline:
If using Presto, test Presto authentication with the presto CLI (provide the user password when requested):
If using Trino, test Trino authentication with the trino CLI (provide the user password when requested):
Test Livy authentication with curl:
Test Spark commands with pyspark:
Note that here we tested the authentication from within the cluster, but we can interact with Trino, Hive, Presto and Livy even from outside the cluster as long as connectivity and DNS resolution are properly configured. The Spark CLIs are the only ones that can be used only from inside the cluster.
To test Hue authentication, complete the following steps:
- Navigate to the Hue web UI hosted on http://<emr_primary_node>:8888/ and provide an LDAP user name and password.
- Test SQL queries inside the Hive and Trino/Presto editors.
To test with an external SQL tool (such as DBeaver connecting to Trino), complete the following steps. Be sure to configure the EMR primary node security group so that it allows TCP traffic from the DBeaver IP to the desired framework endpoint port (for example, 10000 for HiveServer2, 8446 for Trino/Presto), and configure DNS resolution on the DBeaver client machine so that it resolves the EMR primary node hostname.
- From your EMR cluster primary instance, copy to an S3 bucket the files truststore.jks (previously created) and /usr/lib/trino/trino-jdbc/trino-jdbc-XXX-amzn-0.jar (change the version XXX depending on the EMR version).
- Download on your DBeaver client machine the truststore.jks and trino-jdbc-XXX-amzn-0.jar files.
files. - Open DBeaver and choose Database, then choose Driver Manager.
- Choose New to create a new driver.
- On the Settings tab, provide the following information:
- For Driver Name, enter EMR Trino.
- For Class Name, enter io.trino.jdbc.TrinoDriver.
- For URL Template, enter jdbc:trino://{host}:{port}.
- On the Libraries tab, complete the following steps:
- Choose Add File.
- Choose the Trino JDBC driver JAR file from the local file system (trino-jdbc-XXX-amzn-0.jar).
- Choose OK to create the driver.
- Choose Database and New Database Connection.
- On the Main tab, specify the following:
- For Connect by, select Host.
- For Host, enter the EMR primary node.
- For Port, enter the Trino port (8446 by default).
- On the Driver properties tab, add the following properties:
- Add SSL with True as the value.
- Add SSLTrustStorePath with the truststore.jks file location as the value.
- Add SSLTrustStorePassword with the truststore.jks password that you used to create it as the value.
- Choose Finish.
- Choose the created connection and choose the Connect icon.
- Enter your LDAP user name and password, then choose OK.
If everything is working, you should be able to browse the Trino catalogs, databases, and tables in the navigation pane. To run queries, choose SQL Editor, then choose Open SQL Editor.
From the SQL Editor, you can query your tables.
Next steps
The new Amazon EMR LDAP authentication feature simplifies the way users can gain access to EMR installed frameworks. When users are using a framework, you may want to govern the data they can access. For this specific topic, you can use LDAP authentication in combination with the native EMR Apache Ranger integration. For more information, refer to Integrate Amazon EMR with Apache Ranger.
Clean up
Complete the following cleanup actions to remove the resources you created following this post and avoid incurring additional costs. For this post, we clean up using the AWS CLI. You can also clean up using similar actions via the console.
- If you launched an EC2 instance to check the LDAP connectivity and don’t need it anymore, delete it with the following command (specify your instance ID):
- If you launched an EC2 instance to test DBeaver and don’t need it anymore, you can use the preceding command to delete it.
- Delete the EMR cluster with the following command (specify your EMR cluster ID):
Note that if the EMR cluster has Termination Protection enabled, before you run the preceding terminate-clusters command, you have to disable it. You can do so with the following command (specify your EMR cluster ID):
- Delete the EMR security configuration with the following command:
- Delete the Secrets Manager secrets with the following commands:
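The cleanup steps above can be summarized as a sequence of AWS CLI invocations. The following Python sketch only assembles the argument lists and does not run them, so you can review them first. The helper name cleanup_commands is hypothetical, and every identifier passed to it in the usage note is a placeholder.

```python
from typing import List


def cleanup_commands(
    instance_id: str,
    cluster_id: str,
    security_configuration_name: str,
    secret_ids: List[str],
) -> List[List[str]]:
    """Return the AWS CLI argument lists for the cleanup steps, in order."""
    commands = [
        # Delete the EC2 instance used for the connectivity tests.
        ["aws", "ec2", "terminate-instances", "--instance-ids", instance_id],
        # Disable Termination Protection before terminating the cluster.
        ["aws", "emr", "modify-cluster-attributes",
         "--cluster-id", cluster_id, "--no-termination-protected"],
        ["aws", "emr", "terminate-clusters", "--cluster-ids", cluster_id],
        ["aws", "emr", "delete-security-configuration",
         "--name", security_configuration_name],
    ]
    # One delete-secret call per Secrets Manager secret.
    for secret_id in secret_ids:
        commands.append(
            ["aws", "secretsmanager", "delete-secret", "--secret-id", secret_id]
        )
    return commands
```

Each returned list can be passed to subprocess.run with check=True once you have confirmed the identifiers, for example the placeholders "<instance_id>", "<cluster_id>", "<security_configuration_name>", and the names of the two secrets you created.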
Conclusion
In this post, we discussed how you can configure and test LDAP authentication on EMR on EC2 clusters. We discussed how to retrieve the needed LDAP settings, test connectivity with the LDAP endpoint, configure your EMR security configuration, and test that the LDAP authentication is properly working. This post also highlighted how the authentication flow is simplified compared to the standard Active Directory cross-realm trust configuration. To learn more about this feature, refer to Use Active Directory or LDAP servers for authentication with Amazon EMR.
About the Authors
Stefano Sandona is a Senior Big Data Solution Architect at AWS. He loves data, distributed systems and security. He helps customers around the world architect secure, scalable and reliable big data platforms.
Adnan Hemani is a Software Development Engineer at AWS working with the EMR team. He focuses on the security posture of applications running on EMR clusters. He is interested in modern Big Data applications and how customers interact with them.