AWS Big Data Blog
Enhance data security and governance for Amazon Redshift Spectrum with VPC endpoints
Many customers are extending their data warehouse capabilities to their data lake with Amazon Redshift. They are looking to further enhance their security posture where they can enforce access policies on their data lakes based on Amazon Simple Storage Service (Amazon S3). Furthermore, they are adopting security models that require access to the data lake through their private networks.
Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries on data stored in Amazon S3. Redshift Spectrum uses the AWS Glue Data Catalog as a Hive metastore. With a provisioned Redshift data warehouse, Redshift Spectrum compute capacity runs on dedicated servers that are owned by Amazon Redshift and are independent of your Redshift cluster. When enhanced VPC routing is enabled for your Redshift cluster, Redshift Spectrum connects from the Redshift VPC to an elastic network interface (ENI) in your VPC. Because Redshift Spectrum runs on these separate servers, forcing all traffic between Amazon Redshift and Amazon S3 through your VPC requires you to turn on enhanced VPC routing and create a specific network path between your Redshift data warehouse VPC and your S3 data sources.
When using an Amazon Redshift Serverless instance, Redshift Spectrum uses the same compute capacity as your serverless workgroup compute capacity. To access your S3 data sources from Redshift Serverless without traffic leaving your VPC, you can use the enhanced VPC routing option without the need for any additional network configuration.
AWS Lake Formation offers a straightforward and centralized approach to access management for S3 data sources. Lake Formation allows organizations to manage access control for Amazon S3-based data lakes using familiar database concepts such as tables and columns, along with more advanced options such as row-level and cell-level security. Lake Formation uses the AWS Glue Data Catalog to provide access control for Amazon S3.
In this post, we demonstrate how to configure your network for Redshift Spectrum to use a Redshift provisioned cluster’s enhanced VPC routing to access Amazon S3 data through Lake Formation access control. You can set up this integration in a private network with no connectivity to the internet.
Solution overview
With this solution, network traffic is routed through your VPC by enabling Amazon Redshift enhanced VPC routing. With this option, traffic uses the VPC endpoint in preference to an internet gateway, NAT instance, or NAT gateway. To prevent your Redshift cluster from communicating with resources outside of your VPC, you must remove all other routing options. This ensures that all communication is routed through the VPC endpoints.
The following diagram illustrates the solution architecture.
The solution consists of the following steps:
- Create a Redshift cluster in a private subnet network configuration:
  - Enable enhanced VPC routing for your Redshift cluster.
  - Modify the route table to ensure no connectivity to the public network.
- Create the following VPC endpoints for Redshift Spectrum connectivity:
  - AWS Glue interface endpoint.
  - Lake Formation interface endpoint.
  - Amazon S3 gateway endpoint.
- Analyze Amazon Redshift connectivity and network routing:
  - Verify network routes for Amazon Redshift in a private network.
  - Verify network connectivity from the Redshift cluster to various VPC endpoints.
  - Test connectivity using the Amazon Redshift query editor v2.
This integration uses VPC endpoints to establish a private connection from your Redshift data warehouse to Lake Formation, Amazon S3, and AWS Glue.
Prerequisites
To set up this solution, you need basic familiarity with the AWS Management Console, an AWS account, and access to the following AWS services:
- AWS Glue
- AWS Identity and Access Management (IAM)
- Lake Formation
- Amazon Redshift
- Amazon S3
- Amazon Virtual Private Cloud (Amazon VPC)
Additionally, you must have already integrated Lake Formation with Amazon Redshift to access your S3 data lake in a non-private network. For instructions, refer to Centralize governance for your data lake using AWS Lake Formation while enabling a modern data architecture with Amazon Redshift Spectrum.
Create a Redshift cluster in a private subnet network configuration
The first step is to configure your Redshift cluster to only allow network traffic through your VPC and prevent any public routes. To accomplish this, you must enable enhanced VPC routing for your Redshift cluster. Complete the following steps:
- On the Amazon Redshift console, navigate to your cluster.
- Edit your network and security settings.
- For Enhanced VPC routing, select Turn on.
- Disable the Publicly accessible option.
- Choose Save changes and modify the cluster to apply the updates.
You now have a Redshift cluster that can only communicate through the VPC. Next, modify the route table to ensure no connectivity to the public network:
- On the Amazon Redshift console, make a note of the subnet group and identify the subnet associated with this subnet group.
- On the Amazon VPC console, identify the route table associated with this subnet and edit to remove the default route to the NAT gateway.
If your cluster is in a public subnet, you may have to remove the internet gateway route instead. If the subnet is shared with other resources, removing these routes may impact their connectivity.
Your cluster is now in a private network and can’t communicate with any resources outside of your VPC.
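The console steps above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and configured, that turns on enhanced VPC routing, disables public access, and removes the default route. The helper names, the cluster identifier, and the route table ID are hypothetical placeholders, and the clients are passed in so that the functions can be exercised without AWS access.

```python
"""Sketch: make a provisioned Redshift cluster private (assumes boto3 clients are passed in)."""


def private_cluster_settings(cluster_id: str) -> dict:
    """Build ModifyCluster arguments that keep cluster traffic inside the VPC."""
    return {
        "ClusterIdentifier": cluster_id,
        "EnhancedVpcRouting": True,   # force traffic to Amazon S3 through the VPC
        "PubliclyAccessible": False,  # no public endpoint for the cluster
    }


def default_route_to_remove(route_table_id: str) -> dict:
    """Build DeleteRoute arguments for the IPv4 default route (NAT or internet gateway)."""
    return {"RouteTableId": route_table_id, "DestinationCidrBlock": "0.0.0.0/0"}


def make_cluster_private(redshift, ec2, cluster_id: str, route_table_id: str) -> None:
    """Apply both changes using a Redshift client and an EC2 client."""
    redshift.modify_cluster(**private_cluster_settings(cluster_id))
    ec2.delete_route(**default_route_to_remove(route_table_id))


# Usage (hypothetical identifiers; requires boto3 and AWS credentials):
#   import boto3
#   make_cluster_private(boto3.client("redshift"), boto3.client("ec2"),
#                        "my-redshift-cluster", "rtb-0123456789abcdef0")
```

Removing the default route affects every subnet associated with that route table, so check for shared resources before running a script like this.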
Create VPC endpoints for Redshift Spectrum connectivity
After you configure your Redshift cluster to operate within a private network without external connectivity, you need to establish connectivity to the following services through VPC endpoints:
- AWS Glue
- Lake Formation
- Amazon S3
Create an AWS Glue endpoint
To begin with, Redshift Spectrum connects to AWS Glue endpoints to retrieve information from the AWS Glue Data Catalog. To create a VPC endpoint for AWS Glue, complete the following steps:
- On the Amazon VPC console, choose Endpoints in the navigation pane.
- Choose Create endpoint.
- For Name tag, enter an optional name.
- For Service category, select AWS services.
- In the Services section, search for and select your AWS Glue interface endpoint.
- Choose the appropriate VPC and subnets for your endpoint.
- Configure the security group settings and review your endpoint settings.
- Choose Create endpoint to complete the process.
After you create the AWS Glue VPC endpoint, Redshift Spectrum will be able to retrieve information from the AWS Glue Data Catalog within your VPC.
Create a Lake Formation endpoint
Repeat the same process to create a Lake Formation endpoint:
- On the Amazon VPC console, choose Endpoints in the navigation pane.
- Choose Create endpoint.
- For Name tag, enter an optional name.
- For Service category, select AWS services.
- In the Services section, search for and select your Lake Formation interface endpoint.
- Choose the appropriate VPC and subnets for your endpoint.
- Configure the security group settings and review your endpoint settings.
- Choose Create endpoint.
You now have connectivity for Amazon Redshift to Lake Formation and AWS Glue, which allows you to retrieve the catalog and validate permissions on the data lake.
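As a scripted alternative to the console steps for the two interface endpoints, the following sketch builds the request for the EC2 `create_vpc_endpoint` call and creates the AWS Glue and Lake Formation endpoints in one loop. The function names and all resource IDs are hypothetical. It assumes private DNS is wanted, which requires DNS hostnames and DNS resolution to be enabled on the VPC.

```python
"""Sketch: create the AWS Glue and Lake Formation interface endpoints (EC2 client passed in)."""


def interface_endpoint_request(region, service, vpc_id, subnet_ids, security_group_ids):
    """Build CreateVpcEndpoint arguments for an interface endpoint."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: resolve the default service hostname to the endpoint.
        "PrivateDnsEnabled": True,
    }


def create_catalog_endpoints(ec2, region, vpc_id, subnet_ids, security_group_ids):
    """Create both interface endpoints and return their IDs keyed by service."""
    endpoint_ids = {}
    for service in ("glue", "lakeformation"):
        request = interface_endpoint_request(
            region, service, vpc_id, subnet_ids, security_group_ids
        )
        response = ec2.create_vpc_endpoint(**request)
        endpoint_ids[service] = response["VpcEndpoint"]["VpcEndpointId"]
    return endpoint_ids


# Usage (hypothetical IDs; requires boto3 and AWS credentials):
#   import boto3
#   create_catalog_endpoints(boto3.client("ec2"), "us-east-1", "vpc-0abc",
#                            ["subnet-0abc"], ["sg-0abc"])
```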
Create an Amazon S3 endpoint
The next step is to create a VPC endpoint for Amazon S3 to enable Redshift Spectrum to access data stored in Amazon S3 via VPC endpoints:
- On the Amazon VPC console, choose Endpoints in the navigation pane.
- Choose Create endpoint.
- For Name tag, enter an optional name.
- For Service category, select AWS services.
- In the Services section, search for and select your Amazon S3 gateway endpoint.
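- Choose the appropriate VPC and the route tables associated with your Redshift cluster’s subnet. Gateway endpoints attach to route tables rather than to subnets.
- Gateway endpoints don’t use security groups, so review the endpoint policy and your endpoint settings.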
- Choose Create endpoint.
With the creation of the VPC endpoint for Amazon S3, you have completed all necessary steps to ensure that your Redshift cluster can privately communicate with the required services via VPC endpoints within your VPC.
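The Amazon S3 gateway endpoint can be created with the same EC2 `create_vpc_endpoint` call. The sketch below is illustrative only: the function names and IDs are hypothetical. Note that a gateway endpoint is associated with route tables rather than with subnets and security groups, so the request carries `RouteTableIds`.

```python
"""Sketch: create the Amazon S3 gateway endpoint (EC2 client passed in)."""


def s3_gateway_endpoint_request(region, vpc_id, route_table_ids):
    """Build CreateVpcEndpoint arguments for an S3 gateway endpoint."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Route tables of the subnets that host the Redshift cluster.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(ec2, region, vpc_id, route_table_ids):
    """Create the gateway endpoint and return its ID."""
    response = ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_request(region, vpc_id, route_table_ids)
    )
    return response["VpcEndpoint"]["VpcEndpointId"]


# Usage (hypothetical IDs; requires boto3 and AWS credentials):
#   import boto3
#   create_s3_gateway_endpoint(boto3.client("ec2"), "us-east-1",
#                              "vpc-0abc", ["rtb-0abc"])
```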
It’s important to ensure that the security groups attached to the interface VPC endpoints (AWS Glue and Lake Formation) are properly configured, because a missing or incorrect inbound rule can cause your connection to time out. Verify that the inbound rules allow HTTPS traffic (TCP port 443) from your Redshift cluster to reach the VPC endpoints.
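One way to express such an inbound rule in code is sketched below. It assumes the interface endpoints and the Redshift cluster use separate security groups, and that the endpoints only need HTTPS (TCP port 443) from the cluster. The function names and group IDs are hypothetical.

```python
"""Sketch: allow HTTPS from the Redshift cluster to the interface endpoints (EC2 client passed in)."""


def https_from_cluster_rule(endpoint_sg_id, cluster_sg_id):
    """Build AuthorizeSecurityGroupIngress arguments for the endpoint security group."""
    return {
        "GroupId": endpoint_sg_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 443,
                "ToPort": 443,
                # Allow traffic that originates from the cluster's security group.
                "UserIdGroupPairs": [
                    {"GroupId": cluster_sg_id, "Description": "HTTPS from Redshift"}
                ],
            }
        ],
    }


def allow_cluster_to_reach_endpoints(ec2, endpoint_sg_id, cluster_sg_id):
    """Add the inbound rule to the endpoint security group."""
    return ec2.authorize_security_group_ingress(
        **https_from_cluster_rule(endpoint_sg_id, cluster_sg_id)
    )


# Usage (hypothetical IDs; requires boto3 and AWS credentials):
#   import boto3
#   allow_cluster_to_reach_endpoints(boto3.client("ec2"), "sg-endpoint", "sg-redshift")
```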
Analyze traffic and network topology
You can use the following methods to verify the network paths from Amazon Redshift to other endpoints.
Verify network routes for Amazon Redshift in a private network
You can use an Amazon VPC resource map to visualize Amazon Redshift connectivity. The resource map shows the interconnections between resources within a VPC and the flow of traffic between subnets, NAT gateways, internet gateways, and gateway endpoints. As shown in the following screenshot, the highlighted subnet where the Redshift cluster is running doesn’t have connectivity to a NAT gateway or internet gateway. The route table associated with the subnet can reach Amazon S3 only through the VPC endpoint.
Note that AWS Glue and Lake Formation endpoints are interface endpoints and not visible on a resource map.
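The same check can be approximated programmatically by inspecting the output of the EC2 `describe_route_tables` call. The sketch below is a hypothetical helper, not part of the original walkthrough. It flags any route that targets an internet gateway, NAT gateway, or instance, and ignores local routes and gateway endpoint routes.

```python
"""Sketch: find routes that lead out of the VPC in a DescribeRouteTables response."""


def public_routes(route_tables_response):
    """Return (route_table_id, destination) pairs for routes that leave the VPC."""
    found = []
    for table in route_tables_response.get("RouteTables", []):
        for route in table.get("Routes", []):
            gateway_id = route.get("GatewayId") or ""
            leaves_vpc = (
                gateway_id.startswith("igw-")  # internet gateway
                or "NatGatewayId" in route     # NAT gateway
                or "InstanceId" in route       # for example, a NAT instance
            )
            if leaves_vpc:
                destination = route.get("DestinationCidrBlock") or route.get(
                    "DestinationIpv6CidrBlock"
                )
                found.append((table["RouteTableId"], destination))
    return found


# Usage (hypothetical subnet ID; requires boto3 and AWS credentials):
#   import boto3
#   response = boto3.client("ec2").describe_route_tables(
#       Filters=[{"Name": "association.subnet-id", "Values": ["subnet-0abc"]}]
#   )
#   print(public_routes(response))  # an empty list means no public routes
```

A subnet without an explicit route table association uses the VPC’s main route table, in which case the subnet filter shown in the usage comment returns no tables and the main route table has to be checked instead.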
Verify network connectivity from the Redshift cluster to various VPC endpoints
You can verify connectivity from your Redshift cluster subnet to all VPC endpoints using the Reachability Analyzer. The Reachability Analyzer is a configuration analysis tool that enables you to perform connectivity testing between a source resource and a destination resource in your VPCs. Complete the following steps:
- On the Amazon Redshift console, navigate to the Redshift cluster configuration page and note the internal IP address.
- On the Amazon EC2 console, search for your ENI by filtering by the IP address.
- Choose the ENI associated with your Redshift cluster and choose Run Reachability Analyzer.
- For Source type, choose Network interfaces.
- For Source, choose the Redshift ENI.
- For Destination type, choose VPC endpoints.
- For Destination, choose your VPC endpoint.
- Choose Create and analyze path.
- When analysis is complete, view the analysis to see reachability.
As shown in the following screenshot, the Redshift cluster has connectivity to the Lake Formation endpoint.
You can repeat these steps to verify network reachability for all other VPC endpoints.
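To repeat the analysis for several endpoints without clicking through the console, the Reachability Analyzer can be driven through the EC2 API. The following sketch assumes HTTPS on TCP port 443 as the traffic to test. The function name, ENI ID, and endpoint ID are hypothetical, and the created path is not deleted afterward.

```python
"""Sketch: run a Reachability Analyzer path from the Redshift ENI to a VPC endpoint."""
import time


def is_reachable(ec2, eni_id, endpoint_id, port=443, poll_seconds=5, max_polls=60):
    """Create a path, start an analysis, and return True if a network path is found."""
    path = ec2.create_network_insights_path(
        Source=eni_id,
        Destination=endpoint_id,
        Protocol="tcp",
        DestinationPort=port,
    )
    path_id = path["NetworkInsightsPath"]["NetworkInsightsPathId"]

    analysis = ec2.start_network_insights_analysis(NetworkInsightsPathId=path_id)
    analysis_id = analysis["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # Poll until the analysis leaves the "running" state.
    for _ in range(max_polls):
        result = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )["NetworkInsightsAnalyses"][0]
        if result["Status"] != "running":
            return result["Status"] == "succeeded" and bool(
                result.get("NetworkPathFound")
            )
        time.sleep(poll_seconds)
    raise TimeoutError(f"analysis {analysis_id} did not finish in time")


# Usage (hypothetical IDs; requires boto3 and AWS credentials):
#   import boto3
#   is_reachable(boto3.client("ec2"), "eni-0abc", "vpce-0abc")
```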
Test connectivity by running a SQL query from the Amazon Redshift query editor v2
You can verify connectivity by running a SQL query against your Redshift Spectrum table using the Amazon Redshift query editor v2, as shown in the following screenshot.
Congratulations! You can now query Redshift Spectrum tables from a provisioned cluster with enhanced VPC routing enabled, so that traffic stays within your AWS network.
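If you prefer a scripted check over the query editor, one option is the Amazon Redshift Data API. This is a swapped-in technique rather than the method used in this post, and it assumes the caller can reach the Data API endpoint. The function names, schema, table, database, and user are hypothetical placeholders.

```python
"""Sketch: count rows in a Redshift Spectrum table through the Redshift Data API."""
import re
import time

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def count_rows_sql(schema, table):
    """Build a COUNT(*) query, rejecting names that are not plain identifiers."""
    for name in (schema, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return f"SELECT COUNT(*) FROM {schema}.{table};"


def count_spectrum_rows(data_api, cluster_id, database, db_user, schema, table,
                        poll_seconds=2, max_polls=150):
    """Run the query with a redshift-data client and return the row count."""
    statement = data_api.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=count_rows_sql(schema, table),
    )
    for _ in range(max_polls):
        status = data_api.describe_statement(Id=statement["Id"])
        if status["Status"] == "FINISHED":
            result = data_api.get_statement_result(Id=statement["Id"])
            return result["Records"][0][0]["longValue"]
        if status["Status"] in ("FAILED", "ABORTED"):
            raise RuntimeError(status.get("Error", "query did not finish"))
        time.sleep(poll_seconds)
    raise TimeoutError("query did not finish in time")


# Usage (hypothetical names; requires boto3 and AWS credentials):
#   import boto3
#   count_spectrum_rows(boto3.client("redshift-data"), "my-redshift-cluster",
#                       "dev", "awsuser", "spectrum_schema", "sales")
```

A query that hangs and then fails here usually points back to a missing endpoint or security group rule rather than to the SQL itself.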
Clean up
You should clean up the resources you created as part of this exercise to avoid unnecessary cost to your AWS account. Complete the following steps:
- On the Amazon VPC console, choose Endpoints in the navigation pane.
- Select the endpoints you created and on the Actions menu, choose Delete VPC endpoints.
- On the Amazon Redshift console, navigate to your Redshift cluster.
- Edit the cluster network and security settings and select Turn off for Enhanced VPC routing.
- You can also delete your Amazon S3 data and Redshift cluster if you are not planning to use them further.
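The cleanup steps can be sketched in code as follows, assuming you recorded the IDs of the endpoints you created. The function name and all identifiers are hypothetical. The sketch does not restore any routes that were removed earlier and does not delete the cluster or the S3 data.

```python
"""Sketch: delete the VPC endpoints and turn off enhanced VPC routing (clients passed in)."""


def clean_up(ec2, redshift, endpoint_ids, cluster_id):
    """Delete the endpoints, disable enhanced VPC routing, and return failed deletions."""
    response = ec2.delete_vpc_endpoints(VpcEndpointIds=list(endpoint_ids))
    redshift.modify_cluster(
        ClusterIdentifier=cluster_id,
        EnhancedVpcRouting=False,
    )
    # Endpoints that could not be deleted are reported under "Unsuccessful".
    return response.get("Unsuccessful", [])


# Usage (hypothetical IDs; requires boto3 and AWS credentials):
#   import boto3
#   clean_up(boto3.client("ec2"), boto3.client("redshift"),
#            ["vpce-0aaa", "vpce-0bbb", "vpce-0ccc"], "my-redshift-cluster")
```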
Conclusion
By moving your Redshift data warehouse to a private network setting and enabling enhanced VPC routing, you can enhance the security posture of your Redshift cluster by limiting access to only authorized networks.
We want to acknowledge our fellow AWS colleagues Harshida Patel, Fabricio Pinto, and Soumyajeet Patra for providing their insights with this blog post.
If you have any questions or suggestions, leave your feedback in the comments section. If you need further assistance with securing your S3 data lakes and Redshift data warehouses, contact your AWS account team.
Additional resources
- 10 Best Practices for Amazon Redshift Spectrum
- Amazon QuickSight Adds Support for Amazon Redshift Spectrum
- Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data
About the Authors
Kanwar Bajwa is an Enterprise Support Lead at AWS who works with customers to optimize their use of AWS services and achieve their business objectives.
Swapna Bandla is a Senior Solutions Architect in the AWS Analytics Specialist SA Team. Swapna has a passion for understanding customers’ data and analytics needs and empowering them to develop cloud-based well-architected solutions. Outside of work, she enjoys spending time with her family.