AWS Big Data Blog

Simplify data access for your enterprise using Amazon SageMaker Lakehouse

Organizations are increasingly using data to make decisions and drive innovation. However, building data-driven applications can be challenging. It often requires multiple teams working together and integrating various data sources, tools, and services. For example, creating a targeted marketing app involves data engineers, data scientists, and business analysts using different systems and tools. This complexity leads to several issues: it takes time to learn multiple systems, it’s difficult to manage data and code across different services, and controlling access for users across various systems is complicated. Currently, organizations often create custom solutions to connect these systems, but they want a more unified approach that them to choose the best tools while providing a streamlined experience for their data teams. The use of separate data warehouses and lakes has created data silos, leading to problems such as lack of interoperability, duplicate governance efforts, complex architectures, and slower time to value.

You can use Amazon SageMaker Lakehouse to achieve unified access to data in both data warehouses and data lakes. Through SageMaker Lakehouse, you can use preferred analytics, machine learning, and business intelligence engines through an open, Apache Iceberg REST API to help ensure secure access to data with consistent, fine-grained access controls.

Solution overview

Let’s consider Example Retail Corp, which is facing increasing customer churn. Its management wants to implement a data-driven approach to identify at-risk customers and develop targeted retention strategies. However, the customer data is scattered across different systems and services, making it challenging to perform comprehensive analyses. Today, Example Retail Corp manages sales data in its data warehouse and customer data in Apache Iceberg tables in Amazon Simple Storage Service (Amazon S3). It uses Amazon EMR Serverless for data processing and machine learning. For governance, it uses AWS Glue Data Catalog as the central technical catalog and AWS Lake Formation as the permission store for enforcing fine-grained access controls. Its main objective is to implement a unified data management system that now combines data from varied sources, enables secure access across enterprise, and allow disparate teams to use preferred tools to predict, analyze, and consume customer churn information.

Let’s examine how Example Retail Corp can use SageMaker Lakehouse to achieve its unified data management vision using this reference architecture diagram.

Personas

There are four personas used in this solution.

  • The Data Lake Admin has an AWS Identity and Access Management (IAM) admin role and is a Lake Formation administrator responsible for managing user permissions to catalog objects using Lake Formation.
  • The Data Warehouse Admin has an IAM admin role and manages databases in Amazon Redshift.
  • The Data Engineer has an IAM ETL role and runs the extract, transform, and load (ETL) pipeline using Spark to populate the Lakehouse catalog on RMS.
  • The Data Analyst has an IAM analyst role and performs churn analysis on SageMaker Lakehouse data using Amazon Athena and Amazon Redshift.

Dataset

The following table describes the elements of the dataset.

Schema Table Data source
public customer_churn Lakehouse catalog with storage on RMS
customerdb customer Lakehouse catalog with storage on Amazon S3
sales store_sales Data warehouse

Prerequisites

To follow along on the solution walkthrough, you need to have the following:

  1. Create a user defined IAM role following the instruction in Requirements for roles used to register locations. For this post, we will use IAM role LakeFormationRegistrationRole.
  2. An Amazon Virtual Private Cloud (Amazon VPC) with private and public subnets.
  3. Create an S3 bucket. For this post, we will use customer_data as the bucket name.
  4. Create an Amazon Redshift serverless endpoint called sales_dw which will host store_sales dataset.
  5. Create an Amazon Redshift serverless endpoint called sales_analysis_dw for churn analysis by sales analysts.
  6. Create an IAM role named DataTransferRole following the instructions in Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.
  7. Install or update the latest version of the AWS CLI. For instructions, see Installing or updating to the latest version of the AWS CLI.
  8. Create a data lake admin using the instructions in Create a data lake administrator. For this post, we will use an IAM role called Admin.

Configure Datalake administrators :

Sign in to the AWS Management Console as Admin and go to AWS Lake Formation. In the navigation pane, choose Administration roles and then choose Tasks under Administration. Under Data lake administrators, choose Add:

  1. In the Add administrators page, under Access type, choose Data lake administrator.
  2. Under IAM users and roles, select Admin. Choose Confirm.
  3. On the Add administrators page, for Access type select Read-only administrators. Under IAM users and roles, select AWSServiceRoleForRedshift and choose Conrm. This step enables Amazon Redshift to discover and access catalog objects in AWS Glue Data Catalog.

Solution walkthrough

Create a customer table in the Amazon S3 data lake in AWS Glue Data Catalog

  1. Create an AWS Glue database called customerdb in the default catalog in your account by going to the AWS Lake Formation console and choosing Databases in the navigation pane.
  2. Select the database that you just created and choose Edit.
  3. Clear the checkbox Use only IAM access control for new tables in this database.
  4. Sign in to the Athena console as Admin and select Workgroup that the role has access to. Run the following SQL:
    CREATE EXTERNAL TABLE `tempcustomer`(
      `c_salutation` string, 
      `c_preferred_cust_flag` string, 
      `c_first_sales_date_sk` int, 
      `c_customer_sk` int, 
      `c_login` string, 
      `c_current_cdemo_sk` int, 
      `c_first_name` string, 
      `c_current_hdemo_sk` int, 
      `c_current_addr_sk` int, 
      `c_last_name` string, 
      `c_customer_id` string, 
      `c_last_review_date_sk` int, 
      `c_birth_month` int, 
      `c_birth_country` string, 
      `c_birth_year` int, 
      `c_birth_day` int, 
      `c_first_shipto_date_sk` int, 
      `c_email_address` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://customer_data/tempcustomer'
    
    INSERT INTO tempcustomer
    VALUES('Dr.','N',2452077,13251813,'Y',1381546,'Joyce',2645,2255449,'Deaton','AAAAAAAAFOEDKMAA',2452543,1,'GREECE',1987,29,2250667,'Joyce.Deaton@qhtrwert.edu'),
    ('Dr.','N',2450637,12755125,'Y',1581546,'Daniel',9745,4922716,'Dow','AAAAAAAAFLAKCMAA',2432545,1,'INDIA',1952,3,2450667,'Daniel.Cass@hz05IuguG5b.org'),
    ('Dr.','N',2452342,26009249,'Y',1581536,'Marie',8734,1331639,'Lange','AAAAAAAABKONMIBA',2455549,1,'CANADA',1934,5,2472372,'Marie.Lange@ka94on0lHy.edu'),
    ('Dr.','N',2452342,3270685,'Y',1827661,'Wesley',1548,11108235,'Harris','AAAAAAAANBIOBDAA',2452548,1,'ROME',1986,13,2450667,'Wesley.Harris@c7NpgG4gyh.edu'),
    ('Dr.','N',2452342,29033279,'Y',1581536,'Alexandar',8262,8059919,'Salyer','AAAAAAAAPDDALLBA',2952543,1,'SWISS',1980,6,2650667,'Alexander.Salyer@GxfK3iXetN.edu'),
    ('Miss','N',2452342,6520539,'Y',3581536,'Jerry',1874,36370,'Tracy','AAAAAAAALNOHDGAA',2452385,1,'ITALY',1957,8,2450667,'Jerry.Tracy@VTtQp8OsUkv2hsygIh.edu')
    
    CREATE TABLE customer
    WITH (table_type = 'ICEBERG',
    format = 'PARQUET',
    location = 's3://customer_data/customer/',
    is_external = false
    ) as select * from tempcustomer;
  5. Register the S3 bucket with Lake Formation:
    • Sign in to the Lake Formation console as Data Lake Admin.
    • In the navigation pane, choose Administration, and then choose Data lake locations.
    • Choose Register location.
    • For the Amazon S3 path, enter s3://customer_data/.
    • For the IAM role, choose LakeFormationRegistrationRole.
    • For Permission mode, select Lake Formation.
    • Choose Register location.

Create the salesdb database in Amazon Redshift

  1. Sign in to the Redshift endpoint sales_dw as Admin user. Run following script to create a database named salesdb.
    Create database salesdb;
  2. Connect to salesdb. Run the following script to create schema sales and the store_sales table and populate it with data.
    Create schema sales;
    CREATE TABLE sales.store_sales (
        sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
        customer_sk INTEGER NOT NULL,
        sale_date DATE NOT NULL,
        sale_amount DECIMAL(10, 2) NOT NULL,
        product_name VARCHAR(100) NOT NULL,
        last_purchase_date DATE
    );
    
    INSERT INTO sales.store_sales (customer_sk, sale_date, sale_amount, product_name, last_purchase_date)
    VALUES
        (13251813, '2023-01-15', 150.00, 'Widget A', '2023-01-15'),
        (29033279, '2023-01-20', 200.00, 'Gadget B', '2023-01-20'),
        (12755125, '2023-02-01', 75.50, 'Tool C', '2023-02-01'),
        (26009249, '2023-02-10', 300.00, 'Widget A', '2023-02-10'),
        (3270685, '2023-02-15', 125.00, 'Gadget B', '2023-02-15'),
        (6520539, '2023-03-01', 100.00, 'Tool C', '2023-03-01'),
        (10251183, '2023-03-10', 250.00, 'Widget A', '2023-03-10'),
        (10251283, '2023-03-15', 180.00, 'Gadget B', '2023-03-15'),
        (10251383, '2023-04-01', 90.00, 'Tool C', '2023-04-01'),
        (10251483, '2023-04-10', 220.00, 'Widget A', '2023-04-10'),
        (10251583, '2023-04-15', 175.00, 'Gadget B', '2023-04-15'),
        (10251683, '2023-05-01', 130.00, 'Tool C', '2023-05-01'),
        (10251783, '2023-05-10', 280.00, 'Widget A', '2023-05-10'),
        (10251883, '2023-05-15', 195.00, 'Gadget B', '2023-05-15'),
        (10251983, '2023-06-01', 110.00, 'Tool C', '2023-06-01'),
        (10251083, '2023-06-10', 270.00, 'Widget A', '2023-06-10'),
        (10252783, '2023-06-15', 185.00, 'Gadget B', '2023-06-15'),
        (10253783, '2023-07-01', 95.00, 'Tool C', '2023-07-01'),
        (10254783, '2023-07-10', 240.00, 'Widget A', '2023-07-10'),
        (10255783, '2023-07-15', 160.00, 'Gadget B', '2023-07-15');

Create the churn_lakehouse RMS catalog in Glue Data Catalog

This catalog will contain the customer churn table with managed RMS storage, which will be populated using Amazon EMR.

We will manage the customer churn data in an AWS Glue managed catalog with managed RMS storage. This data is produced from an analysis conducted in EMR Serverless and is accessible in the presentation layer to serve to business intelligence (BI) applications.

Create Lakehouse (RMS) catalog

  1. Sign in to the Lake Formation console as Data Lake Admin.
  2. In the left navigation pane, choose Data Catalog, and then Catalogs New. Choose Create catalog.
  1. Provide the details for the catalog:
    • Name: Enter churn_lakehouse.
    • Type: Select Managed catalog.
    • Storage: Select Redshift.
    • Under Access from engines, make sure that Access this catalog from Iceberg compatible engines is selected.
    • Choose Next.
    • Under Principals, select IAM users and roles. Under IAM users and roles, select the Admin Under Catalog permissions, select Super user.
    • Choose Add, and then choose Create catalog.

Access churn_lakehouse RMS catalog from Amazon EMR Spark engine

  1. Set up an EMR Studio.
  2. Create an EMR Serverless application using CLI command.
    aws emr-serverless create-application --region <aws_region> \
    --name 'Churn_Analysis' \
    --type 'SPARK' \
    --release-label emr-7.5.0 \
    --network-configuration '{"subnetIds": ["<subnet2>", "<subnet2>"], "securityGroupIds": [<security_group>]}'

Sign in to EMR Studio and use the EMR Studio Workspace

  1. Sign in to the EMR Studio console and choose Workspaces in the navigation pane, and then choose Create Workspace.
  2. Enter a name and a description for the Workspace.
  3. Choose Create Workspace. A new tab containing JupyterLab will open automatically when the Workspace is ready. Enable pop-ups in your browser if necessary.
  4. Choose the Compute icon in the navigation pane to attach the EMR Studio Workspace with a compute engine.
  5. Select EMR Serverless application for Compute type.
  6. Choose Churn_Analysis for EMR-S Application.
  7. For Runtime role, choose Admin.
  8. Choose Attach.

Download the notebook, import it, choose PySpark kernel and execute the cells that will create the table.

Manage your users’ fine-grained access to catalog objects using AWS Lake Formation

Grant the following permissions to the Analyst role on the resources as shown in the following table.

Catalog Database Table Permission
<account_id>:churn_lakehouse/dev public customer_churn Column permission:
<account_id> customerdb customer Table permission
<account_id>:sales_lakehouse/salesdb sales store_sales All table permission
  1. Sign in to the Lake Formation console as Data Lake Admin. In the navigation pane, choose Data Lake Permissions, and then choose Grant.
  2. For IAM user and roles, choose Analyst IAM role. For resources choose as shown below and grant.
  3. For IAM user and roles, choose Analyst IAM Role. For resource choose as shown below and grant.
  4. For IAM user and roles, choose Analyst IAM Role. For resource choose as shown below and grant.

Perform churn analysis using multiple engines:

Using Athena

Sign in to the Athena console using the IAM Analyst role, select the workgroup that the role has access to. Run the following SQL combining data from the data warehouse and Lake House RMS catalog for churn analysis:

SELECT 
c.c_customer_id,
c.c_first_name,
c.c_last_name,
c.c_email_address,
ss.sale_amount,
cc.is_churned
FROM 
    "customerdb"."customer" c
LEFT JOIN 
    "sales_lakehouse/salesdb"."sales"."store_sales" ss ON c.c_customer_sk = ss.customer_sk
LEFT JOIN 
    "churn_lakehouse/dev"."public"."customer_churn" cc ON c.c_customer_sk  = cc.customer_id
WHERE cc.is_churned = true
;

The following figure shows the results, which include customer IDs, names, and other information.

Using Amazon Redshift

Sign in to the Redshift Sale cluster QEV2 using the IAM Analyst role. Sign in using temporary credentials using your IAM identity and run the following SQL command:

SELECT 
c.c_customer_id,
c.c_first_name,
c.c_last_name,
c.c_email_address,
ss.sale_amount,
cc.is_churned
FROM 
   "awsdatacatalog"."customerdb"."customer" c
LEFT JOIN 
    "salesdb@sales_lakehouse"."sales"."store_sales" ss ON c.c_customer_sk = ss.customer_sk
LEFT JOIN 
    "dev@churn_lakehouse"."public"."customer_churn" cc ON c.c_customer_sk  = cc.customer_id
WHERE cc.is_churned = true
;

The following figure shows the results, which include customer IDs, names, and other information.

Clean up

Complete the following steps to delete the resources you created to avoid unexpected costs:

  1. Deletethe Redshift Serverless workgroups.
  2. Deletethe Redshift Serverless associated namespace.
  3. Delete EMR Studio and Application created.
  4. Delete Glue resources and Lake Formation permissions.
  5. Empty the bucket and delete the bucket.

Conclusion

In this post, we showcased how you can use Amazon SageMaker Lakehouse to achieve unified access to data across your data warehouses and data lakes. With unified access, you can use preferred analytics, machine learning, and business intelligence engines through an open, Apache Iceberg REST API and secure access to data with consistent, fine-grained access controls. Try Amazon SageMaker Lakehouse in your environment and share your feedback with us.


About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with product team and customer to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Harshida Patel is a Analytics Specialist Principal Solutions Architect, with AWS.

Angel Conde Manjon is a Sr. PSA Specialist on Data & AI, based in Madrid, and focuses on EMEA South and Israel. He has previously worked on research related to data analytics and artificial intelligence in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.