AWS Big Data Blog
Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse
With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. Traditional extract, transform, and load (ETL) processes have long been a staple of data integration because of their flexibility. For common use cases such as replication and ingestion, however, they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.
In addition, as organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To harness the full potential of their data, businesses must enable seamless access and consolidation from these varied sources. However, this task is complicated by the unique characteristics of modern systems, such as differing API protocols, implementations, and rate limits. To address these challenges and accelerate innovation, AWS Glue has recently expanded its third-party application support by introducing native connectors for 19 applications.
To utilize these new application connectors for well-defined use cases such as replication and ingestion, AWS Glue is also launching zero-ETL integration support from external applications. With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift.
Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines. By directly integrating with Lakehouse, all the data is automatically cataloged and can be secured through fine-grained permissions in Lake Formation.
What is zero-ETL?
Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. It makes data available in Amazon SageMaker Lakehouse and Amazon Redshift from multiple operational, transactional, and enterprise sources. Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to prepare it for analytics, artificial intelligence (AI), and machine learning (ML) workloads. You don’t need to maintain complex ETL pipelines. We take care of the ETL for you by automating the creation and management of data replication.
What’s the difference between zero-ETL and Glue ETL?
AWS Glue now offers multiple ways for you to build data integration pipelines, depending on your integration needs.
- Zero-ETL provides service-managed replication. It’s designed for scenarios where customers need a fully managed, efficient way to replicate data from one source to AWS with minimal configuration. Zero-ETL handles the entire replication process, including schema discovery and evolution, without requiring customers to write or manage any custom logic. This approach is ideal for creating up-to-date replicas of source data in near-real-time, with AWS managing the underlying infrastructure and replication process.
- Glue ETL offers customer-managed data ingestion. It’s the preferred choice when customers need more control and customization over the data integration process or require complex transformations. With Glue ETL, customers can write custom transformation logic, combine data from multiple sources, apply data quality rules, add calculated fields, and perform advanced data cleansing or aggregation. This flexibility makes Glue ETL suitable for scenarios where data must be transformed or enriched before analysis.
Source connections are reusable between Glue ETL and Glue zero-ETL, so you can easily support both patterns. After you create a connection once, you can use the same connection across various AWS Glue components, including Glue ETL, Glue Visual ETL, and zero-ETL. For example, you might start by creating a connection and a zero-ETL integration, but decide later to use the same connection to create a custom Glue ETL pipeline.
This blog post explores how AWS Glue zero-ETL capabilities, combined with the new application connectors, are transforming the way businesses integrate and analyze their data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP, and others.
Use case
Consider a large company that relies heavily on data-driven insights to optimize its customer support processes. The company stores vast amounts of transactional data in ServiceNow. To gain a comprehensive understanding of their business and make informed decisions, the company needs to integrate and analyze data from ServiceNow seamlessly, identifying and addressing problems and root causes, managing service level agreements and compliance, and proactively planning for incident prevention.
The company is looking for an efficient, scalable, and cost-effective solution for collecting and ingesting data from ServiceNow. The solution must provide continuous near real-time replication, automated availability of new data attributes, robust monitoring capabilities to track data load statistics, and a reliable data lake foundation that supports data versioning. This allows data analysts, data engineers, and data scientists to quickly explore ingested data and develop data products that meet the needs of business teams.
Solution overview
The following architecture diagram illustrates an efficient and scalable solution for collecting and ingesting replicated data from ServiceNow with zero-ETL integration. In this example we use ServiceNow as a source, but this can be done with any supported source such as Salesforce, Zendesk, SAP, or others. The AWS Glue managed connectors act as a bridge between ServiceNow and the target Amazon SageMaker Lakehouse, enabling seamless, near real-time data flow without the need for custom ETL and scheduling.
The following are the key components and steps in the integration process:
- Zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service. The data is also registered in the Glue Data Catalog, a metadata repository. Additionally, it keeps the information synchronized by capturing changes that occur in ServiceNow and maintains data consistency by automatically performing schema evolution.
- Amazon CloudWatch, a monitoring and observability service, collects logs and metrics from the data integration process.
- Amazon EventBridge, a serverless event bus service, triggers a downstream process that allows you to build event-driven architecture as soon as your new data arrives in your target. Through EventBridge, customers can build on top of zero-ETL for a diverse set of use cases such as:
  - Trigger Glue ETL to perform transformations and aggregations on the data to create specific analysis.
  - Trigger a Directed Acyclic Graph (DAG) in Amazon Managed Workflows for Apache Airflow (Amazon MWAA).
  - Trigger a state machine in AWS Step Functions.
  - Notify the status of replications and their details to downstream applications.
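As a rough illustration of the event-driven pattern above, the following sketch shows how a small handler might map a replication event to one of these downstream actions. The standard EventBridge envelope carries a detail object, but the status key inside it, its values, and the action names used here are hypothetical placeholders rather than a documented event schema. Inspect the events that your integration actually emits before adapting it.

```python
from typing import Any, Dict

# Hypothetical mapping from a replication status to a downstream action.
# Both the status values and the action names are assumptions made for
# illustration only.
ACTIONS = {
    "SUCCEEDED": "start_glue_job",
    "FAILED": "notify_downstream",
}


def choose_action(event: Dict[str, Any]) -> str:
    """Return the name of the downstream action for one EventBridge event."""
    detail = event.get("detail", {})
    # "status" is an assumed key; unknown or missing values are ignored.
    status = str(detail.get("status", "")).upper()
    return ACTIONS.get(status, "ignore")


sample_event = {"source": "aws.glue", "detail": {"status": "succeeded"}}
print(choose_action(sample_event))
```

Keeping the routing decision in a pure function makes it easy to test without AWS access. The function that calls it can then start the Glue job, the Airflow DAG, or the Step Functions state machine.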
Prerequisites
Complete the following prerequisites before setting up the solution:
- Create a bucket in Amazon S3 called zero-etl-demo-<your AWS Account Number>-<AWS Region> (for example, zero-etl-demo-012345678901-us-east-1). The bucket will be used to store the data ingested by zero-ETL in Apache Iceberg, an open table format (OTF) that supports ACID transactions (atomicity, consistency, isolation, and durability), seamless schema evolution, and data versioning using time travel.
- Create an AWS Glue database <your database name>, such as zero_etl_demo_db, and associate the S3 bucket zero-etl-demo-<your AWS Account Number>-<AWS Region> as the location of the database. The database will be used to store the metadata related to the data integrations performed by zero-ETL.
- Update AWS Glue Data Catalog settings using the following IAM policy for fine-grained access control of the data catalog for zero-ETL.
- Create an AWS Identity and Access Management (IAM) role named zero_etl_demo_role. The IAM role will be used by zero-ETL to access the Glue connector, read from ServiceNow, and write the data into the target. Optionally, you can create two separate IAM roles (one associated with your source data and another associated with your target).
- Make sure you have a ServiceNow instance named ServiceNowInstance, a user named ServiceNowUser, and a password ServiceNowPassword with the required permissions to read from ServiceNow. The instance name, user, and password are used in the AWS Glue connection to authenticate with ServiceNow using the BASIC authentication type. Optionally, you can choose OAUTH2 if your ServiceNow instance supports it.
- Create the secret zero_etl_demo_secret in AWS Secrets Manager to store the ServiceNow credentials.
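The naming conventions in these prerequisites can be captured in a few helper functions. The following is a minimal sketch, not a complete setup script: it only builds the bucket name, the input for the Glue database, and a JSON secret string. The USERNAME and PASSWORD key names in the secret are assumptions made for illustration, so check the ServiceNow connector documentation for the keys it expects.

```python
import json


def bucket_name(account_id: str, region: str) -> str:
    """Build the demo bucket name used throughout this post."""
    return f"zero-etl-demo-{account_id}-{region}"


def database_input(database: str, bucket: str) -> dict:
    """Build a DatabaseInput dictionary that points the database at the bucket."""
    return {"Name": database, "LocationUri": f"s3://{bucket}/"}


def secret_string(username: str, password: str) -> str:
    """Serialize credentials as JSON; the key names are assumed placeholders."""
    return json.dumps({"USERNAME": username, "PASSWORD": password})


bucket = bucket_name("012345678901", "us-east-1")
print(bucket)
print(database_input("zero_etl_demo_db", bucket))
```

If you use the AWS SDK for Python (Boto3), the dictionary returned by database_input can be passed as the DatabaseInput argument of the Glue client's create_database call.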
Build and verify the zero-ETL integration
Complete the following steps to create and validate zero-ETL integration:
Step 1: Set up a connector
Zero-ETL integration, when used with the application connectors that AWS Glue natively supports, provides a straightforward way to bring third-party data into an Amazon S3 transactional data lake or Amazon Redshift. Use the following steps to create a ServiceNow data connection:
- Open the AWS Glue console.
- In the navigation pane, under Data catalog, choose Connections.
- Choose Create Connection.
- In the Create Connection pane, enter ServiceNow in Data Sources.
- Choose ServiceNow.
- Choose Next.
- For Instance Name, enter ServiceNowInstance (created as part of the prerequisites).
- For IAM service role, choose zero_etl_demo_role (created as part of the prerequisites).
- For Authentication Type, choose the authentication type that you’re using for ServiceNow. In this example, we chose OAUTH2, which requires setting up Application Registries in ServiceNow.
- For AWS Secret, choose the secret zero_etl_demo_secret (created as part of the prerequisites).
- Choose Next.
- In the Connection Properties section, for Name, enter zero_etl_demo_conn.
- Choose Next.
- Choose Create connection.
- A pop-up from ServiceNow appears after you choose Create connection. Choose Allow.
Step 2: Set up Zero-ETL integration
After creating the data connection to ServiceNow, use the following steps to create the zero-ETL integration:
- Open the AWS Glue console.
- In the navigation pane, under Data catalog, choose Zero-ETL integrations.
- Choose Create zero-ETL integration.
- In the Create integration pane, enter ServiceNow in Data Sources.
- Choose ServiceNow.
- Choose Next.
- For ServiceNow connection, choose the data connection created in Step 1 (zero_etl_demo_conn).
- For Source IAM role, choose zero_etl_demo_role (from the prerequisites).
- For ServiceNow objects, choose the objects that you want the zero-ETL integration to ingest. For this post, choose the problem and incident objects.
- For Namespace or Database, choose <your database name>. In this example, we use zero_etl_demo_db (from the prerequisites).
- For Target IAM role, choose zero_etl_demo_role (from the prerequisites).
- Choose Next.
- For Security and data encryption, you can choose either an AWS managed KMS key or a customer managed key in AWS Key Management Service (AWS KMS). For this post, choose Use AWS managed KMS key.
- In the Integration details section, for Name, enter zero-etl-demo-integration.
- Choose Next.
- Review the details and choose Create and launch integration.
- The newly created integration will show as Active in about a minute.
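If you script this setup instead of watching the console, you typically poll until the integration reports an active status. The following is a generic polling sketch under that assumption. The get_status callable is a hypothetical hook that you supply, for example a function that looks up the integration status through the AWS SDK. The ACTIVE string mirrors the status shown in the console and may differ from what an API returns.

```python
import time
from typing import Callable


def wait_until_active(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status until it returns ACTIVE or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status().upper() == "ACTIVE":
            return True
        if time.monotonic() >= deadline:
            return False
        # Wait before asking again so the status source is not flooded.
        sleep(poll_s)


# Simulated status sequence so that the sketch runs without AWS access.
statuses = iter(["CREATING", "CREATING", "ACTIVE"])
print(wait_until_active(lambda: next(statuses), sleep=lambda _: None))
```

Injecting the sleep function keeps the example fast to run and simple to test. In real use, the default time.sleep applies.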
Step 3: Verify the initial SEED load
The SEED load refers to the initial loading of the tables that you want to ingest into Amazon SageMaker Lakehouse using zero-ETL integration. The status and statistics of the SEED load are published to CloudWatch, and the data ingested by the zero-ETL integration can be accessed through AWS services such as Amazon SageMaker Unified Studio, Amazon QuickSight, and others. Use the following steps to access zero-ETL integration logs and query the data:
- Open the AWS Glue console.
- In the navigation pane, choose Zero-ETL integrations.
- In the Zero-ETL integrations section, choose zero-etl-demo-integration.
- In the Activity summary (all time) section, choose CloudWatch logs.
- Check CloudWatch log events for the SEED Load. For each table ingested by the zero-ETL integration, two groups of logs are created: status and statistics. Highlighted in the following screenshot in IngestionTableStatistics are the statistics. The insertCount represents how many rows were extracted and loaded by zero-ETL integration. For the SEED load, you will always see only insertCount because it’s the initial load. In addition, in IngestionCompleted you will find information about the Zero-ETL integration such as status, load type, and message.
To validate the SEED load, query the data using Amazon SageMaker Unified Studio.
- Access Amazon SageMaker Unified Studio for your specific domain through the AWS Management Console.
- Open the Amazon SageMaker Unified Studio URL.
- Sign in with SSO or as an AWS IAM user.
- Select your project.
- Go to Data from the left menu, expand the Lakehouse AWSDataCatalog, expand your database, and select the incident table. Choose the ⋮ icon and select Query with Athena.
- For Query, enter the following statement:
- Choose Run.
- Let’s check an existing incident in ServiceNow. This is the incident whose description you will update in ServiceNow to validate change data capture (CDC). In the query editor pane, for Query, enter the following statement:
- Choose Run.
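The validation queries in these steps can be sketched as follows. This is an illustrative example, not the exact statements from the walkthrough: it assumes the database zero_etl_demo_db from the prerequisites, the replicated incident table, and a number column that holds the ServiceNow incident number. The incident INC0000003 is the one used later in this post. The helper functions only build SQL text that you can paste into the query editor.

```python
import re

# Accept only letters, digits, and underscores in values placed into SQL.
_SAFE = re.compile(r"[A-Za-z0-9_]+")


def _checked(value: str) -> str:
    """Reject values that could change the meaning of the generated SQL."""
    if not _SAFE.fullmatch(value):
        raise ValueError(f"unexpected characters in {value!r}")
    return value


def count_rows_sql(database: str, table: str) -> str:
    """Count the rows that the SEED load wrote into one table."""
    return f'SELECT COUNT(*) AS row_count FROM "{_checked(database)}"."{_checked(table)}"'


def incident_sql(database: str, incident_number: str) -> str:
    """Fetch the fields that the CDC test later changes for one incident."""
    return (
        "SELECT number, short_description, description "
        f'FROM "{_checked(database)}"."incident" '
        f"WHERE number = '{_checked(incident_number)}'"
    )


print(count_rows_sql("zero_etl_demo_db", "incident"))
print(incident_sql("zero_etl_demo_db", "INC0000003"))
```

Comparing the row count with the insertCount reported in CloudWatch is one way to confirm that the SEED load is complete.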
Step 4: Validate CDC
Change data capture (CDC) is a technique used to identify and process only the data that has changed in a source system since the last extraction. Instead of reloading an entire dataset, CDC captures and transfers only the new, updated, or deleted records into the target system, making data processing more efficient and reducing load times. The status and statistics of the CDC load are published to CloudWatch. For this post, you will use Amazon SageMaker Unified Studio to query the ingested data. Use the following steps to access zero-ETL integration logs and query the ingested data. For the next step in this example, you will select an incident and update it in ServiceNow, changing the short_description and description of the incident.
- To demonstrate a CDC event, this post edits one incident and deletes one incident in ServiceNow.
- Open the AWS Glue console.
- In the navigation pane, under Data catalog, choose Zero-ETL integrations.
- In the Zero-ETL integrations section, choose zero-etl-demo-integration.
- In the Activity summary (all time) section, choose CloudWatch logs.
- Zero-ETL integration replicates the changes to the Amazon S3 transactional data lake every 60 minutes by default. Check CloudWatch log events for the CDC load. Shown in the following figure in IngestionTableStatistics, review updateCount and deleteCount for each specific object managed by zero-ETL integration. It’s applying the updates and deletes that happened in ServiceNow to the transactional data lake.
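When many objects are replicated, it can help to total these counters per table. The following sketch assumes that you have already parsed the IngestionTableStatistics log events into Python dictionaries. The insertCount, updateCount, and deleteCount names come from the log output described in this post, while the tableName key and the overall record shape are assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable

COUNT_FIELDS = ("insertCount", "updateCount", "deleteCount")


def summarize(events: Iterable[Dict[str, Any]]) -> Dict[str, Counter]:
    """Total the insert, update, and delete counters for each table."""
    totals: Dict[str, Counter] = {}
    for event in events:
        # "tableName" is an assumed key; adjust it to the real log format.
        table = event.get("tableName", "unknown")
        counts = totals.setdefault(table, Counter())
        for field in COUNT_FIELDS:
            counts[field] += int(event.get(field, 0))
    return totals


# Made-up sample records, used only to show the shape of the result.
sample = [
    {"tableName": "incident", "insertCount": 10},
    {"tableName": "incident", "updateCount": 1, "deleteCount": 1},
]
print(dict(summarize(sample)["incident"]))
```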
To validate the CDC load, query the data using Amazon SageMaker Unified Studio.
- You can go back to Amazon SageMaker Unified Studio.
- For Query, enter the following statement:
- For Query, enter the following statement to record initial snapshot results before CDC:
- Choose Run and confirm that one record was updated in short_description and description attributes.
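One way to confirm the change programmatically is to compare the row captured before the CDC load with the row returned afterward. The following sketch assumes that each row is available as a plain dictionary keyed by column name. The sample values are invented for illustration and are not taken from a real ServiceNow instance.

```python
from typing import Any, Dict, List


def changed_fields(before: Dict[str, Any], after: Dict[str, Any]) -> List[str]:
    """Return the sorted names of columns whose values differ between two rows."""
    columns = set(before) | set(after)
    return sorted(c for c in columns if before.get(c) != after.get(c))


# Invented sample rows for the incident edited in ServiceNow.
before_row = {
    "number": "INC0000003",
    "short_description": "old summary",
    "description": "old details",
}
after_row = {
    "number": "INC0000003",
    "short_description": "new summary",
    "description": "new details",
}
print(changed_fields(before_row, after_row))
```

For the update described in this step, the result should list only description and short_description.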
By following these steps, you can set up, build, and verify a zero-ETL integration using the new AWS Glue application connector for ServiceNow. This process demonstrates the simplicity and efficiency of the zero-ETL approach in integrating application data into your AWS environment.
Apache Iceberg Time Travel: Enhancing data versioning in zero-ETL
One of the benefits of using Apache Iceberg in zero-ETL integration is the ability to perform time travel. This feature allows you to access and query historical versions of your data. With Iceberg time travel, you can roll back to previous data states, compare data across different points in time, or recover from accidental data changes. In the context of zero-ETL integrations, this capability becomes particularly valuable when dealing with rapidly changing application data.
To demonstrate this feature, let’s consider a scenario where you’re analyzing ServiceNow incident data ingested through zero-ETL integration using Amazon SageMaker Unified Studio. Here’s an example query that showcases Iceberg time travel:
In this example:
- The first query uses the FOR TIMESTAMP AS OF clause for time travel queries on Iceberg tables. It retrieves incident data as it existed before the CDC update for the specific incident number INC0000003.
- The second query fetches the current state of the data for the same incident number.
This capability allows you to track the evolution of incidents, identify trends in resolution times, or recover information that may have been inadvertently altered.
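The two queries described above might look like the following sketch, which builds the SQL text for Athena. It assumes the database zero_etl_demo_db, the incident table, and a number column that holds the ServiceNow incident number. The timestamp is a placeholder that you would replace with a point in time before your CDC update.

```python
from datetime import datetime, timezone

COLUMNS = "number, short_description, description"


def current_sql(database: str, table: str, number: str) -> str:
    """Query the current state of one incident."""
    return f"SELECT {COLUMNS} FROM \"{database}\".\"{table}\" WHERE number = '{number}'"


def time_travel_sql(database: str, table: str, number: str, as_of: datetime) -> str:
    """Query one incident as it existed at a past point in time."""
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    # Render the timestamp in UTC in the literal form used for time travel.
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT {COLUMNS} FROM \"{database}\".\"{table}\" "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC' "
        f"WHERE number = '{number}'"
    )


# Placeholder timestamp; choose one that precedes your CDC update.
before_cdc = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(time_travel_sql("zero_etl_demo_db", "incident", "INC0000003", before_cdc))
print(current_sql("zero_etl_demo_db", "incident", "INC0000003"))
```

These helpers interpolate their arguments directly into the SQL text, so use them only with trusted values.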
Clean up
To avoid incurring future charges, clean up the resources used in this post from your AWS account by completing the following steps:
- Delete the zero-ETL integration zero-etl-demo-integration.
- Delete the content from the S3 bucket zero-etl-demo-<your AWS Account Number>-<AWS Region>.
- Delete the Data Catalog database zero_etl_demo_db.
- Delete the Data Catalog connection zero_etl_demo_conn.
- Delete the AWS Secrets Manager secret zero_etl_demo_secret.
Conclusion
As the pace of business continues to accelerate, the ability to quickly and efficiently integrate data from various applications and enterprise platforms has become a critical competitive advantage. By adopting zero-ETL integration powered by AWS Glue and its new set of managed connectors, your organization can unlock the full potential of its data across multiple platforms faster and stay ahead of the curve.
To learn more about how Amazon SageMaker Lakehouse can help your organization streamline its data integration efforts, visit Amazon SageMaker Lakehouse.
Get started with zero-ETL on AWS by creating a free account today!
About the authors
Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure and high-performance data solutions in the cloud.
Vivek Pinyani is a Data Architect at AWS Professional Services with expertise in Big Data technologies. He focuses on helping customers build robust and performant Data Analytics solutions and Data Lake migrations. In his free time, he loves to spend time with his family and enjoys playing cricket and running.
Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Extending beyond his professional pursuits, he finds joy and fulfillment in the world of running and hiking. Having already completed multiple marathons, he is currently preparing for his next marathon challenge.
Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according to customer needs. His vast experience has helped customers in different industries such as life sciences and healthcare, retail, banking, and aviation build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking and loves to spend time with his family.
Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!