How to Use Amazon SageMaker Pipelines MLOps with Gretel Synthetic Data

By Maarten Van Segbroeck, Principal Scientist – Gretel
By Ben McCown, Sr. Software Engineer – Gretel
By Johnny Greco, Sr. Applied Scientist – Gretel
By Qiong Zhang and Michael Tindal – AWS

Collecting large volumes of high-quality labeled data can be challenging due to cost, time, and privacy concerns. Gretel’s synthetic data platform addresses these issues, and it plays an important role in machine learning operations (MLOps), especially as privacy laws tighten and resources are constrained.

Gartner forecasts that synthetic data will dominate artificial intelligence (AI) model development by 2030. Gretel’s synthetic data solution, combined with Amazon SageMaker Pipelines, empowers data scientists and ML engineers to deal with data scarcity and complex workflows. It also guides ML leaders in adopting AI responsibly within their organization.

This post discusses how to integrate Gretel with Amazon SageMaker Pipelines to enhance ML training, prioritizing privacy and safety. SageMaker Pipelines streamlines all ML stages, from data pre-processing to model deployment.

The Gretel MLOps library’s source code showcases this integration, enabling training on synthetic data or augmenting real data with synthetic data to accelerate the ML model production process.

Gretel is an AWS Partner and AWS Marketplace Seller that enables the development of domain-specific AI models for creating data that mirrors, boosts, or simulates real-world data without the privacy concerns.

Benefits of Synthetic Data in Machine Learning

Synthetic data is artificially generated data that mimics the statistical characteristics of real-world data. It has several benefits for MLOps:

Privacy protection: Synthetic data contains no real user information. This protects individuals’ privacy and helps organizations comply with data privacy regulations like GDPR and HIPAA.
Data availability: Synthetic data models support quick generation of large datasets, which helps deal with scarce or incomplete real data.
Bias mitigation: Synthetic data can help reduce biases present in real data, for example by generating additional records for underrepresented classes.
Cost efficiency: Generating synthetic data can cost less than gathering and labeling new real data.

Gretel’s Deployment Modes

Gretel provides two deployment options: Gretel Cloud, a hassle-free software-as-a-service (SaaS) solution requiring no deployment effort, and Gretel Hybrid, which integrates into your cloud environment.

Gretel Cloud is a comprehensive, fully managed service for synthetic data generation. It operates within Gretel’s cloud compute infrastructure, and handles all aspects of compute, automation, and scalability. It provides a seamless solution that simplifies the technical demands of setting up your cloud infrastructure.

Gretel Hybrid functions within your AWS environment using Amazon Elastic Kubernetes Service (Amazon EKS) and ensures your data remains within your AWS account. It interfaces with the Gretel API only for job scheduling and metadata, and is particularly well-suited for handling sensitive or regulated data that must stay within your cloud tenant’s boundaries.

Gretel Hybrid combines the benefits of using your infrastructure for training synthetic data models with Gretel’s advanced tools, offering a balance of control and convenience.

A high-level architecture diagram for Gretel Hybrid is shown below; the Gretel Hybrid documentation provides comprehensive information. To deploy Gretel Hybrid, follow the instructions in the blog post on generating synthetic data using Gretel Hybrid.

Figure 1 – High-level architecture of the Gretel Hybrid deployment in AWS.

Solution Overview

The diagram below illustrates the Amazon SageMaker Pipeline process. Gretel’s synthetic data generation follows the data preparation phase, and this synthetic data is utilized in the training phase of the ML model.

Figure 2 – MLOps workflow with SageMaker Pipelines and Gretel.

Prerequisites

Integrate Gretel with Amazon SageMaker Pipelines

To follow along, open run_pipeline.ipynb from the Gretel MLOps library in Amazon SageMaker Studio.

Step 1: Set Up Your AWS Environment

First, retrieve your Gretel API key and store it in AWS Secrets Manager by following Step 2 – Create Secret for the Gretel API key.

The SageMaker IAM role must have the AmazonSageMakerFullAccess permission policy attached. Additionally, the role needs the SecretsManagerReadWrite policy for SageMaker to access AWS Secrets Manager for the Gretel API key.
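As a minimal sketch of how a notebook cell or pipeline step might read the stored key, the helper below wraps the Secrets Manager get_secret_value call. The function name read_gretel_api_key, the optional JSON field, and the secret name in the usage note are illustrative assumptions and not part of the Gretel MLOps library. The client is passed in as an argument so the helper can be tried without AWS credentials.

```python
import json


def read_gretel_api_key(secrets_client, secret_id, json_field=None):
    """Return the Gretel API key stored in AWS Secrets Manager.

    secrets_client: a boto3 Secrets Manager client, for example
        boto3.client("secretsmanager").
    secret_id: name or ARN of the secret (hypothetical in this sketch).
    json_field: set this if the secret was stored as a JSON key/value pair.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = response["SecretString"]
    if json_field is not None:
        # Secrets created as key/value pairs are returned as a JSON string.
        return json.loads(secret)[json_field]
    return secret
```

In SageMaker Studio, a call might look like read_gretel_api_key(boto3.client("secretsmanager"), "my-gretel-api-key"), where the secret name is a placeholder for whatever name you chose when creating the secret.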

Step 2: Configure the SageMaker Pipeline

In run_pipeline.ipynb, install the Python package from the Gretel MLOps library by running the following command:

!pip install git+https://github.com/gretelai/gretel-mlops.git

The installed pipeline package is versatile enough to handle many datasets and to optimize for standard classification or regression ML metrics. To customize the pipeline, supply a YAML configuration file with three sections: dataset, ML, and gretel.

Example MLOps configuration files are available for multiple datasets. The example below uses a healthcare dataset:

dataset:
  name: healthcare-stroke-data
  train_path: 's3://gretel-datasets/ml_ops/stroke/train.csv'
  validation_path: null
  test_path: null
  target_column: stroke
  drop_columns: id
ML:
  ml_task: classification
  objective: 'binary:logistic'
  objective_type: Maximize
  ml_eval_metric: f1
  ml_deployment_threshold: 0.6
gretel:
  strategy: balance
  generate_factor: 1
  mode: cloud
  sink_bucket: null

The ML section defines the machine learning task as classification or regression and sets the optimization function’s objective. Choose an evaluation metric to maximize or minimize. Set a deployment threshold to ensure only adequately performing models are registered in the SageMaker Model Registry.

The gretel section controls how synthetic data is used. Set strategy to choose how synthetic data is combined with the training data, and set generate_factor to control how much synthetic data is generated. Set mode to cloud to train in Gretel Cloud, or to hybrid to train in your own AWS environment with Gretel Hybrid. For hybrid mode, you must also provide the sink_bucket configured in your Gretel Hybrid deployment.
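Before starting a pipeline run, it can help to sanity-check the configuration. The sketch below is not part of the Gretel MLOps library: validate_config is a hypothetical helper, and the allowed values are inferred from this post (the lowercase strategy names replace and augment are assumptions based on the balance value in the example). It works on the dictionary a YAML parser would return for the configuration file.

```python
# Allowed values inferred from this post; treat them as assumptions.
ALLOWED = {
    ("ML", "ml_task"): {"classification", "regression"},
    ("ML", "objective_type"): {"Maximize", "Minimize"},
    ("gretel", "strategy"): {"replace", "augment", "balance"},
    ("gretel", "mode"): {"cloud", "hybrid"},
}


def validate_config(config):
    """Return a list of problems found in a parsed pipeline configuration."""
    problems = []
    for section in ("dataset", "ML", "gretel"):
        if section not in config:
            problems.append(f"missing section: {section}")
    for (section, key), allowed in ALLOWED.items():
        value = config.get(section, {}).get(key)
        if value not in allowed:
            problems.append(f"{section}.{key}={value!r} not in {sorted(allowed)}")
    gretel = config.get("gretel", {})
    # Hybrid training needs the sink bucket from the Gretel Hybrid deployment.
    if gretel.get("mode") == "hybrid" and not gretel.get("sink_bucket"):
        problems.append("gretel.sink_bucket is required when mode is hybrid")
    return problems


# The healthcare example above, written as the parsed dictionary.
config = {
    "dataset": {
        "name": "healthcare-stroke-data",
        "train_path": "s3://gretel-datasets/ml_ops/stroke/train.csv",
        "validation_path": None,
        "test_path": None,
        "target_column": "stroke",
        "drop_columns": "id",
    },
    "ML": {
        "ml_task": "classification",
        "objective": "binary:logistic",
        "objective_type": "Maximize",
        "ml_eval_metric": "f1",
        "ml_deployment_threshold": 0.6,
    },
    "gretel": {
        "strategy": "balance",
        "generate_factor": 1,
        "mode": "cloud",
        "sink_bucket": None,
    },
}

# An empty list means no problems were found.
print(validate_config(config))
```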

Step 3: Define and Run the SageMaker Pipeline

The following code block is used to define the pipeline:

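from gretel_mlops.aws.sagemaker.pipeline import get_pipeline

# region, role, default_bucket, and config (the parsed YAML configuration)
# are expected to be defined in earlier notebook cells.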

model_package_group_name = f"GretelModelPackageGroup-{config['dataset']['name']}"
pipeline_name = f"GretelPipeline-{config['dataset']['name']}"
print(f"Initiating {pipeline_name}")

pipeline = get_pipeline(
    region=region,
    role=role,
    default_bucket=default_bucket,
    model_package_group_name=model_package_group_name,
    pipeline_name=pipeline_name,
    config=config,
)

Once defined, create or update the pipeline and start the pipeline execution:

pipeline.upsert(role_arn=role)
train_execution = pipeline.start()

Check the status of your pipeline execution using a workflow graph. Choose Home > Pipelines > your pipeline name > Graph tab to display the workflow graph. The graph created by the Gretel MLOps library is shown below.

Figure 3 – SageMaker Pipeline workflow graph using Gretel.
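The same status information can also be read programmatically. The sketch below assumes the object returned by pipeline.start() exposes describe() and list_steps(), as pipeline executions in the SageMaker Python SDK do; summarize_execution is a hypothetical helper name introduced here for illustration.

```python
def summarize_execution(execution):
    """Return the overall status and per-step statuses of a pipeline execution.

    execution: the object returned by pipeline.start().
    """
    # describe() returns the execution description, including its overall status.
    overall = execution.describe()["PipelineExecutionStatus"]
    # list_steps() returns one entry per step that has started.
    steps = {s["StepName"]: s["StepStatus"] for s in execution.list_steps()}
    return overall, steps
```

A typical use would be to call train_execution.wait() to block until the run finishes, then summarize_execution(train_execution) to see which steps succeeded or failed.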

Step 4: Evaluate the Final Results

After the pipeline is successfully executed, retrieve the ML evaluation report. It shows machine learning metrics evaluated on the test set.

The following code block shows how to retrieve the evaluation report:

import json

import boto3

# Build the S3 path of the evaluation report from the evaluation step's output
s3_client = boto3.client('s3')
s3_path_report = f"{pipeline.steps[3].arguments['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']}/evaluation.json"

# Split the S3 URI into bucket name and object key
bucket_name, file_key = s3_path_report.replace("s3://", "").split('/', 1)

# Fetch the file from S3
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
content = response['Body'].read()

# Parse the JSON content
data = json.loads(content)

# Pretty print the JSON data
print(json.dumps(data, indent=4))

An example evaluation report is shown below:

{
  "metrics": {
    "auc": {
      "value": 0.7368312757201646
    },
    "aucpr": {
      "value": 0.12997917776590717
    },
    "precision": {
      "value": 0.136986301369863
    },
    "recall": {
      "value": 0.4
    },
    "f1": {
      "value": 0.2040816326530612
    },
    "confusion_matrix": {
      "value": [
        [846,126],
        [ 30, 20]
      ]
    }
  }
}
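To see how the reported numbers relate, the sketch below recomputes precision, recall, and f1 from the confusion matrix in the example report. It assumes the matrix rows are actual classes and the columns are predicted classes, with the positive class (stroke) second. That layout is an assumption, but it is consistent with the reported values.

```python
def metrics_from_confusion_matrix(matrix):
    """Compute precision, recall, and f1 for the positive class.

    matrix: [[TN, FP], [FN, TP]] (rows = actual, columns = predicted).
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


report_matrix = [[846, 126], [30, 20]]
metrics = metrics_from_confusion_matrix(report_matrix)
print(metrics)
```

With 20 true positives, 126 false positives, and 30 false negatives, this gives a precision of 20/146, a recall of 20/50, and an f1 of 40/196, which match the report values up to floating-point rounding.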

Deep Dive on Individual Steps

Preprocessing Step

This step from the Gretel MLOps library prepares the data for subsequent phases and includes:

Data preparation: Loads data from configuration files and identifies feature and target columns. It also eliminates excluded columns.
Data splitting and preprocessing: Splits data into training, validation, and test sets based on available paths. Note that the training data, intended for the Gretel phase, requires no transformation. It’s saved as a Gretel training source file.
Data transformation: For the downstream ML model, numeric data undergoes imputation and scaling, while categorical data is subjected to imputation and one-hot encoding. A preprocessing model is fitted to the training data, applying these transformations across all datasets.
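The key idea in the transformation step is that parameters are learned from the training data only and then reused for every split. The dependency-free sketch below illustrates that idea. It is not the library's implementation, and the choice of mean imputation, most-frequent imputation, and standard scaling, as well as the column names age and smoker, are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_preprocessor(rows, numeric_cols, categorical_cols):
    """Learn imputation, scaling, and encoding parameters from training rows."""
    params = {"numeric": {}, "categorical": {}}
    for col in numeric_cols:
        values = [r[col] for r in rows if r[col] is not None]
        # Fall back to 1.0 when there is no spread, to avoid dividing by zero.
        params["numeric"][col] = (mean(values), pstdev(values) or 1.0)
    for col in categorical_cols:
        values = [r[col] for r in rows if r[col] is not None]
        categories = sorted(set(values))
        most_common = max(categories, key=values.count)
        params["categorical"][col] = (most_common, categories)
    return params


def transform(rows, params):
    """Apply the fitted parameters to any split (train, validation, or test)."""
    out = []
    for r in rows:
        features = []
        for col, (mu, sigma) in params["numeric"].items():
            value = mu if r[col] is None else r[col]    # mean imputation
            features.append((value - mu) / sigma)       # standard scaling
        for col, (fill, categories) in params["categorical"].items():
            value = fill if r[col] is None else r[col]  # most-frequent imputation
            features.extend(1.0 if value == c else 0.0 for c in categories)  # one-hot
        out.append(features)
    return out


train = [
    {"age": 40, "smoker": "no"},
    {"age": 60, "smoker": "yes"},
    {"age": None, "smoker": "no"},
]
test = [{"age": 70, "smoker": None}]

params = fit_preprocessor(train, ["age"], ["smoker"])
print(transform(test, params))
```

Fitting on the training split alone keeps information from the validation and test sets out of the transformation parameters.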

Gretel Step

This step focuses on training a generative model on the Gretel training source file and involves the following tasks:

Hyperparameter tuning: Uses Gretel Tuner, an optional module of Gretel’s Python SDK, to tune the synthetic model’s parameters. This involves comprehensive parameter sweeps to find the optimal configuration. The tuning is tailored to the downstream ML task, which uses an XGBoost classifier or regressor similar to the final ML application.
Generate and use synthetic data: Generates synthetic data to replace, augment, or balance the training data:
- Replace: Uses solely synthetic data for ML training, ideal for privacy-sensitive scenarios.
- Augment: Enhances real training data with synthetic records, enriching the model’s performance with diverse examples.
- Balance: For classification tasks, creates synthetic data for underrepresented classes to address class imbalance and improve model accuracy and fairness.
Synthetic data quality reports: Gretel provides reports with synthetic data quality metrics that are accessible via:

sqs_scores = f"{pipeline.steps[1].arguments['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']}/report_quality_scores.txt"
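The replace, augment, and balance strategies described in this step can be sketched in a few lines. The two helpers below are hypothetical and only illustrate the idea. How the Gretel MLOps library actually sizes and combines the data (including the role of generate_factor) is not shown in this post, so balancing every class up to the majority class size is an assumption.

```python
from collections import Counter


def records_needed_to_balance(labels):
    """Synthetic records per class needed to match the majority class size."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {label: majority - n for label, n in counts.items() if n < majority}


def assemble_training_data(real, synthetic, strategy):
    """Combine real and synthetic records according to the chosen strategy."""
    if strategy == "replace":
        return list(synthetic)               # synthetic data only
    if strategy in ("augment", "balance"):
        return list(real) + list(synthetic)  # real data plus synthetic records
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical, imbalanced target column: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5
print(records_needed_to_balance(labels))
```

For balance, the synthetic records passed to assemble_training_data would contain only the underrepresented classes, while augment would add synthetic records for all classes.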

Model Training Step

In this step, the focus is on training the downstream ML model:

XGBoost framework: Employs XGBoost for building efficient ML models for classification or regression.
Hyperparameter tuning: Fine-tunes the XGBoost model by finding the hyperparameters that optimize its performance on the validation set. Target metrics depend on the task:
- Classification task: accuracy, f1, auc, aucpr, precision, or recall
- Regression task: mse, rmse, mae, or R2
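For reference, the regression metrics listed above can be computed as in the sketch below, using their standard definitions. The function name and the sample values are illustrative. Note that mse, rmse, and mae are errors to minimize, whereas R2 is a score to maximize, which is what the objective_type setting in the configuration expresses.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute mse, rmse, mae, and R2 from true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    # Total sum of squares; this is zero if all true values are identical.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical values for illustration only.
scores = regression_metrics([3, 5, 7], [2, 5, 9])
print(scores)
```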

Evaluation Step

This step assesses the optimal downstream ML model on the test set and compiles an evaluation report with relevant metrics.

Condition and Model Register Step

The last two steps check whether the trained ML model passes a predefined performance threshold and, if it does, register it in the SageMaker Model Registry.
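The condition can be thought of as a simple comparison between the evaluation metric and ml_deployment_threshold. The sketch below is a hypothetical illustration and not the library's code. In particular, treating the comparison as inclusive and flipping its direction for Minimize objectives are assumptions.

```python
def should_register(metric_value, threshold, objective_type="Maximize"):
    """Decide whether a model's metric clears the deployment threshold."""
    if objective_type == "Maximize":
        return metric_value >= threshold  # higher is better, e.g. f1 or auc
    if objective_type == "Minimize":
        return metric_value <= threshold  # lower is better, e.g. rmse or mae
    raise ValueError(f"unknown objective_type: {objective_type}")


# An f1 of about 0.204 against a threshold of 0.6 would not be registered.
print(should_register(0.2040816326530612, 0.6))
```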

Conclusion

The integration of Gretel with Amazon SageMaker Pipelines is a significant advance for machine learning practices. By using Gretel in ML workflows, you can develop more robust machine learning models while supporting compliance with stringent privacy regulations.

The use of Gretel within the AWS framework, especially in a hybrid environment, introduces an extra layer of data privacy protection. This protects sensitive information throughout the entire machine learning lifecycle.

Get started with the Gretel 101 Blueprint to learn the basics of the Gretel SDK and train generative models in Gretel Cloud. For more advanced work, the Gretel Advanced Tabular Blueprint offers customizable model configurations and conditional synthetic data generation. Explore text generation with the Gretel Text Generation Blueprint, which is ideal for fine-tuning large language models. To train synthetic data models in your own cloud environment, check out Gretel Hybrid.

You can find Gretel available on AWS Marketplace.

