Amazon SageMaker with TensorBoard: An overview of a hosted TensorBoard experience

Today, data scientists who are training deep learning models need to identify and remediate model training issues to meet accuracy targets for production deployment, and require a way to utilize standard tools for debugging model training. Among the data scientist community, TensorBoard is a popular toolkit that allows data scientists to visualize and analyze various aspects of their machine learning (ML) models and training processes. It provides a suite of tools for visualizing training metrics, examining model architectures, exploring embeddings, and more. TensorFlow and PyTorch projects both endorse and use TensorBoard in their oﬃcial documentation and examples.

Amazon SageMaker with TensorBoard is a capability that brings the visualization tools of TensorBoard to SageMaker. Integrated with SageMaker training jobs and domains, it provides SageMaker domain users access to the TensorBoard data and helps domain users perform model debugging tasks using the SageMaker TensorBoard visualization plugins. When they create a SageMaker training job, domain users can use TensorBoard using the SageMaker Python SDK or Boto3 API. SageMaker with TensorBoard is supported by the SageMaker Data Manager plugin, with which domain users can access many training jobs in one place within the TensorBoard application.

In this post, we demonstrate how to set up a training job with TensorBoard in SageMaker using the SageMaker Python SDK, access SageMaker TensorBoard, explore training output data visualized in TensorBoard, and delete unused TensorBoard applications.

Solution overview

A typical training job for deep learning in SageMaker consists of two main steps: preparing a training script and conﬁguring a SageMaker training job launcher. In this post, we walk you through the required changes to collect TensorBoard-compatible data from SageMaker training.

Prerequisites

To start using SageMaker with TensorBoard, you need to set up a SageMaker domain with an Amazon VPC under an AWS account. Domain user proﬁles for each individual user are required to access the TensorBoard on SageMaker, and the AWS Identity and Access Management (IAM) execution role needs a minimum set of permissions, including the following:

sagemaker:CreateApp
sagemaker:DeleteApp
sagemaker:DescribeTrainingJob
sagemaker:Search
s3:GetObject
s3:ListBucket

For more information on how to set up SageMaker Domain and user proﬁles, see Onboard to Amazon SageMaker Domain Using Quick setup and Add and Remove User Proﬁles .

Directory structure

When using Amazon SageMaker Studio, the directory structure can be organized as follows:

.
├── script
│	└── train.py
└── simple_tensorboard.ipynb

Here, script/train.py is your training script, and simple_tensorboard.ipynb launches the SageMaker training job.

Modify your training script

You can use any of the following tools to collect tensors and scalars: TensorBoardX, TensorFlow Summary Writer, PyTorch Summary Writer, or Amazon SageMaker Debugger, and specify the data output path as the log directory in the training container (log_dir). In this sample code, we use TensorFlow to train a simple, fully connected neural network for a classiﬁcation task. For other options, refer to Prepare a training job with a TensorBoard output data configuration. In the train() function, we use the tensorflow.keras.callbacks.TensorBoard tool to collect tensors and scalars, specify /opt/ml/output/tensorboard as the log directory in the training container, and pass it to model training callbacks argument. See the following code:

import argparse
import json
import tensorflow as tf


def parse_args():
    cmdline = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    cmdline.add_argument("--epochs", default=5, type=int, help="""Number of epochs.""")
    cmdline.add_argument(
        "--optimizer", default="adam", type=str, help="""Optimizer type"""
    )
    cmdline.add_argument(
        "--loss",
        default="sparse_categorical_crossentropy",
        type=str,
        help="""Optimizer type""",
    )
    cmdline.add_argument(
        "--metrics",
        action="store",
        dest="metrics",
        type=json.loads,
        default="['accuracy']",
        help="List of metrics to be evaluated by the model during training and testing.",
    )
    return cmdline


def create_model():
    return tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10, activation="softmax"),
        ]
    )


def train(args):
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = create_model()
    model.compile(optimizer=args.optimizer, loss=args.loss, metrics=args.metrics)

    # setup TensorBoard Callback
    LOG_DIR = "/opt/ml/output/tensorboard"
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=LOG_DIR,
        histogram_freq=1,
        update_freq=1,
        embeddings_freq=5,
        write_images=True,
    )

    # pass TensorBoard Callback into the Model fit
    model.fit(
        x=x_train,
        y=y_train,
        epochs=args.epochs,
        validation_data=(x_test, y_test),
        callbacks=[tensorboard_callback],
    )


if __name__ == "__main__":
    cmdline = parse_args()
    args, unknown_args = cmdline.parse_known_args()
    train(args)

Construct a SageMaker training launcher with a TensorBoard data conﬁguration

Use sagemaker.debugger.TensorBoardOutputConfig while conﬁguring a SageMaker framework estimator, which maps the Amazon Simple Storage Service (Amazon S3) bucket you specify for saving TensorBoard data with the local path in the training container (for example, /opt/ml/output/tensorboard). You can use a diﬀerent container local output path. However, it must be consistent with the value of the LOG_DIR variable, as speciﬁed in the previous step, to have SageMaker successfully search the local path in the training container and save the TensorBoard data to the S3 output bucket.

Next, pass the object of the module to the tensorboard_output_config parameter of the estimator class. The following code snippet shows an example of preparing a TensorFlow estimator with the TensorBoard output conﬁguration parameter.

The following is the boilerplate code:

import os
from datetime import datetime
import boto3
import sagemaker

time_str = datetime.now().strftime("%d-%m-%Y-%H-%M-%S")

region = boto3.session.Session().region_name
boto_sess = boto3.Session()
role = sagemaker.get_execution_role()
sm = sagemaker.Session()

base_job_name = "simple-tensorboard"
date_str = datetime.now().strftime("%d-%m-%Y")
time_str = datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
job_name = f"{base_job_name}-{time_str}"

s3_output_bucket = os.path.join("s3://", sm.default_bucket(), base_job_name)

output_path = os.path.join(s3_output_bucket, "sagemaker-output", date_str, job_name)
code_location = os.path.join(s3_output_bucket, "sagemaker-code", date_str, job_name)

The following code is for the training container:

instance_type = "ml.c5.xlarge"
instance_count = 1

image_uri = sagemaker.image_uris.retrieve(
    framework="tensorflow",
    region=region,
    version="2.11",
    py_version="py39",
    image_scope="training",
    instance_type=instance_type,
)

The following code is the TensorBoard configuration:

from sagemaker.tensorflow import TensorFlow

tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=os.path.join(output_path, "tensorboard"),
    container_local_output_path="/opt/ml/output/tensorboard",
)

hyperparameters = {
    "epochs": 5,
    "optimizer": "adam",
    "loss": "sparse_categorical_crossentropy",
    "metrics": "'[\"accuracy\"]'",
}

estimator = TensorFlow(
    entry_point="train.py",
    source_dir="script",
    role=role,
    image_uri=image_uri,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    base_job_name=job_name,
    tensorboard_output_config=tensorboard_output_config,
    hyperparameters=hyperparameters,
)

Launch the training job with the following code:

estimator.fit(
    inputs=None,
    wait=False,
    job_name=job_name,
)

Access TensorBoard on SageMaker

You can access TensorBoard with two methods: programmatically using the sagemaker.interactive_apps.tensorboard module that generates the URL or using the TensorBoard landing page on the SageMaker console. After you open TensorBoard, SageMaker runs the TensorBoard plugin and automatically ﬁnds and loads all training job output data in a TensorBoard-compatible ﬁle format from S3 buckets paired with training jobs during or after training.

The following code autogenerates the URL to the TensorBoard console landing page:

from sagemaker.interactive_apps import tensorboard
app = tensorboard.TensorBoardApp(region)

print("Navigate to the following URL:")
if app._is_studio_user:
    print(app.get_app_url(job_name))
else:
    print(app.get_app_url())

This returns the following message with a URL that opens the TensorBoard landing page.

>>> Navigate to the following URL: https://<sagemaker_domain_id>.studio.<region>.sagemaker.aws/tensorboard/default/data/plugin/sa

For opening TensorBoard from the SageMaker console, please refer to How to access TensorBoard on SageMaker.

When you open the TensorBoard application, TensorBoard opens with the SageMaker Data Manager tab. The following screenshot shows the full view of the SageMaker Data Manager tab in the TensorBoard application.

On the SageMaker Data Manager tab, you can select any training job and load TensorBoard-compatible training output data from Amazon S3.

In the Add Training Job section, use the check boxes to choose training jobs from which you want to pull data and visualize for debugging.
Choose Add Selected Jobs.

The selected jobs should appear in the Tracked Training Jobs section.

Refresh the viewer by choosing the refresh icon in the upper-right corner, and the visualization tabs should appear after the job data is successfully loaded.

Explore training output data visualized in TensorBoard

On the Time Series tab and other graphics-based tabs, you can see the list of Tracked Training Jobs in the left pane. You can also use the check boxes of the training jobs to show or hide visualizations. The TensorBoard dynamic plugins are activated dynamically depending on how you have set your training script to include summary writers and pass callbacks for tensor and scalar collection, and the graphics tabs also appear dynamically. The following screenshots show example views of each tab with visualizations of the collected metrics of two training jobs. The metrices include time series, scalar, graph, distribution, and histogram plugins.

The following screenshot is the Time Series tab view.

The following screenshot is the Scalars tab view.

The following screenshot is the Graphs tab view.

The following screenshot is the Distributions tab view.

The following screenshot is the Histograms tab view.

Clean up

After you are done with monitoring and experimenting with jobs in TensorBoard, shut the TensorBoard application down:

On the SageMaker console, choose Domains in the navigation pane.
Choose your domain.
Choose your user proﬁle.
Under Apps, choose Delete App for the TensorBoard row.
Choose Yes, delete app.
Enter delete in the text box, then choose Delete.

A message should appear at the top of the page: “Default is being deleted”.

Conclusion

TensorBoard is a powerful tool for visualizing, analyzing, and debugging deep learning models. In this post, we provide a guide to using SageMaker with TensorBoard, including how to set up TensorBoard in a SageMaker training job using the SageMaker Python SDK, access SageMaker TensorBoard, explore training output data visualized in TensorBoard, and delete unused TensorBoard applications. By following these steps, you can start using TensorBoard in SageMaker for your work.

We encourage you to experiment with diﬀerent features and techniques.

About the authors

Dr. Baichuan Sun is a Senior Data Scientist at AWS AI/ML. He is passionate about solving strategic business problems with customers using data-driven methodology on the cloud, and he has been leading projects in challenging areas including robotics computer vision, time series forecasting, price optimization, predictive maintenance, pharmaceutical development, product recommendation system, etc. In his spare time he enjoys traveling and hanging out with family.

Manoj Ravi is a Senior Product Manager for Amazon SageMaker. He is passionate about building next-gen AI products and works on software and tools to make large-scale machine learning easier for customers. He holds an MBA from Haas School of Business and a Masters in Information Systems Management from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

AWS Machine Learning Blog