Part 1: Introduction to observing machine learning workloads on Amazon EKS
This post was jointly authored by Elamaran Shanmugam (Senior Partner Specialist SA), Sanjeev Ganjihal (Senior Specialist SA), and Steven David (Principal SA).
Introduction
In this first part of a four-part series, titled Observability of MLOps on Amazon EKS, you get an overview of machine learning operations (MLOps) on Amazon Elastic Kubernetes Service (Amazon EKS). This includes understanding the relevant personas, learning the essential metrics, and reviewing best practices to consider for the observability of machine learning (ML) workloads.
MLOps is a set of practices that aim to streamline the deployment, observability, and maintenance of ML models in production environments. In the context of Amazon EKS, MLOps plays a crucial role in the efficient and scalable management of ML workloads. It enables organizations to rapidly iterate on and update their ML models, leading to improved decision-making, enhanced user experiences, and increased operational efficiency. Amazon EKS provides a robust platform for orchestrating containerized applications, including ML models.
MLOps on Amazon EKS involves automating the deployment process, scaling resources based on demand, and monitoring the performance of ML models. This streamlined approach allows organizations to maintain their models’ predictive accuracy, adaptability to shifting data and market landscapes, and real-world effectiveness. This is achieved through continuous monitoring of model performance and data quality, incorporating new data, features, or algorithms, retraining models to address concept drift, and maintaining fairness, lack of bias, and transparency in model decision-making.
However, operating ML models at scale presents several challenges:
- Model drift: as real-world data changes over time, models can become less accurate, necessitating frequent retraining and updates.
- Resource management: ML workloads often have varying computational demands, making efficient resource allocation crucial.
- Data quality and integrity: maintaining the consistency and reliability of input data is essential for sustaining model performance.
- Compliance and governance: meeting regulatory and governance requirements is challenging for ML systems.
- Versioning and reproducibility: keeping track of model versions, datasets, and experiments becomes increasingly difficult at scale.
Observability is a critical aspect of MLOps that addresses these challenges. It enables teams to gain insights into the behavior and performance of their ML models, along with the underlying infrastructure in production. Reviewing metrics, logs, and traces allows teams to:
- Identify and diagnose issues quickly.
- Optimize resource usage.
- Maintain compliance with regulatory requirements.
- Monitor model performance.
- Detect drifts.
- Track data quality and integrity.
Furthermore, a feedback loop is essential for operating ML models and workloads at scale. This loop involves continuously monitoring model performance, collecting data from production environments, and using this data to retrain and improve the models. Incorporating feedback from the observability data gathered for a workload allows organizations to make sure that their ML models remain accurate and effective across various artificial intelligence (AI) and ML use cases. Over time, this feedback also enables models to adapt to changing conditions, such as shifts in user behavior, new data patterns, or emerging trends, while maintaining their predictive power and relevance. As a result, organizations can enhance the overall effectiveness of their ML workloads by driving better decision making, improving user experiences, and increasing business value.
Future posts in this series provide step-by-step guidance covering end-to-end observability for ML infrastructure on Amazon EKS. You can learn the essential aspects of monitoring ML workloads and the fundamental differences between monitoring ML and non-ML workloads. The following outline shows what is covered in each post of this four-part series:
- Introduction to observability of MLOps on Amazon EKS
- Establishing observability for MLOps infrastructure
- Observing ML models to allow for model insights
- Getting insights into MLOps costs
For this series, the following MLOps flow is observed:

Figure 1: MLOps flow diagram
Essential aspects of monitoring ML workloads on Amazon EKS
Monitoring ML workloads on Amazon EKS is necessary for maintaining the health, performance, and reliability of ML models in production. It involves collecting and analyzing metrics, logs, and traces to gain visibility into the behavior of ML models and the underlying infrastructure.
When monitoring ML workloads on Amazon EKS, consider the following key aspects:
- Resource usage
- Latency
- Throughput
- Model performance metrics
- Data quality and drift
- Error rates and failures
We review the metrics that are most critical for your MLOps. These are often called your workload’s golden metrics.
Monitoring the resource usage of ML workloads, such as CPU, memory, and Graphics Processing Unit (GPU) usage, is part of these golden metrics. This is especially true for GPU usage, because many ML models rely heavily on GPUs to accelerate complex computations and handle large datasets. Monitoring GPU usage (illustrated in the sketch after this list) allows you to:
- Confirm that the allocated resources are sufficient to handle the workload.
- Avoid over-allocation, so that resources are not wasted.
- Identify any bottlenecks, contention, or underusage of resources.
- Optimize your ML workloads for better performance and efficiency.
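The following minimal sketch shows one way to check GPU usage across scrape targets. It assumes a Prometheus-compatible endpoint (for example, Amazon Managed Service for Prometheus or a self-managed Prometheus) that scrapes the NVIDIA DCGM exporter; the endpoint URL, the underuse threshold, and the grouping label are illustrative assumptions that depend on your setup.

```python
# Minimal sketch: query average GPU utilization from a Prometheus endpoint
# that scrapes the NVIDIA DCGM exporter. The URL, threshold, and grouping
# label are placeholders -- adjust them for your environment.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint
UNDERUSE_THRESHOLD = 30.0  # percent; below this, a GPU may be over-allocated

def gpu_utilization() -> dict[str, float]:
    """Return average GPU utilization (%) per scrape target over 5 minutes."""
    query = "avg by (instance) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))"
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("instance", "unknown"): float(r["value"][1]) for r in results}

if __name__ == "__main__":
    for target, util in gpu_utilization().items():
        status = "possible over-allocation" if util < UNDERUSE_THRESHOLD else "ok"
        print(f"{target}: {util:.1f}% GPU utilization ({status})")
```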
Latency and throughput of ML models are key metrics to monitor. You can use these metrics to make sure that the models can handle the expected load and provide timely responses. You must also understand the metrics that indicate the performance of ML models, such as accuracy, precision, recall, and F1 score. Track these metrics over time to identify any degradation in model performance.
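As a brief illustration of the model performance metrics named above, the following sketch computes accuracy, precision, recall, and F1 score with scikit-learn for two evaluation windows so they can be compared over time. The labels and predictions are toy data.

```python
# Minimal sketch: compute accuracy, precision, recall, and F1 score with
# scikit-learn so they can be tracked over time and compared across windows.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def performance_metrics(y_true, y_pred) -> dict[str, float]:
    """Return core classification metrics for one evaluation window."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }

# Toy example: compare a baseline window against the current window to spot degradation.
baseline = performance_metrics([1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1])
current = performance_metrics([1, 0, 1, 1, 0, 1], [0, 0, 1, 0, 0, 1])
for name in baseline:
    print(f"{name}: baseline={baseline[name]:.2f} current={current[name]:.2f}")
```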
To maintain the quality and reliability of your ML models, you must track metrics that measure the quality and statistical properties of the input data. Detecting data drift or anomalies that could impact model performance is crucial for maintaining consistent results over time. Moreover, it’s essential to monitor error rates and failures throughout the ML pipeline, including data preprocessing, model training, and inference. By identifying and investigating recurring errors or anomalies, you can maintain a high-quality model and make sure it continues to perform well.
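One common way to implement a simple data drift check is a two-sample statistical test on each numeric input feature. The following sketch uses a Kolmogorov-Smirnov test from SciPy; the p-value threshold and the synthetic reference and production samples are illustrative assumptions, and real pipelines often use richer drift detectors.

```python
# Minimal sketch: flag drift in a numeric feature by comparing its current
# distribution against a reference (training-time) sample with a two-sample
# Kolmogorov-Smirnov test. Threshold and data are illustrative.
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01  # below this, treat the feature as drifted

def feature_drifted(reference: np.ndarray, current: np.ndarray) -> bool:
    """Return True when the current sample looks statistically different."""
    statistic, p_value = ks_2samp(reference, current)
    print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")
    return p_value < P_VALUE_THRESHOLD

rng = np.random.default_rng(seed=42)
reference_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time data
current_sample = rng.normal(loc=0.4, scale=1.2, size=5_000)    # shifted production data
print("drift detected:", feature_drifted(reference_sample, current_sample))
```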
Fundamental differences between monitoring ML and non-ML workloads on Amazon EKS
Monitoring ML workloads differs from monitoring non-ML workloads due to the unique characteristics and challenges associated with ML. Compared to traditional non-ML workloads, ML workloads introduce additional complexities, such as the following:
- Sustained high GPU usage during training, contrasting with the more consistent resource consumption of traditional applications.
- Specialized scheduling on GPU-enabled nodes, unlike typical short-lived or nightly batch jobs.
- Tracking model versioning and performance metrics such as accuracy and F1 scores, which are irrelevant for standard applications.
- Data quality and drift monitoring, as well as tracking shifts in user preferences for recommendation systems, whereas non-ML workloads focus more on data integrity.
- Specialized performance metrics and continuous improvement through feedback loops, such as analyzing chatbot interactions for targeted enhancements.
- Resource scaling for ML often involves GPU nodes to handle fluctuating inference requests, different from CPU-based scaling in traditional setups.
- More frequent and granular observations, sometimes on a per-prediction basis, to maintain model accuracy and performance in production environments.
ML models evolve at a faster rate than most other workload types. A key aspect of MLOps is the implementation of a continuous feedback loop, which enables the incorporation of real-world data and user interactions to improve model performance over time. This feedback loop necessitates monitoring and analyzing feedback data, allowing data scientists to refine their models and make sure they remain accurate and effective in real-world scenarios.
Personas involved in MLOps monitoring
Several personas are involved in MLOps monitoring, and effective collaboration and communication among them is essential. Regular meetings, shared dashboards, and well-defined processes help make sure that everyone is aligned and working toward common goals.
- Business stakeholders: The business stakeholder is involved in the MLOps process to make sure of the alignment of ML initiatives with organizational goals. Their primary responsibility is to define the business metrics and key performance indicators (KPIs) that the ML models should impact. This clear articulation of desired outcomes guides the data scientists and ML engineers in developing and optimizing models that drive tangible business value. Collaboration with these technical teams is essential to make sure that model performance is consistently aligned with overarching business objectives. Furthermore, they play a pivotal role in making data-driven decisions based on the insights provided by the ML models and monitoring data. Using these insights allows them to make informed strategic decisions, drive operational efficiencies, and ultimately enhance the organization’s competitive advantage.
- Data engineers: The data engineer is responsible for designing and implementing data pipelines that enable the collection and processing of data for both model training and inference phases. They work closely with data scientists and ML engineers to make sure that the data is in a format suitable for ML algorithms, and that data quality and availability standards are maintained. Moreover, they play a vital role in enabling the collection of data quality metrics and logs, which are used to monitor data pipelines and detect data drift and anomalies. Maintaining robust data pipelines and making sure of consistent data quality allows them to contribute to the overall reliability and accuracy of the ML models deployed in production.
- Data scientists and ML engineers: The data scientist or ML engineer role in MLOps is responsible for defining and monitoring model performance metrics that directly align with the organization’s business objectives. This makes sure that the ML models are delivering tangible value and meeting the desired outcomes. Collaboration with DevOps and operations teams is essential to establish robust monitoring and alerting systems, enabling proactive identification and resolution of any issues. Furthermore, they play a crucial role in analyzing the monitoring data and using domain expertise to identify opportunities for model improvement. They continuously refine and enhance the models based on this analysis to drive better performance, efficiency, and overall business impact.
- DevOps and operations teams: The DevOps professional is responsible for setting up and maintaining the infrastructure needed to run ML workloads on Amazon EKS. This includes implementing robust monitoring and logging systems to collect relevant metrics and logs, and enabling effective tracking and troubleshooting. They make sure of the seamless integration of ML workloads into the overall Amazon EKS ecosystem, facilitating smooth deployment and operation. This role monitors resource usage and performance of the Amazon EKS cluster to maintain optimal efficiency and scalability, making sure that ML workloads have the necessary resources to operate effectively.
Understanding ML infrastructure and workload metrics
Infrastructure metrics play a key role in the smooth operation and optimal performance of ML workloads on Amazon EKS. Key metrics to monitor include Amazon EKS cluster health and resource usage (CPU, memory, and GPU usage), node and pod status and availability, network and disk I/O performance, and auto scaling behavior and efficiency. Tracking these metrics enables proactive identification and resolution of infrastructure-related issues, supporting reliable and scalable ML deployments. To achieve this, AWS provides a range of observability tools, including Amazon CloudWatch, AWS X-Ray, and Amazon Managed Service for Prometheus. These tools enable the collection, monitoring, and analysis of infrastructure metrics, as well as model metrics from ML frameworks such as TensorFlow, PyTorch, and Scikit-Learn.
Moreover, model metrics from ML frameworks are essential for assessing the performance and efficiency of ML models. These include model training progress and convergence metrics (such as loss and accuracy), model inference performance metrics (latency and throughput), and resource usage by ML frameworks and platforms such as Ray, MLflow, Kubeflow, and Metaflow. These metrics provide valuable insights into the training process, inference quality, and resource consumption, enabling data scientists and ML engineers to optimize models, identify bottlenecks, and allocate resources efficiently.
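To make the model-side metrics concrete, the following sketch exposes inference latency and throughput from a serving process in Prometheus format, so a scraper such as Amazon Managed Service for Prometheus can collect them. It uses the open-source prometheus_client library; the metric names, labels, port, and the placeholder predict function are illustrative assumptions.

```python
# Minimal sketch: expose inference request count and latency as Prometheus
# metrics from an ML serving process. Metric names, labels, and the port are
# illustrative; predict() stands in for a real model call.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_REQUESTS = Counter(
    "inference_requests_total", "Total inference requests", ["model_version"]
)
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds", ["model_version"]
)

def predict(features):
    """Placeholder for a real model call."""
    time.sleep(random.uniform(0.01, 0.05))
    return 1

def handle_request(features, model_version: str = "v1"):
    INFERENCE_REQUESTS.labels(model_version=model_version).inc()
    with INFERENCE_LATENCY.labels(model_version=model_version).time():
        return predict(features)

if __name__ == "__main__":
    start_http_server(8000)  # metrics available at http://localhost:8000/metrics
    while True:
        handle_request(features=[0.1, 0.2, 0.3])
```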
Effective logging and tracing allows for immediate feedback on errors and longer-term deep inspection of the MLOps solution on Amazon EKS. Centralized logging for ML workloads, combined with distributed tracing to track request flows through the ML pipeline, provides comprehensive visibility. Tools such as TensorBoard enable visualizing training metrics, while Amazon Managed Service for Prometheus and Amazon Managed Grafana offer production-grade monitoring and alerting capabilities. This robust toolset empowers teams to gain insights, troubleshoot issues, and optimize ML deployments on Amazon EKS efficiently.
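As a brief example of visualizing training metrics with TensorBoard, the following sketch logs loss and accuracy per epoch using PyTorch’s bundled SummaryWriter. The logged values are stand-ins for real training output, and the log directory name is an arbitrary choice.

```python
# Minimal sketch: log per-epoch training metrics for TensorBoard using
# PyTorch's SummaryWriter. Values are stand-ins for real training output.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example-training-job")  # arbitrary directory

for epoch in range(10):
    # Replace these with the real loss/accuracy from your training loop.
    loss = 1.0 / (epoch + 1)
    accuracy = min(0.5 + 0.05 * epoch, 0.95)
    writer.add_scalar("train/loss", loss, global_step=epoch)
    writer.add_scalar("train/accuracy", accuracy, global_step=epoch)

writer.close()
# Visualize with: tensorboard --logdir runs
```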
Monitoring during training includes tracking job status, progress, and resource usage, so that a training job can be evaluated in comparison to other training jobs. For inference, monitoring service health, availability, and latency helps demonstrate the responsiveness of the inference service and the efficiency of its resource usage. Furthermore, model versioning and deployment metrics provide visibility into the lifecycle management process. A/B testing and model comparison metrics enable data scientists to evaluate and optimize model performance iteratively. This comprehensive monitoring approach supports efficient training, reliable inference, and continuous model improvement in MLOps on Amazon EKS.
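For training-job monitoring, one lightweight option is to publish per-epoch metrics to Amazon CloudWatch so jobs can be compared side by side. The following sketch uses the boto3 put_metric_data API; the namespace, metric names, and dimension are illustrative assumptions, and standard AWS credentials are assumed to be configured.

```python
# Minimal sketch: publish training metrics to Amazon CloudWatch with boto3 so
# training jobs can be compared over time. Namespace, metric names, and the
# JobName dimension are illustrative choices.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_training_metrics(job_name: str, loss: float, gpu_util: float) -> None:
    cloudwatch.put_metric_data(
        Namespace="MLOps/Training",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "TrainingLoss",
                "Dimensions": [{"Name": "JobName", "Value": job_name}],
                "Value": loss,
                "Unit": "None",
            },
            {
                "MetricName": "GPUUtilization",
                "Dimensions": [{"Name": "JobName", "Value": job_name}],
                "Value": gpu_util,
                "Unit": "Percent",
            },
        ],
    )

publish_training_metrics(job_name="resnet50-demo", loss=0.42, gpu_util=87.5)
```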
Metrics needed for a complete feedback loop for ML workloads on Amazon EKS
In addition to the infrastructure metrics, you can monitor metrics for the ML models being trained and deployed. Those model metrics are important for creating a complete feedback loop for your ML models. To achieve this goal, you must monitor several types of metrics for ML workloads on Amazon EKS. Monitoring these metrics allows ML teams to gain a comprehensive understanding of the performance and behavior of ML workloads, make data-driven decisions, optimize resource usage, and continuously improve ML models in production. These metrics can be described as follows (a sketch of closing the feedback loop follows the list):
- The first type of metrics is resource usage metrics. These metrics are vital for ML observability on Amazon EKS, including CPU, memory, and GPU usage by ML workloads, as well as identifying opportunities for resource allocation optimization.
- Next you must focus on the accuracy, precision, recall, F1-score, and other relevant model performance metrics. Tracking these model performance metrics, and business-specific metrics and KPIs, is essential for understanding the model’s performance.
- Another class of metrics is data quality and drift metrics. These metrics are key for ML observability on Amazon EKS, including statistical properties of input data over time, data drift detection and alerts, and data quality and validation metrics.
- The last group of ML workload metrics you should monitor is latency and throughput metrics. For ML observability on Amazon EKS, monitoring the end-to-end latency of ML pipelines, the throughput and processing rate of inference requests, and identifying performance bottlenecks and optimization opportunities is essential.
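As a hedged sketch of what closing the feedback loop can look like, the following example compares the model’s live accuracy (computed once ground-truth labels arrive from production) against a baseline and flags the model for retraining when it degrades beyond a tolerance. The baseline value, tolerance, and toy labels are illustrative assumptions.

```python
# Minimal sketch of a feedback-loop check: retrain when live accuracy drops
# too far below the accuracy measured at deployment time. Baseline, tolerance,
# and data are illustrative.
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.92  # accuracy measured at deployment time (assumed)
TOLERANCE = 0.05          # acceptable drop before retraining is triggered

def should_retrain(y_true, y_pred) -> bool:
    live_accuracy = accuracy_score(y_true, y_pred)
    degraded = live_accuracy < BASELINE_ACCURACY - TOLERANCE
    print(f"live accuracy={live_accuracy:.3f}, retrain={degraded}")
    return degraded

# Ground-truth labels collected from production, joined with stored predictions.
if should_retrain(y_true=[1, 0, 1, 1, 0, 1, 0, 1], y_pred=[0, 0, 1, 0, 0, 1, 1, 1]):
    print("Trigger the retraining pipeline (for example, an Argo Workflows or Kubeflow run).")
```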
Conclusion
In this first post of the Observability of MLOps on Amazon EKS series, the objective was to present a strong foundation for understanding the unique challenges and requirements of monitoring ML systems. Exploring the essential aspects of monitoring ML workloads on Amazon EKS highlighted the complexities involved in tracking model performance, resource usage, and data quality. Demonstrating fundamental differences between monitoring ML and non-ML workloads showed how ML systems need more specialized metrics and more frequent, granular observations. An in-depth look at the relevant personas involved in ML operations, from data scientists to DevOps engineers, showed how their unique perspectives shape monitoring requirements. Also shown were essential metrics for ML infrastructure, models, and workloads deployed to Amazon EKS, providing concrete examples of what to track and why it matters. This included metrics such as model accuracy, inference latency, GPU usage, and data drift indicators.
In the second part of the series, we focus on observing ML infrastructure and the essential role of monitoring in MLOps. We examine GPU usage on Amazon EKS and demonstrate how to effectively monitor ML infrastructure using AWS services. The post covers logging, monitoring, and alerting implementations using the MLOps platforms’ built-in capabilities. You can learn to set up observability systems that track both infrastructure metrics and ML-specific indicators, enabling you to maintain peak performance of your ML models in production. We provide practical guidance for implementing advanced observability practices in your ML workflows on Amazon EKS. Whether you are dealing with complex distributed training jobs, managing real-time inference services, or optimizing resource allocation for ML workloads, these tools and practices help make sure that your ML systems remain observable, performant, and reliable.
To learn more about ML observability on Amazon EKS, see the following resources:
- Monitoring GPU workloads on Amazon EKS using AWS managed open-source services
- Open source observability for AWS Inferentia nodes within Amazon EKS clusters
- Maximizing GPU utilization with NVIDIA’s Multi-Instance GPU (MIG) on Amazon EKS: Running more pods per GPU for enhanced performance
- GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances
- Run Spark-RAPIDS ML workloads with GPUs on Amazon EMR on EKS
- For more information on the broader ecosystem of MLOps, see the AWS Labs Data on Amazon EKS GitHub repository, where you can explore the wide range of services used in this space.