Containers

Category: Artificial Intelligence

Enhancing and monitoring network performance when running ML Inference on Amazon EKS

In this post, we explore how to enhance and monitor network performance for ML inference workloads running on Amazon EKS using the newly launched Container Network Observability feature. We demonstrate practical use cases through a sample Stable Diffusion image generation workload, showing how platform teams can visualize service communication, analyze traffic patterns, investigate latency issues, and identify network bottlenecks—ultimately improving metrics like inference latency and time to first token.

Introducing the fully managed Amazon EKS MCP Server (preview)

Learn how to manage your Amazon Elastic Kubernetes Service (Amazon EKS) clusters through simple conversations instead of complex kubectl commands or deep Kubernetes expertise. This post shows you how to use the new fully managed EKS Model Context Protocol (MCP) server, now in preview, to deploy applications, troubleshoot issues, and upgrade clusters using natural language. We'll walk through real scenarios showing how conversational AI turns multi-step manual tasks into simple natural language requests.

Accelerate container troubleshooting with the fully managed Amazon ECS MCP server (preview)

Amazon ECS today launched a fully managed, remote Model Context Protocol (MCP) server in preview, enabling AI agents to provide deep contextual knowledge of ECS workflows, APIs, and best practices for more accurate guidance throughout your application lifecycle. In this post, we walk through how to streamline your container troubleshooting using the Amazon ECS MCP server, which offers intelligent AI-assisted inspection and diagnostics through natural language queries in CLI tools like Kiro, IDEs like Cline and Cursor, and directly within the Amazon ECS console through Amazon Q.

Kubernetes right-sizing with metrics-driven GitOps automation

In this post, we introduce an automated, GitOps-driven approach to resource optimization in Amazon EKS using AWS services such as Amazon Managed Service for Prometheus and Amazon Bedrock. The solution helps optimize Kubernetes resource allocation through metrics-driven analysis, pattern-aware optimization strategies, and automated pull request generation while maintaining GitOps principles of collaboration, version control, and auditability.

How to run AI model inference with GPUs on Amazon EKS Auto Mode

In this post, we show you how to swiftly deploy inference workloads on EKS Auto Mode and demonstrate key features that streamline GPU management. We walk through a practical example by deploying open weight models from OpenAI using vLLM, while showing best practices for model deployment and maintaining operational efficiency.

Unlocking next-generation AI performance with Dynamic Resource Allocation on Amazon EKS and Amazon EC2 P6e-GB200

In this post, we explore how Amazon EC2 P6e-GB200 UltraServers are transforming distributed AI workloads through seamless Kubernetes integration, featuring the NVIDIA GB200 Grace Blackwell architecture that enables memory-coherent domains of up to 72 GPUs. The post demonstrates how Dynamic Resource Allocation (DRA) on Amazon EKS enables sophisticated GPU topology management and cross-node GPU communication through IMEX channels, making it possible to efficiently train and deploy trillion-parameter AI models at scale.

Under the hood: Amazon EKS ultra scale clusters

This post was co-authored by Shyam Jeedigunta, Principal Engineer, Amazon EKS; Apoorva Kulkarni, Sr. Specialist Solutions Architect, Containers; and Raghav Tripathi, Sr. Software Dev Manager, Amazon EKS. Today, Amazon Elastic Kubernetes Service (Amazon EKS) announced support for clusters with up to 100,000 nodes. With Amazon EC2's new generation accelerated computing instance types, this translates to […]

Amazon EKS enables ultra scale AI/ML workloads with support for 100K nodes per cluster

We’re excited to announce that Amazon Elastic Kubernetes Service (Amazon EKS) now supports up to 100,000 worker nodes in a single cluster, enabling customers to scale up to 1.6 million AWS Trainium accelerators or 800,000 NVIDIA GPUs to train and run the largest AI/ML models. This capability empowers customers to pursue their most ambitious AI […]

Introducing AI on EKS: powering scalable AI workloads with Amazon EKS

This blog post was jointly authored by Vara Bonthu, Principal OSS Specialist Solutions Architect, and Omri Shiv, Senior Open Source ML Engineer. We’re excited to announce the launch of AI on EKS: a new open source initiative from Amazon Web Services (AWS) designed to help customers deploy, scale, and optimize artificial intelligence/machine learning (AI/ML) […]

How Vannevar Labs cut ML inference costs by 45% using Ray on Amazon EKS

This blog is authored by Colin Putney (ML Engineer at Vannevar Labs), Shivam Dubey (Specialist SA, Containers at AWS), Apoorva Kulkarni (Sr. Specialist SA, Containers at AWS), and Rama Ponnuswami (Principal Container Specialist at AWS). Vannevar Labs, a defense tech startup, successfully cut machine learning (ML) inference costs by 45% using Ray and Karpenter on Amazon Elastic Kubernetes Service (Amazon EKS). […]