
Inside Pinterest’s Custom Spark Job logging and monitoring on Amazon EKS: Using AWS for Fluent Bit, Amazon S3, and ADOT

In Part 1, we explored Moka’s high-level design and logging infrastructure, showcasing how AWS for Fluent Bit, Amazon S3, and a robust logging framework provide operational visibility and facilitate issue resolution. For more details, read Part 1 here.

Introduction

As we transition to the second part of our series, our focus shifts to metrics and observability within the Moka platform. Observability is crucial for maintaining the stability, performance, and efficiency of any Kubernetes (K8s) platform, and Amazon Elastic Kubernetes Service (Amazon EKS) is no exception. In this section, we delve into the strategies and tools that Pinterest uses to monitor its EKS clusters, providing detailed insights into system performance and application behavior.

In this part, we explore how Pinterest uses Prometheus-formatted metrics and OpenTelemetry (OTEL) to build a comprehensive observability framework. This framework includes the kubemetricexporter sidecar for scraping metrics, the AWS Distro for OpenTelemetry (ADOT) for detailed insights, and kube-state-metrics for a broader overview of the Amazon EKS control plane. By integrating these tools, we can efficiently collect, process, and analyze metrics, keeping the platform scalable and performant.

Metrics and observability

In order to operate a K8s platform efficiently, storing metrics in a queryable, displayable format is critical to overall platform stability, performance/efficiency, and, ultimately, operating costs. Amazon EKS is no exception in this regard, being closely aligned to the upstream K8s project.

Prometheus-formatted metrics are the standard for K8s ecosystem tools. Observability frameworks such as Prometheus (the project, not just the format), OTEL, and other Cloud Native Computing Foundation (CNCF) projects continue to see increases in activity year over year. In 2022, the OTEL and Prometheus projects were in the top 10% of activity among all CNCF open source projects.

At Pinterest, the current observability framework, Statsboard, is time-series database (TSDB) based: a sidecar, metrics-agent, runs on every host. Systems typically use TSDB libraries to write to their local metrics-agent, which passes the metrics on to Kafka clusters; from there they are ingested into Goku clusters, Pinterest’s custom TSDB, and made available in Statsboard. In contrast, Prometheus-style frameworks have systems expose their metrics for scraping by agents. Unfortunately, support for TSDB as a metrics destination within the open source CNCF/K8s ecosystem is inactive. Continued custom integration with TSDB is ‘against the grain’ of the broader K8s focus, and all known K8s-to-TSDB connectors have received very little community engagement over the past three years.

To address this gap, the Cloud Runtime team at Pinterest developed kubemetricexporter, a K8s sidecar that periodically scrapes Prometheus endpoints in a pod and writes the scraped metrics to the local metrics-agent. Because Amazon EKS pods can be in a different network than the host, the batch processing platform team at Pinterest worked with the Cloud Runtime team to extend kubemetricexporter so that it can be configured to use the host IP address instead of localhost. The following figure shows the deployment pattern.

Figure 1: Using kubemetricexporter with Prometheus Metrics Source
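
To make the host IP behavior concrete, the following is a minimal, hypothetical pod spec excerpt showing how the downward API can pass the node's IP address to a kubemetricexporter-style sidecar. The image names, ports, and environment variable names are illustrative assumptions, not the actual Moka manifests:

```yaml
# Hypothetical pod spec excerpt: a kubemetricexporter-style sidecar scrapes a
# Prometheus endpoint exposed by the main container and forwards metrics to the
# node-local metrics-agent. Images, ports, and env/flag names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-with-metrics           # placeholder name
spec:
  containers:
    - name: spark-driver
      image: example.com/spark:latest        # placeholder image
      ports:
        - containerPort: 9090                # assumed Prometheus metrics port
    - name: kubemetricexporter
      image: example.com/kubemetricexporter:latest   # placeholder image
      env:
        # Downward API: expose the node (host) IP so the sidecar can reach the
        # metrics-agent on the host network instead of using localhost.
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: SCRAPE_TARGET
          value: "http://localhost:9090/metrics"     # endpoint scraped inside the pod
        - name: METRICS_AGENT_ADDR
          value: "$(HOST_IP):18126"                  # assumed metrics-agent port on the host
```

The fieldPath status.hostIP is what lets the sidecar address the node-local metrics-agent even though the pod runs in a separate pod network.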

Collecting metrics from an Amazon EKS system involves three components:

  1. Sources: The locations from which metrics are collected.
  2. Agents: Applications running in the Amazon EKS environment, often called agents, that collect metrics and monitoring data and push it to the third component. Examples of this component are ADOT and the Amazon CloudWatch agent.
  3. Destinations: A monitoring data storage and analysis solution, typically a data service optimized for time-series formatted data. Examples are Amazon Managed Service for Prometheus and Amazon CloudWatch. For Pinterest, this is Statsboard/Goku.

After exploring a variety of options and configurations, we ultimately decided to use a combination of OTEL, for extracting detailed insights from our EKS clusters, and kube-state-metrics, an open source K8s tool, for providing a broader overview of the Amazon EKS control plane. In contrast with Prometheus, the OTEL framework focuses only on metrics collection and pre-processing, leaving metrics storage to other solutions. A key part of the framework is the OpenTelemetry Collector, an executable binary that can extract telemetry data, optionally process it, and export it onward. The Collector supports several popular open source protocols for receiving and sending telemetry data, and it offers a pluggable architecture for adding more protocols.

Data receiving, processing, and exporting are done through pipelines. The Collector can be configured with one or more pipelines, and each pipeline includes the following (a minimal configuration sketch appears after the list):

  • A set of Receivers that receive the data.
  • A series of optional Processors that get the data from receivers and process it.
  • A set of Exporters that get the data from processors and send it further outside the Collector.
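
As an illustration of how these pieces fit together, here is a minimal, generic Collector configuration sketch (not Moka's actual configuration) that wires one receiver, one optional processor, and one exporter into a single metrics pipeline:

```yaml
# Minimal OpenTelemetry Collector configuration sketch: one metrics pipeline
# wiring a receiver, an optional processor, and an exporter together.
receivers:
  otlp:                      # receive metrics over the OTLP protocol
    protocols:
      grpc:
      http:

processors:
  batch: {}                  # batch data points before exporting

exporters:
  debug: {}                  # write received telemetry to the Collector's own log

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```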

After extensive experimentation, we found that the following pipeline configuration works well for our needs (an illustrative configuration sketch follows the list):

  1. Receiver: The Prometheus receiver gives us the capability to securely access the various EKS cluster endpoints that export their metrics in the Prometheus format. In practice, we iterated through a number of configurations before arriving at a set of Prometheus scraping jobs that extracted a comprehensive view of the cluster.
  2. Processors: We used the Attributes processor to eliminate a number of redundant and unnecessary fields from the metrics collected in step 1. This reduced the total volume of data that ADOT needed to process and kept each metrics data point under the default 1 KB per-metric limit of metrics-agent/Goku.
  3. Exporter: A key advantage of OTEL is its support for a Prometheus exporter, which allows the metrics collected upstream in the pipeline to be exposed in the Prometheus format so that they can be scraped by kubemetricexporter.
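
As a rough illustration of this pipeline, the following hedged configuration sketch uses the Prometheus receiver, an Attributes processor, and the Prometheus exporter; the scrape job, label names, and port are placeholders rather than Pinterest's production values:

```yaml
# Illustrative ADOT/OTEL Collector pipeline: scrape Prometheus endpoints, drop
# redundant labels with the attributes processor, and re-expose the result in
# Prometheus format for kubemetricexporter to scrape. Jobs, labels, and ports
# are placeholders.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubelet                      # example scrape job
          scheme: https
          kubernetes_sd_configs:
            - role: node
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

processors:
  attributes/trim:
    actions:
      # Drop labels that are not needed downstream, to stay under the per-metric
      # size limit of metrics-agent/Goku (label names here are examples only).
      - key: kubernetes_io_hostname
        action: delete
      - key: beta_kubernetes_io_arch
        action: delete

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"     # scraped by the kubemetricexporter sidecar

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [attributes/trim]
      exporters: [prometheus]
```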

Our OTEL metrics pipeline looks like the following figure:

Figure 2: OTEL pipeline for Moka observability

For the actual OTEL distribution, we used ADOT. ADOT is an AWS-supported distribution of the OTEL project that integrates with a number of AWS services, such as Amazon Managed Service for Prometheus and CloudWatch. Although we’re not planning to integrate with these services, AWS support also means that ADOT can be installed on an EKS cluster through Amazon EKS add-ons, and the ADOT add-on is regularly updated with the latest bug fixes and security patches.
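
For reference, enabling the add-on through an eksctl cluster configuration could be sketched as follows; the cluster name, region, and version are placeholders (the ADOT add-on also expects cert-manager to be installed in the cluster):

```yaml
# Sketch: enabling the ADOT add-on on an existing EKS cluster with eksctl.
# Cluster name, region, and version are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: moka-example      # placeholder cluster name
  region: us-west-2       # placeholder region
addons:
  - name: adot
    version: latest       # or pin to a specific add-on version
```

Such a file would then be applied with eksctl create addon, or the add-on can be enabled through the AWS console or CLI.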

Guidance on metrics collected

Both the Amazon EKS service and the K8s environment produce and expose hundreds to thousands of metrics. This section lists the essential metrics Pinterest is collecting for Moka as per guidance from the AWS team.

Control plane

The Amazon EKS control plane is managed by AWS and runs in an AWS-managed account. It consists of control plane nodes that run K8s components, such as etcd and the K8s application programming interface (API) server. The API server exposes thousands of metrics. The following table lists the essential control plane metrics that are being collected.

| Name | Metric name | Description |
|---|---|---|
| API server total requests | apiserver_request_total | Counter of apiserver requests, broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code. |
| API server latency | apiserver_request_duration_seconds | Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component. |
| Request latency | rest_client_request_duration_seconds | Request latency in seconds, broken down by verb and URL. |
| Total requests | rest_client_requests_total | The number of HTTP requests, partitioned by status code, method, and host. |
| API server request duration | apiserver_request_duration_seconds_bucket | Measures the latency for each request to the K8s API server in seconds. |
| API server request latency sum | apiserver_request_latencies_sum | The response latency distribution in microseconds for each verb, resource, and subresource. |
| API server registered watchers | apiserver_registered_watchers | The number of currently registered watchers for a given resource. |
| API server number of objects | apiserver_storage_objects | The number of stored objects at the time of last check, split by kind. |
| Admission controller latency | apiserver_admission_controller_admission_duration_seconds | Admission controller latency histogram in seconds, identified by name and broken out for each operation, API resource, and type (validate or admit). |
| Etcd latency | etcd_request_duration_seconds | Etcd request latency in seconds for each operation and object type. |
| Etcd DB size | etcd_db_total_size_in_bytes | Etcd database size. |
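
For reference, a scrape job for collecting these API server metrics, placed under the Collector's Prometheus receiver, could look like the following sketch. It is the standard in-cluster apiserver scrape pattern, not Moka's exact job definition:

```yaml
# Illustrative scrape job for the K8s API server, placed under
# receivers.prometheus.config.scrape_configs in the Collector configuration.
- job_name: kubernetes-apiservers
  scheme: https
  kubernetes_sd_configs:
    - role: endpoints
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    # Keep only the default/kubernetes:https endpoint, which is the API server.
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https
```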

Cluster state

The cluster state metrics, exposed through the K8s API server, provide insights into the state of the cluster and the K8s objects in it. K8s uses these metrics to schedule pods effectively, and they focus on the health of various objects inside the cluster, such as deployments, replica sets, nodes, and pods. Cluster state metrics expose pod information about status, capacity, and availability, and they are essential for keeping track of how the cluster performs when scheduling tasks. There are hundreds of exposed cluster state metrics; the following table lists the ones that the Pinterest team found most useful.

| Name | Metric name | Description |
|---|---|---|
| Node status | kube_node_status_condition | Current health status of the node. Returns a set of node conditions and true, false, or unknown for each. |
| Desired pods | kube_deployment_spec_replicas or kube_daemonset_status_desired_number_scheduled | The number of pods specified for a Deployment or DaemonSet. |
| Current pods | kube_deployment_status_replicas or kube_daemonset_status_current_number_scheduled | The number of pods currently running in a Deployment or DaemonSet. |
| Pod capacity | kube_node_status_capacity_pods | The maximum number of pods allowed on the node. |
| Available pods | kube_deployment_status_replicas_available or kube_daemonset_status_number_available | The number of pods currently available for a Deployment or DaemonSet. |
| Unavailable pods | kube_deployment_status_replicas_unavailable or kube_daemonset_status_number_unavailable | The number of pods currently unavailable for a Deployment or DaemonSet. |
| Pod readiness | kube_pod_status_ready | Whether a pod is ready to serve client requests. |
| Pod status | kube_pod_status_phase | The current phase of the pod: pending, running, succeeded, failed, or unknown. |
| Pod waiting reason | kube_pod_container_status_waiting_reason | The reason a container is in a waiting state. |
| Pod termination status | kube_pod_container_status_terminated | Whether the container is currently in a terminated state. |
| Pending pods | pending_pods | The number of pods waiting to be scheduled. |
| Pod scheduling attempts | pod_scheduling_attempts | The number of attempts made to schedule pods. |
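
The kube_* metrics above are served by the kube-state-metrics deployment, which the Collector can scrape with a job along the following lines. This is a hedged example that assumes a default kube-state-metrics installation in the kube-system namespace:

```yaml
# Illustrative scrape job for kube-state-metrics, also placed under
# receivers.prometheus.config.scrape_configs. Assumes a standard installation
# exposing the default service in kube-system.
- job_name: kube-state-metrics
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [kube-system]
  relabel_configs:
    # Keep only endpoints that belong to the kube-state-metrics service.
    - source_labels: [__meta_kubernetes_service_name]
      action: keep
      regex: kube-state-metrics
```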

Observability performance and scaling considerations

The observability configuration for Moka runs both kube-state-metrics and an ADOT collector (with kubemetricexporter sidecars) as K8s Deployments with a single replica. In other words, they are, by default, single pods. This leads to the following scaling considerations:

  • In addition to ADOT and kube-state-metrics, it is necessary to keep an eye on kubemetricexporter, because it could become a bottleneck as the cluster scales.
  • If the ADOT/OTEL collector includes receivers that perform scraping, such as the Prometheus receiver, then we must make sure that the total time taken to scrape all targets during a single iteration does not approach the configured scrape interval. If it does, we must shard the Collector by creating multiple instances, each running a different set of scrape jobs (a sketch follows this list).
  • If kubemetricexporter takes longer than 60 seconds to scrape its endpoints, then we must either increase its pullinterval or deploy more kubemetricexporter instances/pods (and shard the ADOT collector accordingly).
  • Monitor the ADOT collector’s health. ADOT exposes its own set of metrics in Prometheus format, which is particularly useful for checking collector resource usage and performance.
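
To make the sharding idea concrete, the following sketch splits the scrape jobs across two Collector instances; the particular split, job names, and targets are purely illustrative:

```yaml
# Illustrative sharding of the ADOT Collector into two instances, each owning a
# disjoint subset of scrape jobs. Processor and exporter settings (omitted here)
# would be identical across shards. The split itself is a placeholder.

# Shard 1: control plane endpoints
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-apiservers
          scheme: https
          kubernetes_sd_configs:
            - role: endpoints
---
# Shard 2: kube-state-metrics and node-level endpoints
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics
          static_configs:
            - targets: ["kube-state-metrics.kube-system.svc:8080"]  # assumed default service
        - job_name: kubelet
          scheme: https
          kubernetes_sd_configs:
            - role: node
```

Each shard would run as its own Deployment, and a kubemetricexporter sidecar would scrape each shard's Prometheus exporter endpoint.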

Conclusion

Pinterest’s transition from the Hadoop-based Monarch platform to Moka, a Spark-on-Amazon-EKS platform, marks a pivotal advancement in meeting the company’s growing data processing demands. By developing Moka, we have built a scalable, efficient, and robust system that addresses critical needs such as job processing, logging, and observability. By using Fluent Bit for logging and OpenTelemetry with kube-state-metrics for metrics collection, we ensure comprehensive monitoring and efficient management of our EKS clusters.

This journey shows the significance of continuous improvement in data processing infrastructure. Our efforts to optimize Moka demonstrate the importance of scalability and observability in supporting Pinterest’s mission to provide data-driven insights and recommendations. We remain committed to refining our platform and welcome collaboration and feedback from the community as we move forward. Thank you for following our journey.

Acknowledgements: Special thanks to Greg Mark, Doug Youd, Alan Halcyon, and Vara Bonthu from the strategic account team for their continued support in the Moka migration. We also appreciate the AWS WWSO team, especially Aaron Miller, Alex Lines, and Nirmal Mehta, for their support with OTEL.