October 10, 2023

3 Common Mistakes with PromQL and Kubernetes Metrics

Guest blog post by Luvpreet Singh. Luvpreet is a Platform Engineer at GoHighLevel where he is responsible for the monitoring of cloud infrastructure with Prometheus and Grafana.

Millions of developers write PromQL queries and build custom Grafana dashboards for Kubernetes, and everyone uses the same underlying metrics from node-exporter, kubelet, and kube-state-metrics. Unfortunately, those metrics come with some little-known pitfalls that are easy to stumble into.

“When people tell me they’ve learned from experience, I tell them the trick is to learn from other people’s experience.” — Warren Buffett

In this post, I’ll troubleshoot a simple-looking Prometheus query for Kubernetes that is supposed to return a pod’s memory usage:

container_memory_working_set_bytes{pod="agency-dashboard-api-89b7f557c-xd4l7"}

Unfortunately, this simple query is wrong. Let’s find out why — and how to avoid making similar mistakes in your queries.

Mistake #1: Duplicate Series

The first step to debugging any PromQL query is to verify that you have the exact number of time series in the results that you expect. In this case, I expect a single result, given that I’m filtering on a single pod.
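For example, a quick way to run this check before looking at the raw series is to wrap the same selector in count():

# How many series match this selector? I expect the answer to be 1.
count(container_memory_working_set_bytes{pod="agency-dashboard-api-89b7f557c-xd4l7"})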

Instead, I get two results (I’ve redacted some sensitive labels or changed their values):

container_memory_working_set_bytes{__replica__="replica-0", cluster="dev-cluster", container="agency-dashboard-api", endpoint="https-metrics", id="/kubepods/burstable/pod66a2c312-6066-42e9-99f8-6e045b863e9d/d24d2a7b79f4fac2f51ab2d9126b9be9a58bd122b18c6928faac463882df7ef0", instance="10.128.0.79:10250", job="cadvisor", metrics_path="/metrics/cadvisor", name="d24d2a7b79f4fac2f51ab2d9126b9be9a58bd122b18c6928faac463882df7ef0", namespace="default", node="gke-dev-cluster-lg-workloads-addf1-44", pod="agency-dashboard-api-89b7f557c-xd4l7", project="dev-k8s", service="kubelet"}


container_memory_working_set_bytes{__replica__="replica-0", cluster="dev-cluster", endpoint="https-metrics", id="/kubepods/burstable/pod66a2c312-6066-42e9-99f8-6e045b863e9d/d24d2a7b79f4fac2f51ab2d9126b9be9a58bd122b18c6928faac463882df7ef0", instance="10.128.0.79:10250", job="cadvisor", metrics_path="/metrics/cadvisor", name="d24d2a7b79f4fac2f51ab2d9126b9be9a58bd122b18c6928faac463882df7ef0", namespace="default", node="gke-dev-cluster-lg-workloads-addf1-44", pod="agency-dashboard-api-89b7f557c-xd4l7", project="dev-k8s", service="kubelet"}

Initially, I suspected something might be wrong with my monitoring setup: maybe multiple jobs were scraping the same data. Prometheus attaches labels such as job and instance to every scraped series, and to plot data correctly you have to account for those labels in the query. If the same data is being scraped by multiple jobs or emitted by multiple instances, a query that ignores those labels will return confusing or incorrect results, so I had to rule that out first. (Check out this YouTube video about Prometheus monitoring mistakes for more on this.)
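One quick check, sketched here with the same selector, is to count the matching series per job and per instance:

# If more than one job or instance shows up, the duplicates come from the scrape setup.
count(container_memory_working_set_bytes{pod="agency-dashboard-api-89b7f557c-xd4l7"}) by (job, instance)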

I added the job label:

container_memory_working_set_bytes{pod="agency-dashboard-api-89b7f557c-xd4l7", job="kubelet"}

But the query still resulted in two time series for the same pod. This meant that only one job was scraping the data, so my setup was correct. But the query was still wrong.

Mistake #2: Grouping/Sum Mistakes

Since the query was returning two almost identical time series, the genius in me decided to sum the metrics and divide the result by two.

sum(container_memory_working_set_bytes{pod="agency-dashboard-api-89b7f557c-xd4l7", job="kubelet"}) by (pod) / 2

But when I cross-checked the results of this query against the data in my GKE console, the numbers didn't match, and the size of the gap varied over time. Part of the reason lies in how GKE calculates its memory metric: GKE takes only non-evictable memory into account.

The lesson? Don’t de-dupe metrics by averaging them — not until you understand in depth how they’re calculated and why there’s a duplicate in the first place.

Mistake #3: Unexpected Cardinality

Let’s go back to our duplicate time series and take a closer look. How are the two time series different from one another?

The answer lies in the container label. This label exists in one time series but is missing from the other.
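You can make this visible directly in PromQL, for example by grouping the series count by the container label:

# One result has container="agency-dashboard-api"; the other has an empty container label.
count(container_memory_working_set_bytes{pod="agency-dashboard-api-89b7f557c-xd4l7", job="kubelet"}) by (container)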

To understand why, we need to take a detour and know what a pause container is. From the Kubernetes docs:

“In a Kubernetes Pod, an infrastructure or “pause” container is first created to host the container. In Linux, the cgroups and namespaces that make up a pod need a process to maintain their continued existence; the pause process provides this. Containers that belong to the same pod, including infrastructure and worker containers, share a common network endpoint (same IPv4 and / or IPv6 address, same network port spaces). Kubernetes uses pause containers to allow for worker containers crashing or restarting without losing any of the networking configuration.”

When a pod is created in Kubernetes, the number of containers created on the node is always greater than the number of containers specified in the pod manifest: each pod also gets a pause container. You can verify this by SSHing into the node and running `docker ps`. The pause container holds all the network configuration of the pod. Deleting the pause container will also delete the application container, and the pod will be restarted.

The duplicate time series now makes sense. There is one time series per container, including one for the pause container, which shows up with an empty container label.

I then fixed the query and finally got the correct metrics.

container_memory_working_set_bytes{pod="agency-dashboard-api-89b7f557c-xd4l7", job="kubelet", container!=""}

The lesson? Always check your labels. Make sure you understand which labels can change, which labels are hard-coded, and which labels are optional. Most mistakes involving labels or cardinality lead to incorrect metrics.
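As a general pattern for dashboards, I now aggregate explicitly over the labels I care about and filter out the series with an empty container label. A sketch of that pattern (adapt the label names to your setup) looks like this:

# Per-pod working set memory, keeping only real application containers.
sum(container_memory_working_set_bytes{job="kubelet", container!=""}) by (namespace, pod)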

Summary

I hope that walking through a simple but incorrect PromQL query will help you avoid similar mistakes with your own metrics. PromQL is simple, but that simplicity is deceptive. For every query, make sure you know exactly what you're measuring. It's easy for extra data to slip in and throw off your aggregations.

Good luck and happy monitoring!
