April 7, 2023

When Is a CPU Not a CPU? Benchmark of Kubernetes Providers and Node Efficiency.

TL;DR: On some cloud providers, you get half the CPU you expect because of burstable nodes. Without burstable nodes the overhead is smaller, but still significant.

I recently deployed a pod requesting 1 CPU to a shiny new Kubernetes cluster. The cluster had plenty of empty nodes with 2 CPUs each. Yet my Pod was stuck in Pending and couldn't be scheduled.

How can a node with 2 CPUs not fit a pod requesting 1 CPU?

In this post, we'll understand node overhead on Kubernetes. Then we'll benchmark cloud providers and find the worst offenders.

Update: After publishing this post, I received a flood of feedback and questions, mostly about burstable nodes. I've published an update here that explains the controversy. Even though GKE is making only half the node available, is their behaviour better than what AWS and AKS do? It's tricky. Read the update for details.

How much CPU is reserved on Kubernetes Nodes?

Good question. According to the docs, every node reserves CPU and memory for itself.

Node CPU Explained (Source: Kubernetes docs; License: CC BY 4.0)

The reserved resources are split into three parts:

  • kube-reserved - for Kubernetes daemons like kubelet and the container runtime
  • system-reserved - for system daemons like ssh
  • eviction-threshold - this is only relevant for memory, not CPU. It’s a buffer that reduces the chance of OOMKills by giving Kubernetes a chance to evict Pods first.

After you subtract reserved resources from total CPU/Memory, what’s left over for pods is known as Node Allocatable.
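As a back-of-the-envelope sketch, Node Allocatable for CPU is simply capacity minus the reserved slices. The reservation sizes below are hypothetical, chosen only to illustrate the arithmetic; real values vary by provider and node type:

```shell
# Node Allocatable (CPU) = Capacity - kube-reserved - system-reserved
# (eviction-threshold applies to memory only, so it's omitted here).
# The reserved values below are hypothetical, for illustration only.
capacity_m=2000        # a 2-CPU node, in millicores
kube_reserved_m=1000   # hypothetical kube-reserved
system_reserved_m=60   # hypothetical system-reserved
allocatable_m=$((capacity_m - kube_reserved_m - system_reserved_m))
echo "Allocatable CPU: ${allocatable_m}m"
```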

So the real question for GKE (and other providers) is: how big is Node Allocatable relative to the node's total resources?

To find out, I ran kubectl describe on my GKE node:

Name:               gke-...
Labels:             beta.kubernetes.io/arch=amd64
...
Capacity:
  cpu:                        2
  memory:                     4025892Ki
Allocatable:
  cpu:                        940m
  memory:                     2880036Ki

Holy smokes, Batman!

GKE is taking more than half the CPU for itself.

The node's capacity is 2 CPU, but node allocatable is 0.94 CPU!
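If you want to pull just that number out of `kubectl describe` output on your own nodes, a quick text filter does the job. Here the sample output is hardcoded into a variable so the snippet runs standalone, without a cluster:

```shell
# Sample Allocatable section from `kubectl describe node`
# (hardcoded here so the snippet runs without a live cluster).
allocatable_section='Allocatable:
  cpu:                        940m
  memory:                     2880036Ki'

# Print just the allocatable CPU value
echo "$allocatable_section" | awk '/cpu:/ {print $2}'
```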

With the default node type, you get half the CPU you'd expect. (Autopilot clusters are better: the inefficiency is still there, but GKE swallows the cost of the overhead for you.)

See for yourself

  1. Set up a non-Autopilot GKE cluster.
  2. Apply a demo Pod that requests 1 CPU.
  3. Run kubectl get pods and observe the pending pod.
  4. Bonus: add the cluster to Robusta.dev and discover other health issues on your cluster.
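For step 2, a minimal demo Pod might look like the following. The name and image are placeholders, not from the original post:

```yaml
# Minimal demo Pod requesting 1 full CPU.
# Pod name and image are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "1"
```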

Benchmarking other Kubernetes vendors and node types

We decided to benchmark managed Kubernetes providers and self-managed solutions. We chose:

  • 5 cloud providers - EKS, GKE, AKS, DigitalOcean, and CIVO
  • 2 self-managed Kubernetes setups - OpenShift, Rancher
  • 2 local-dev setups - KIND and Minikube

There are many possible node types for each cloud provider, which makes comparing them difficult. We chose to benchmark both 2-CPU and 4-CPU nodes. Memory was determined by the node type we chose.

For self-managed and dev clusters, we just benchmarked clusters we had on-hand.

To make analyzing the results easy, we onboarded all our clusters to a single Robusta.dev account. Then we opened the Nodes page and sorted by available CPU.

This let us track all clusters on a single page.

Kubernetes Nodes with very high overhead on GKE

We then exported the data to a spreadsheet, attached below.


First, the raw results.

Kubernetes CPU and Memory efficiency

CPU and memory efficiency were calculated as (Total - Reserved) / Total. Higher scores are better.
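For the GKE node above, that works out as follows (2 CPUs is 2000m, and Allocatable / Total is the same quantity as (Total - Reserved) / Total):

```shell
# CPU efficiency for the GKE node above:
# (Total - Reserved) / Total == Allocatable / Total
total_m=2000         # 2 CPUs in millicores
allocatable_m=940    # from the node's Allocatable section
echo "CPU efficiency: $(( 100 * allocatable_m / total_m ))%"
```

That 47% efficiency is the flip side of the "GKE reserving up to 53% of CPU" figure below.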

A few things are immediately obvious:

  1. Memory efficiency is much worse than CPU efficiency, regardless of cloud provider.
  2. AKS and GKE are the worst offenders, with AKS reserving up to 46% of memory, and GKE reserving up to 53% of CPU.
  3. We only measured one node type for AWS but it was pretty good! (Note to self: ask the team why we didn't benchmark two node types on AWS.)
  4. Some clusters, like Rancher and CIVO, supposedly have 100% efficiency. I'm not sure that's a good thing! Reserving some resources for the node is important to guarantee stability. Are they reserving resources in some other way (e.g. using DaemonSets or static pods for system workloads), or are they playing it risky? Not sure.

Let's graph the results to get a better picture:

Raw data available here


Benchmark your nodes!

On a related note, we just released an open-source CLI for right-sizing Kubernetes workloads. It gives accurate recommendations for requests and limits based on historical Prometheus data. See it on GitHub.
