TLDR: On some cloud providers, you get half the CPU you expect because the nodes are burstable. On non-burstable nodes the overhead is smaller, but still significant.
I recently deployed a pod requesting 1 CPU to a shiny new Kubernetes cluster. The cluster had plenty of empty nodes with 2 CPUs each. Yet my Pod was stuck in Pending and couldn't be scheduled.
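A minimal pod that reproduces the situation looks roughly like this (the names and image are illustrative; the only detail that matters is the 1-CPU request):

```bash
# A minimal pod that requests one full CPU (illustrative names, nginx as a stand-in image).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cpu-hungry
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "1"
EOF
```

If it stays Pending, kubectl describe pod cpu-hungry shows the scheduler's FailedScheduling events and the reason.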
How can a node with 2 CPUs not fit a pod requesting 1 CPU?
In this post, we'll understand node overhead on Kubernetes. Then we'll benchmark cloud providers and find the worst offenders.
Update: After publishing this post, I received a flood of feedback and questions, mostly about burstable nodes. I've published an update here that explains the controversy. Even though GKE is making only half the node available, is their behaviour better than what AWS and AKS do? It's tricky. Read the update for details.
Good question. According to the docs, every node reserves CPU and Memory for itself.
The reserved resources are split into three parts: kube-reserved (for Kubernetes system daemons like the kubelet and the container runtime), system-reserved (for OS-level daemons like sshd), and a hard eviction threshold (a buffer the kubelet keeps free so it can evict pods before the node runs out of memory).
After you subtract the reserved resources from the node's total CPU/Memory, what's left over for pods is known as Node Allocatable: Allocatable = Capacity − kube-reserved − system-reserved − eviction threshold.
So the real question for GKE (and other providers) is: how big is Node Allocatable relative to the node's total resources?
To find out, I ran kubectl describe on my GKE node:
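Here is the relevant part, abridged to the CPU lines (the real output also lists memory, ephemeral storage, and pod capacity):

```bash
kubectl describe node <my-gke-node>

# Capacity:
#   cpu:   2
#   ...
# Allocatable:
#   cpu:   940m
#   ...
```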
Holy smokes batman!
GKE is taking more than half the CPU for itself.
The node's capacity is 2 CPUs, but Node Allocatable is only 0.94 CPU! No wonder a pod requesting 1 full CPU couldn't be scheduled.
With the default node type, you get half of what you'd expect. (Autopilot clusters are better: the inefficiency is still there, but GKE absorbs the cost of the overhead for you.)
We decided to benchmark managed Kubernetes providers (GKE, AWS, and AKS) alongside self-managed solutions.
There are many possible node types for each cloud provider, which makes comparisons difficult. We settled on benchmarking both 2-CPU and 4-CPU nodes; the amount of memory was determined by whichever node type we picked.
For self-managed and dev clusters, we just benchmarked clusters we had on hand.
To make analyzing the results easy, we onboarded all our clusters to a single Robusta.dev account. Then we opened the Nodes page and sorted by available CPU.
This let us track all clusters on a single page.
We then exported the data to a spreadsheet, attached below.
First, the raw results.
CPU and memory efficiency were calculated as (Total - Reserved) / Total. Higher scores are better.
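For example, the 2-CPU GKE node above has 0.94 CPU allocatable, so its CPU efficiency is 0.94 / 2 = 47%.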
A few things are immediately obvious:
Let's graph the results to get a better picture:
Benchmark your nodes!
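If you just want the raw numbers for your own cluster, a kubectl one-liner like this (no extra tooling; the column names are my own) prints capacity next to allocatable for every node:

```bash
# Compare each node's total resources to what is actually available for pods.
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,CPU_CAP:.status.capacity.cpu,CPU_ALLOC:.status.allocatable.cpu,MEM_CAP:.status.capacity.memory,MEM_ALLOC:.status.allocatable.memory'
```

Keep in mind that Allocatable only reflects the node's reserved overhead; it does not subtract what your running pods already request.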
On a related note, we just released an open-source CLI for right-sizing Kubernetes workloads. It gives accurate recommendations for requests and limits based on historical Prometheus data. See it on GitHub.