September 24, 2022

You Can't Have Both High Utilization and High Reliability

Everyone wants high utilization and high reliability. The hard truth about Kubernetes is that you need to pick one or the other.

By Natan Yellin, Robusta.dev co-founder

Here is a question I asked on LinkedIn and Twitter that demonstrates the dilemma:

The question is simple: how much CPU should you request for a pod that usually needs two CPUs but sometimes needs three?

Let's look at a few strategies and their tradeoffs.

Overcommit with CPU request = 2, limit = 3

This is the naive strategy. At first glance, it seems good. You set aside 2 CPUs for your pod but let it access 3 CPUs when necessary.
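As a sketch, this strategy maps to a pod spec like the following (the names here are illustrative, not from the original question):

```yaml
# Strategy 1: overcommit with a limit (illustrative sketch)
apiVersion: v1
kind: Pod
metadata:
  name: spiky-app
spec:
  containers:
    - name: app
      image: example.com/app:latest
      resources:
        requests:
          cpu: "2"   # guaranteed: the scheduler reserves 2 CPUs
        limits:
          cpu: "3"   # best-effort ceiling, NOT a guarantee
```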

There are two problems with this approach:

  1. The 3rd CPU might not be available when you need it.
  2. CPU limits are usually a bad practice on Kubernetes and can mess up P99 latency.

The second point is sometimes controversial, but the first is not.

CPU limits do not guarantee access to resources! They allow extra CPU usage on a best-effort basis. Only the request of 2 CPUs is guaranteed.

Setting a request of 2 and a limit of 3 means you are optimizing for utilization at the expense of reliability! Sometimes you will need CPU but not be able to access it. This will almost certainly increase latency. For some apps, the damage to reliability can be greater.

Overcommit with CPU request = 2, no limit

This is only slightly better than the above strategy. You've gotten rid of the CPU limit, but you are still not guaranteed 3 CPUs.
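Concretely, this just means dropping the `limits` stanza entirely (a sketch, same illustrative pod as above):

```yaml
# Strategy 2: request 2 CPUs, set no limit (illustrative sketch)
resources:
  requests:
    cpu: "2"   # only 2 CPUs are guaranteed by the scheduler
  # no limits: the container may burst above 2 CPUs,
  # but only on a best-effort basis
```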

To be guaranteed 3 CPUs during a spike, you need the Kubernetes scheduler to permanently reserve 3 CPUs for you. You won't use the third CPU most of the time, but it must be reserved in advance (and not promised to other pods). That is the only way to guarantee it is available when necessary.

A CPU request of 2 again chooses utilization over reliability.

Underutilize with CPU request = 3

Now we're talking! For the first time, you are guaranteed 3 CPU during a spike.

This works because of a little-known fact: CPU requests are enforced at runtime! CPU requests are not just for scheduling.

A CPU request of 3 is a hard promise by Kubernetes that you will get 3 CPUs when you need them.
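In spec form, the reliable-but-wasteful option is simply (illustrative sketch):

```yaml
# Strategy 3: request the peak, not the typical usage (illustrative)
resources:
  requests:
    cpu: "3"   # reserved on the node even while the pod idles at 2
```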

The downside is decreased utilization and wasted compute capacity. By setting aside a 3rd CPU for your pod, you are "using up" that CPU from a cluster-scheduling perspective. The Kubernetes scheduler will not schedule other pods to the node if they need that CPU to guarantee their request. Each CPU unit can be assigned to one and only one pod by the Kubernetes scheduler.

On the other hand, from a Linux scheduling perspective on the node itself, this CPU is available when not in use. That is, other pods already on the same node will be able to use this CPU when free.

The node will be underutilized when the pod isn't spiking. If all the pods in your cluster behaved like this spiky pod, you would have roughly 67% utilization most of the time! That's a lot of wasted compute capacity! (2 CPUs in use out of 3 requested = 2/3 ≈ 67%.)

In short, a CPU request of 3 chooses reliability over utilization.

The fundamental tradeoff of utilization vs reliability

Size dynamically with autoscaling

Is autoscaling a solution?

Yes! In fact, autoscaling is the textbook solution for this problem. But autoscaling doesn't solve the fundamental tradeoff, rather it decreases the scope of it.
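For example, a minimal HorizontalPodAutoscaler using the `autoscaling/v2` API might look like this (the names, replica counts, and target value are illustrative assumptions, not a recommendation):

```yaml
# Minimal HPA sketch: scale on average CPU utilization (illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spiky-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spiky-app
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas above 70% average CPU
```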

The problem is that autoscaling has limitations. They mostly result from the fact that scaling workloads is not immediate. There can be significant lead time when you create or delete pods. (Your definition of "significant" will vary widely depending on your industry and your pods' behaviour.)

Here are some potential pain points with Kubernetes autoscaling. They too result from the fundamental tradeoff of utilization vs reliability:

  1. Upscaling lag: there is always a delay between the moment you scale up and the moment new capacity is actually available. If you scale up too early you waste compute capacity; if you scale up too late you harm reliability.
  2. Downscaling lag: even a well tuned autoscaler waits before scaling down. Autoscalers can't predict the future, so they can't know for certain when a spike has really ended. The deliberate lag increases reliability, but the area under the curve is wasted compute capacity.
  3. Complexity: this is the biggest downside for many. Autoscalers add moving parts, and tuning them well is hard. A more accurate title for this post would be "choose two out of three: utilization, reliability, and simplicity".
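The downscaling lag above, for instance, is something you tune rather than eliminate. A sketch using the HPA `behavior` field (all values illustrative):

```yaml
# Tuning downscaling lag on an HPA (autoscaling/v2), values illustrative
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 50          # remove at most 50% of replicas...
          periodSeconds: 60  # ...per 60-second window
```

A longer stabilization window buys reliability (fewer premature scale-downs) at the cost of utilization (capacity held longer than needed), which is the same tradeoff again.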

Some optimism about the future

Kubernetes will soon have an alpha feature that allows adjusting pod resources at runtime without restarting the pod. This won't change the fundamental tradeoff. (What happens if you can't increase requests because of other pods on the node?) However, it does provide ammunition to the utilization camp.

The utilization vs reliability tradeoff has a simple root cause: scaling pods is a non-immediate operation. Maybe technologies like WebAssembly and V8 isolates can change that?

For now, look closely at your requirements, understand the tradeoff, and choose where you want to be.

About this blog

This blog is brought to you by Robusta.dev. We develop an open source project for improving Prometheus alerts by running commands like `kubectl logs` and attaching their output to your alerts. Get better notifications about alerts and crashing pods in Slack, MSTeams, OpsGenie, and more.