Kubernetes
November 23, 2022

4 Ways Pods Suddenly Stop Running on Kubernetes

Most people are familiar with CrashLoopBackOffs, but there are actually many ways a pod can unexpectedly stop running. Here are the top four:

  1. OOM Kills
  2. CrashLoopBackOff
  3. Init:CrashLoopBackOff
  4. Evictions

OOM Kills

When OOM Kills occur: A pod uses “too much” memory. That is, more memory than its limit or more memory than is available on the node.

How OOM Kills work: The Linux kernel kills the process, causing an OOMKill (Out of Memory Kill). No warning is given, and Kubernetes has little control over this process.

Why Kubernetes does OOM Kills: You’re out of memory. Strictly speaking, Kubernetes doesn’t kill the pod; the Linux kernel does.
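
To see the mechanics for yourself without the demo linked below, here is a minimal sketch: a container that deliberately allocates more memory than its limit allows. The pod name, image, and numbers are illustrative, not taken from this article's demo.

```
# Illustrative only: the container is allowed 128Mi of memory but tries to
# allocate 256M, so the kernel OOM-kills it and Kubernetes reports OOMKilled.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: oomkill-sketch
spec:
  restartPolicy: Never
  containers:
  - name: hog
    image: polinux/stress        # any image that can allocate memory works
    command: ["stress", "--vm", "1", "--vm-bytes", "256M", "--vm-hang", "1"]
    resources:
      requests:
        memory: "64Mi"
      limits:
        memory: "128Mi"
EOF
```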

Deliberately reproducing OOM Kills: Run `kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/oomkills_demo/oomkill_job.yaml`

What it looks like:

Slack alert with details about the OOMKill

Questions to ask when troubleshooting OOM Kills:

  • What does a graph of memory usage look like?
  • Did this pod go over the limit, or did the node run out of memory?
  • Is there a leak in the application?
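
A few commands help answer the first two questions (pod and node names are placeholders, and `kubectl top` assumes metrics-server is installed):

```
# Was the last restart an OOMKill? Look for "OOMKilled".
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# What memory limit does the container have?
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.memory}'

# Current usage, to compare against the limit
kubectl top pod <pod-name>

# Was the node itself under memory pressure?
kubectl describe node <node-name> | grep -A 8 Conditions
```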

CrashLoopBackOff

When CrashLoopBackOff occurs: Every time a pod crashes, Kubernetes restarts it after a delay. The delay between restarts is called the backoff time, and it grows exponentially (starting at ten seconds and capped at five minutes). After enough crashes, the pod ends up in the CrashLoopBackOff state.

How CrashLoopBackOff works: The pod goes into the CrashLoopBackOff state and remains there until the backoff period ends, at which point the kubelet tries to start the container again.

Why Kubernetes does CrashLoopBackOff: Your pod is crashing repeatedly and immediate restarts aren't helping. To avoid pointless back-to-back crashes, Kubernetes waits a little in the hope that things will improve (e.g., if the pod is crashing because an external service is down and that service comes back up).
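
For intuition, a minimal sketch of a pod that ends up in CrashLoopBackOff: its only container exits with an error every time it starts, so the kubelet keeps restarting it with a growing delay. Names are illustrative; the demo command below achieves the same thing.

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-sketch
spec:
  # restartPolicy defaults to Always, so the kubelet keeps restarting the container
  containers:
  - name: crasher
    image: busybox
    command: ["sh", "-c", "echo 'simulating a crash' && exit 1"]
EOF
```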

Deliberately reproducing CrashLoopBackOff: Run `kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/crashloop_backoff/create_crashloop_backoff.yaml`

What it looks like:

CrashLoopBackOff alert on Slack

Questions to ask when troubleshooting CrashLoopBackOffs:

  • Why did the process inside the container terminate?
  • What errors are in the pod log?
  • Is the issue an application bug or a Kubernetes infrastructure issue?
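
Commands that usually answer these questions (the pod name is a placeholder):

```
# Exit code, restart count, and recent events
kubectl describe pod <pod-name>

# Logs from the current container instance
kubectl logs <pod-name>

# Logs from the previous, already-crashed instance
kubectl logs <pod-name> --previous
```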

Init:CrashLoopBackOff

What are Init-containers: Init-containers perform preparations before your main container runs. Your main containers start only after every Init-container has exited successfully.
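
For reference, a minimal sketch of where Init-containers live in a pod spec; the names, images, and commands are illustrative:

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: init-sketch
spec:
  initContainers:
  - name: prepare              # runs to completion before the main container starts
    image: busybox
    command: ["sh", "-c", "echo preparing && sleep 2"]
  containers:
  - name: app
    image: nginx
EOF
```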

When Init:CrashLoopBackOff occurs: When an Init-container has a CrashLoopBackOff, this is called – you guessed it – an Init:CrashLoopBackOff.

Deliberately reproducing Init:CrashLoopBackOff: Run `kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/init_crashloop_backoff/create_init_crashloop_backoff.yaml`

What it looks like: 

Logs of a failed Init-container

Questions to ask when troubleshooting Init:CrashLoopBackOff:

Get the logs, just like for a regular CrashLoopBackOff, but make sure you specify the Init-container you want logs for: `kubectl logs <pod-name> -c <init-container>`. Then try to work out whether this is an issue with the Init-container itself (an application issue) or a Kubernetes infrastructure issue.
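
A couple of commands that help (names are placeholders; `--previous` fetches logs from the last failed run):

```
# Logs from the failing Init-container
kubectl logs <pod-name> -c <init-container-name> --previous

# Which Init-container failed, its exit code, and related pod events
kubectl describe pod <pod-name>
```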

Pod Evictions

When evictions occur: A node runs out of resources, so the kubelet starts terminating pods to reclaim them for essential processes. Alternatively, you can use the Eviction API to terminate a pod manually. Finally, pods can be evicted because of pod priorities, as illustrated below.

Why Kubernetes evicts pods: Node-pressure evictions happen to protect the node itself. Other evictions, whether through the Eviction API or pod priorities, happen for reasons you choose.
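
For context, priority-based evictions rely on a PriorityClass. Below is a minimal sketch of how one is defined and attached to a pod; the names and the priority value are illustrative and are not the ones used in the demo that follows.

```
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important
value: 1000000                 # higher value = higher priority
globalDefault: false
description: "Illustrative high-priority class"
---
apiVersion: v1
kind: Pod
metadata:
  name: important-pod
spec:
  priorityClassName: important # lower-priority pods may be evicted to make room
  containers:
  - name: app
    image: nginx
EOF
```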

How you can deliberately reproduce Pod Evictions: 

In order to reproduce an eviction without causing chaos in your cluster, we’re going to pick a victim node that will be interrupted during the process. Don’t try this on a production cluster!

  1. Stop new pods from running on the node: `kubectl taint nodes <node_name> key1=value1:NoSchedule`.
  2. Drain the pods already running on the node: `kubectl drain <node_name>`. You might need additional arguments depending on your setup, e.g. `--delete-emptydir-data` or `--ignore-daemonsets`.
  3. Make the node schedulable again: `kubectl uncordon <node_name>`.
  4. Run a pod that uses many resources and will only run on that node: `kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/evictions/first_pod.yaml`. This pod should request more than half the CPU on the node; adjust the YAML if necessary.
  5. Create a new pod priority level: `kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/evictions/priority.yaml`.
  6. Create a new pod with high priority that can only run on our victim node; it will evict the previous pod: `kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/evictions/pod_with_priority.yaml`. If you adjusted the CPU request for the previous pod, adjust it here too. The two pods need to be mutually exclusive: only one should be able to run on the victim node at a time.

What it looks like:

Kubernetes events showing evicted pod

Questions to ask yourself to troubleshoot:

  • Does the pod have the right tolerations?
  • Why was the pod evicted?
  • Was the issue pod priorities or node pressure?
  • Was this eviction really desirable and healthy, or was it an accidental and unwanted event?
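
Commands that help answer these questions (pod and node names are placeholders):

```
# The eviction reason and message are recorded on the failed pod
kubectl describe pod <pod-name>

# Recent eviction events in the namespace
kubectl get events --field-selector reason=Evicted

# Node conditions such as MemoryPressure or DiskPressure
kubectl describe node <node-name>
```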

Questions? Comments?

Yes, please. I’m on Twitter and LinkedIn. Also, check out Robusta.dev to get notified when problems like these occur in your cluster.
