The term "root cause analysis" makes most people think of complex machine-learning algorithms.
But the core idea is very simple. What's more, root cause analysis is becoming increasingly important as digital systems grow more complex.
For example, imagine an application that experiences CPU throttling. Back in the old days, the cause of this was simple. The server didn't have enough CPUs and you needed to either optimize the application or put it on a bigger server. However, in a modern Kubernetes cluster, the cause is not obvious and neither is the fix.
What is root cause analysis?
Imagine the following: you plug in an electric kettle, a fuse blows, and the power goes out. You obviously suspect the kettle, because plugging it in was the last thing that changed before the failure.
Root cause found!
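The kettle reasoning can be sketched in a few lines of Python. This is only an illustration of the "suspect the most recent change" heuristic, not a real diagnostic tool. The function name `suspects`, the five-minute window, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def suspects(changes, failure_time, window=timedelta(minutes=5)):
    """Return changes made within `window` before the failure, newest first.

    `changes` is a list of (timestamp, description) tuples.
    """
    recent = [
        change for change in changes
        if timedelta(0) <= failure_time - change[0] <= window
    ]
    return sorted(recent, key=lambda change: change[0], reverse=True)


# Hypothetical timeline for the kettle example.
changes = [
    (datetime(2024, 1, 1, 8, 0), "turned on the lights"),
    (datetime(2024, 1, 1, 9, 59), "plugged in the kettle"),
]
failure = datetime(2024, 1, 1, 10, 0)

top = suspects(changes, failure)
for when, what in top:
    print(when.isoformat(), what)
```

Only the kettle falls inside the window, so it is the only suspect. The hard part in practice is not this filter but collecting a complete list of changes in the first place.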
Why root cause analysis matters in Kubernetes
Most Kubernetes clusters are frenetic, energetic places. Every now and then, some human on one of the many teams working on the cluster updates a line of YAML. Long-sleeping Kubernetes controllers then wake up and work slavishly to make the intentions conveyed in that YAML true. They spin up pods, rapidly pulling images, mounting volumes, and even terminating other pods if necessary, all in a mad rush to make status equal spec.
But when something goes wrong, can we turn back the arrow of time and see which human-made change triggered the problem?
Indeed we can! Let's see a few ways to do so.
Sources of truth
The four standard ways of monitoring changes to a Kubernetes cluster are:
- Instrument your CI/CD pipeline
- Use GitOps
- Use the Kubernetes Audit API
- Connect to the API Server and listen for changes
The order in the list above is deliberate. Each method sees more types of changes than the previous one.
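As a rough illustration of the last method, the sketch below filters a stream of watch-style events down to the ones where an object's spec changed. It assumes each event is a plain dictionary with `type` and `object` keys, similar in shape to what the API server's watch endpoint returns. It relies on `metadata.generation`, which for resources such as Deployments increases when the spec changes but not when only the status is updated. The function name `spec_changes` and the sample events are hypothetical, and how the events are fetched from a real cluster is left out.

```python
def spec_changes(events):
    """Yield MODIFIED events whose metadata.generation increased.

    Assumption: each event looks like {"type": ..., "object": {"metadata": {...}}}.
    """
    last_generation = {}  # uid -> last generation seen
    for event in events:
        if event.get("type") not in ("ADDED", "MODIFIED"):
            continue  # ignore DELETED, BOOKMARK, ERROR in this sketch
        meta = event["object"]["metadata"]
        uid = meta["uid"]
        generation = meta.get("generation", 0)
        previous = last_generation.get(uid)
        last_generation[uid] = generation
        if event["type"] == "MODIFIED" and previous is not None and generation > previous:
            yield event


def make_event(kind, generation):
    """Build a hypothetical event for a single Deployment named 'web'."""
    return {
        "type": kind,
        "object": {"metadata": {"uid": "abc-123", "name": "web", "generation": generation}},
    }


sample_events = [
    make_event("ADDED", 1),
    make_event("MODIFIED", 1),  # status-only update, generation unchanged
    make_event("MODIFIED", 2),  # someone edited the spec
]

changed = list(spec_changes(sample_events))
for event in changed:
    meta = event["object"]["metadata"]
    print(meta["name"], "spec changed, generation", meta["generation"])
```

Filtering on generation keeps the noise down: controllers update status constantly, while spec changes usually trace back to a person or a pipeline.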
