The term "root cause analysis" makes most people think of complex machine-learning algorithms.
But the core idea is very simple. What's more, root cause analysis is increasingly important as digital systems become more complex.
For example, imagine an application that experiences CPU throttling. Back in the old days, the cause was simple: the server didn't have enough CPUs, and you needed to either optimize the application or put it on a bigger server. In a modern Kubernetes cluster, the cause is not obvious and neither is the fix. The throttling is driven by the pod's own CPU limit, and that limit could have been changed by any one of a dozen recent deployments.
Let's look at an everyday example. Then we'll look at root cause analysis in Kubernetes.
What is root cause analysis?
Imagine the following: you plug in an electric kettle, a fuse blows, and the power goes out. You obviously suspect the kettle. That reasoning is root cause analysis in miniature: you work backwards from the symptom (the power went out) through the mechanism (the fuse blew) to the change that triggered it (plugging in the kettle).
Root cause found!
Why root cause analysis matters in Kubernetes
Most Kubernetes clusters are frenetic, energetic places. Every now and then, a human on one of the many teams working on the cluster updates a line of YAML. Long-dormant Kubernetes controllers wake up. They work slavishly to make the intentions conveyed in that YAML true. They spin up pods, rapidly pulling images, mounting volumes, and even terminating other pods if necessary. All in a mad rush to make status equal spec.
But when something goes wrong, can we turn back the arrow of time and see which human-made change triggered the problem?
Indeed we can! Let's see a few ways to do so.
Sources of truth
The four standard ways of monitoring changes to a Kubernetes cluster are:
- Instrument your CI/CD pipeline
- Use GitOps
- Use the Kubernetes Audit API
- Connect to the API Server and listen for changes
The order in the list above is deliberate. Each method sees more types of changes than the one before it. Let's look at each in turn.
Instrumenting your CI/CD pipeline
All I have to say on this method is as follows: Don't use it.
This method is inferior to every other method in the list: it only sees changes that go through your pipeline, and it offers little that the other methods don't. It is listed only for the sake of completeness.
Using GitOps
Now we're talking.
If you use GitOps, you already have a full history of all the desired changes to your cluster.
If you want to see what was supposed to run in your cluster last Thursday, you simply open up GitHub and look at the history. Or run `git log` in your terminal.
There are only two downsides to this approach:
- It shows what was supposed to run (i.e. the desired state) but it does not show you what actually happened in the cluster
- You do not have visibility into manual bypasses and overrides. If a naughty developer ran `kubectl edit` in production, you cannot see what they did.
All the same, you definitely should use GitOps! The methods that follow do not replace GitOps as a way of deploying software; rather, they add complementary visibility into the changes that actually happen in your cluster.
Using the Kubernetes Audit API
With the audit API, we arrive at the first solution that was built precisely for our needs. The Kubernetes API server can record every request it receives, including who made the change, what was changed, and when, according to an audit policy that you define.
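If you manage your own control plane, a minimal audit policy looks something like the sketch below (the kube-apiserver reads it via the `--audit-policy-file` flag and writes events to the file given by `--audit-log-path`; managed clusters usually expose auditing through their own settings instead):

```yaml
# audit-policy.yaml - a minimal example policy
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log the full request and response body for changes to common workloads
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "apps"
        resources: ["deployments", "statefulsets", "daemonsets"]
  # Ignore everything else to keep the audit log small
  - level: None
```

The downside is that the output is a raw stream of JSON events, so you still need something on top of it to collect, filter, and display the changes.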
Tracking changes in Kubernetes
This brings us to the last method on the list: connecting to the API server and listening for changes, which is exactly what KubeWatch does. We took KubeWatch and added an extra layer on top for common use cases.
There are four variations on this, depending on where you send the data.
Option 1: UI for Kubernetes change history
This is the easiest to set up. You run one Helm command to install Robusta and its bundled Prometheus stack. Now you have a single dashboard with all changes and alerts across your clusters.
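Roughly, the install looks like this (the command and the values excerpt below are assumptions to check against Robusta's installation docs; generated_values.yaml is the config file produced by the Robusta CLI):

```yaml
# Assumed install command, run once per cluster:
#   helm install robusta robusta/robusta -f generated_values.yaml
#
# generated_values.yaml (excerpt) - this assumed flag bundles the Prometheus stack:
enablePrometheusStack: true
```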
Option 2: Grafana Dashboards
Let's take existing Grafana dashboards and use annotations to show when applications were updated.
This takes eight lines of YAML to configure with Robusta:
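Roughly, the playbook looks like the sketch below, assuming Robusta's customPlaybooks format; the Grafana URL, API key, and dashboard UID are placeholders, and the parameter names should be checked against the add_deployment_lines_to_grafana documentation:

```yaml
customPlaybooks:
  - triggers:
      - on_deployment_update: {}
    actions:
      - add_deployment_lines_to_grafana:
          grafana_url: https://grafana.example.com  # placeholder
          grafana_api_key: <GRAFANA_API_KEY>        # placeholder
          grafana_dashboard_uid: <DASHBOARD_UID>    # placeholder
```

Whenever a Deployment changes, an annotation is added to the dashboard, so spikes in your graphs line up visually with the deploys that caused them.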
This works by connecting the `on_deployment_update` trigger to the `add_deployment_lines_to_grafana` action.
Option 3: Slack notifications
This is the same as above, but we're sending the result to Slack and not Grafana.
Here is the YAML configuration for this:
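A sketch, assuming Robusta's standard customPlaybooks and sinksConfig layout; the sink name and Slack channel below are placeholders:

```yaml
sinksConfig:
  - slack_sink:
      name: change_tracking_sink         # placeholder name
      slack_channel: kubernetes-changes  # placeholder channel
      api_key: <SLACK_API_KEY>

customPlaybooks:
  - triggers:
      - on_deployment_update: {}
    actions:
      - resource_babysitter: {}
    sinks:
      - change_tracking_sink             # route this playbook's output to the Slack sink
```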
This works by connecting the `on_deployment_update` trigger to the `resource_babysitter` action and sending the result to the Slack sink. You could just as easily send the output to MS Teams, Telegram, DataDog, OpsGenie, or any other supported sink.
Option 4: Reverse GitOps
This one is a little unusual, but we can send the same change data to a git repository.
Usually, git repositories are used with GitOps as the source of truth for YAML files. However, git is also a convenient place to store audit data: a record of every change that actually happened in your cluster.
Every time someone makes a change, whether it's an ad-hoc change or a planned deployment, it's written to a git repository at a path determined by the cluster name, namespace, and resource type.
Here is how we configure this Robusta automation:
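Here is a sketch, again assuming the customPlaybooks format; the git connection parameters (cluster_name, git_url, git_key) are assumptions and should be verified against the git_change_audit action's documentation:

```yaml
customPlaybooks:
  - triggers:
      - on_kubernetes_any_resource_all_changes: {}
    actions:
      - git_change_audit:
          cluster_name: prod-cluster                              # assumed parameter
          git_url: git@github.com:example-org/cluster-audit.git   # assumed parameter
          git_key: <SSH_DEPLOY_KEY>                               # assumed parameter
```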
Like the other examples, we're hooking up a trigger to an action. The trigger is broader here: `on_kubernetes_any_resource_all_changes`. The action is `git_change_audit`.
Summary
Hopefully this will help you set up change tracking on your Kubernetes cluster. Good luck!