Kubernetes
April 12, 2022

Why and How to Audit Kubernetes Changes (Updated)


The term "root cause analysis" makes most people think of complex machine-learning algorithms.

But the core idea is very simple. What's more, root cause analysis is becoming increasingly important as digital systems grow more complex.

For example, imagine an application that experiences CPU throttling. Back in the old days, the cause of this was simple. The server didn't have enough CPUs and you needed to either optimize the application or put it on a bigger server. However, in a modern Kubernetes cluster, the cause is not obvious and neither is the fix.
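For instance, in a hypothetical deployment like the one sketched below, throttling can be caused by a single line: the container's CPU limit. Lower it, and the app is throttled even when the node has plenty of idle CPU (all names and values here are placeholders):

# Hypothetical Deployment: the container is throttled whenever it tries to use
# more than 250m of CPU, no matter how much idle CPU the node has.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.2.3   # placeholder image
          resources:
            requests:
              cpu: 100m
            limits:
              cpu: 250m         # lowering this one line is enough to cause throttling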

Let's look at an everyday example. Then we'll look at root cause analysis in Kubernetes.

What is root cause analysis?

Imagine the following: you plug in an electric kettle, a fuse blows, and the power goes out. The outage is the symptom; the kettle that just overloaded the circuit is the cause, and you obviously suspect it.

Root cause found!

Why does root cause analysis matter in Kubernetes?

Most Kubernetes clusters are frenetic, energetic places. Every now and then, someone on one of the many teams working on the cluster updates a line of YAML. Long-dormant Kubernetes controllers wake up and work slavishly to make the intentions conveyed in that YAML true. They spin pods up, rapidly pulling images, mounting volumes, and even terminating other pods if necessary. All in a mad rush to make status equal spec.
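To make "status equal spec" concrete, here is an abridged view of a Deployment mid-rollout, roughly what you would see with kubectl get deployment -o yaml (the values are illustrative; the field names are real Deployment fields):

# Abridged Deployment output mid-rollout: the controllers keep working until
# the status fields catch up with the spec.
spec:
  replicas: 5            # the intent someone just wrote in YAML
status:
  replicas: 5
  updatedReplicas: 5
  readyReplicas: 3       # still catching up - status does not equal spec yet
  availableReplicas: 3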

But when something goes wrong, can we turn back the arrow of time and see which human-made change triggered the problem?

Indeed we can! Let's look at a few ways to do so.

Sources of truth

The four standard ways of monitoring changes to a Kubernetes cluster are:

  1. Instrument your CI/CD pipeline
  2. Use GitOps
  3. Use the Kubernetes Audit API
  4. Connect to the API Server and listen for changes

The order in the list above is deliberate. Each method sees more types of changes than the previous one. Let's look at each method in turn.

Instrumenting your CI/CD pipeline

All I have to say on this method is as follows: Don't use it.

This method is inferior to every other method in the list and has little benefit: it only sees changes that flow through your pipeline, so manual changes and changes made by other controllers are invisible to it. It is listed only for the sake of completeness.

Using GitOps

Now we're talking.

If you use GitOps, you already have a full history of all the desired changes to your cluster.

If you want to see what was supposed to run in your cluster last Thursday, you simply open up GitHub and look at the history. Or run `git log` in your terminal.

There are only two downsides to this approach:

  1. It shows what was supposed to run (i.e. the desired state), but it does not show you what actually happened in the cluster.
  2. You do not have visibility into manual bypasses and overrides. If a naughty developer ran `kubectl edit` in production, you cannot see what they did.

All the same, you definitely should use GitOps! The methods that follow do not replace GitOps as a method of deploying software; rather, they add complementary visibility into actual cluster changes.

Using the Kubernetes Audit API

With the audit API, we arrive at the first solution that was built precisely for our needs.
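In short, the API server can log every request it handles as an audit event: who made the request, when, on which resource, and what the outcome was. What gets recorded is controlled by an audit policy file passed to the kube-apiserver (via --audit-policy-file, with --audit-log-path for the log destination). Here is a minimal policy sketch, assuming you only care about writes to workloads:

# Minimal audit policy sketch: record who created, updated, or deleted workloads.
# How you pass this file to the kube-apiserver depends on how your cluster is provisioned.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log metadata (user, verb, resource, timestamp) for writes to workloads
  - level: Metadata
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "apps"
        resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
      - group: ""
        resources: ["pods"]
  # Drop everything else to keep the audit log small
  - level: None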

Tracking changes in Kubernetes

This is the fourth method on the list: connect to the API server and listen for changes. We took KubeWatch and added an extra layer on top of it for common use cases.

There are four variations to this, depending on where you send that data.

Option 1: UI for Kubernetes change history

This is the easiest to set up. You run one Helm command to install Robusta and its bundled Prometheus stack. Now you have a single dashboard with all changes and alerts across your clusters.


Option 2: Grafana Dashboards

Let's take existing Grafana dashboards and use annotations to show when applications were updated.

This takes eight lines of YAML to configure with Robusta:

customPlaybooks:
  - triggers:
      - on_deployment_update: {}
    actions:
      - add_deployment_lines_to_grafana:
          grafana_api_key: '********'
          grafana_dashboard_uid: 09ec8aa1e996d6ffcd6817bbaff4db1b
          grafana_url: http://grafana.namespace.svc

This works by connecting the on_deployment_update trigger to the add_deployment_lines_to_grafana action.

Option 3: Slack notifications

This is the same as above, but we're sending the result to Slack instead of Grafana.

Here is the YAML configuration for this:

customPlaybooks:
  - triggers:
      - on_deployment_update: {}
    actions:
      - resource_babysitter: {}
    sinks:
      - slack

This works by connecting the on_deployment_update trigger to the resource_babysitter action and sending the result to the Slack sink. You could just as easily send the output to MS Teams, Telegram, DataDog, OpsGenie, or any other supported sink.
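For the playbook above to have somewhere to send its output, a Slack sink named slack has to be defined in Robusta's Helm values. Roughly, it looks like the sketch below; the channel and API key are placeholders, and the exact field names may vary between Robusta versions, so check the docs for your release:

# Sketch of a Slack sink definition in Robusta's Helm values.
sinksConfig:
  - slack_sink:
      name: slack                       # referenced by the playbook's "sinks" list above
      slack_channel: my-alerts-channel  # placeholder channel
      api_key: '********'               # placeholder key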


Option 4: Reverse GitOps

This one is a little unusual, but we can send the same change data to a git repository.

Usually git repositories are used with GitOps as the source of truth for YAML files, but git is also a convenient place to store audit data about what actually changed in your cluster.

Every time someone makes a change, whether it's an ad-hoc change or a planned deployment, it's written to a git repository at a path determined by the cluster name, namespace, and resource type.

Here is how we configure this Robusta automation:
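A sketch of that playbook is below. The trigger and action names are the ones described next; the action's parameters (cluster_name, git_url, git_key) and their values are placeholders based on the Robusta docs, so verify them against your version:

# Sketch of the reverse-GitOps playbook: write every resource change to a git repo.
customPlaybooks:
  - triggers:
      - on_kubernetes_any_resource_all_changes: {}
    actions:
      - git_change_audit:
          cluster_name: my-cluster                           # placeholder; the repo path includes the cluster name
          git_url: "git@github.com:acme/cluster-audit.git"   # placeholder audit repo
          git_key: "-----BEGIN OPENSSH PRIVATE KEY-----..."  # placeholder deploy key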

Like the other examples, we're hooking up a trigger to an action. The trigger is broader here: on_kubernetes_any_resource_all_changes. The action is git_change_audit.

Summary

Hopefully this will help you set up change tracking on your Kubernetes cluster. Good luck!
