Don’t get me wrong: Prometheus is awesome, but having too many alerts is not awesome. If Slack dings and ten alerts drop into your alerting channel at once, five times a day, then something needs to change.
Most of the time you don’t act on alerts, because it’s just that alert again. You saw it yesterday and you’ll see it again tomorrow. It’s not critical, but you won’t delete it, because occasionally it means something.
Don’t worry, you’re not alone. The good news is: you’re not the problem. The way you handle alerts was designed in the 90s for monolithic applications, not microservices on Kubernetes.
On Kubernetes you run tens of applications, use countless technologies, and are almost always missing the actual data that you need to fix an alert.
For example, let’s say an alert comes in for HighCPUThrottling, the most common Prometheus CPU throttling alert on Kubernetes.
Guess what? You can’t fix it because you’re missing data. Even the most complex monitoring in the world can’t help you, because the solution depends on a number of simple questions:
Did you define CPU requests and/or limits for this application?
How much CPU is the application using compared to its request/limit?
What other applications are running on the same node?
No APM can help you, because APMs just draw graphs and the solution for HighCPUThrottling can’t be expressed in a graph.
To fix an alert like HighCPUThrottling, you need a decision tree, but traditional monitoring solutions don’t do decision trees. Neither does Prometheus. You simply can’t write a PromQL query that outputs what to do when HighCPUThrottling occurs.
The good news? Anyone who can write code and is an expert on Kubernetes scheduling could write a ten-line function to solve it. The function would take a bunch of inputs and output a bottom line like “increase your CPU request to 400 millicpu”. Actually, if someone wrote a function like that, you wouldn’t even need to be a Kubernetes scheduling expert or a Python wizard to use it. You could just read the output.
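To make this concrete, here is a minimal sketch of what such a function could look like. This is illustrative, not Robusta’s actual implementation: the function name, inputs, and the 25% headroom factor are all assumptions, and in practice the inputs would come from Prometheus metrics and the pod spec rather than being passed by hand.

```python
# Hypothetical sketch of the "ten-line function" described above.
# All names and thresholds are illustrative assumptions, not a real API.

def recommend_cpu_fix(cpu_request_m, cpu_limit_m, peak_usage_m, headroom=1.25):
    """Suggest a fix for CPU throttling, given values in millicpu.

    cpu_request_m / cpu_limit_m may be None if the pod spec omits them.
    peak_usage_m is the observed peak CPU usage.
    """
    if cpu_limit_m is None:
        return "No CPU limit is set; CFS throttling is likely caused elsewhere."
    if cpu_request_m is None:
        suggested = int(peak_usage_m * headroom)
        return f"Set a CPU request of {suggested}m so the scheduler reserves enough CPU."
    if peak_usage_m >= cpu_limit_m:
        suggested = int(peak_usage_m * headroom)
        return f"Raise your CPU limit to at least {suggested}m (peak usage is {peak_usage_m}m)."
    return "Usage is below the limit; check for short multi-threaded bursts within a single CFS period."

# Example: a pod with a 200m limit that peaks at 350m gets a concrete recommendation.
print(recommend_cpu_fix(cpu_request_m=100, cpu_limit_m=200, peak_usage_m=350))
```

The point isn’t this particular logic; it’s that the decision tree fits in a handful of `if` statements once the right data is in one place.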
For that matter, why not add code to that function that actually fixes the issue? Of course the code shouldn’t change your YAML files in the cluster because those will just be overridden next time you apply from git. But there’s no reason why it can’t tell you exactly what changes to your YAML would fix the annoying alert that keeps showing up.
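For example, the output could be rendered as the exact snippet to change in your manifest in git. The deployment name and numbers below are made up for illustration:

```yaml
# Illustrative suggestion: update the container's resources in git,
# then apply as usual. Names and values are hypothetical.
spec:
  template:
    spec:
      containers:
        - name: my-app
          resources:
            requests:
              cpu: 400m   # raised from 100m to match observed peak usage
            limits:
              cpu: 500m   # raised from 200m to stop CFS throttling
```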
It’s not just CPU throttling. The same is true for OOMKill (out-of-memory) alerts and many, many other alerts.
It’s time for a new generation of alerting tools which don’t just show you problems, but show you solutions. The way you work today just doesn’t work for microservices. There are too many problems and they’re too complex to solve with dumb graphs.
Don’t get me wrong, next generation monitoring needs something like Prometheus to trigger alerts, but it can’t stop there. We need a new generation of tools that analyze those alerts and recommend pinpoint fixes.
We built Robusta.dev to be that tool. It’s open source and does exactly what we described above. Obviously we’re biased, but we think it’s really, really good. Our users love it too.