Don’t get me wrong: Prometheus is awesome, but having too many alerts is not awesome. If Slack dings and ten alerts drop into your alerting channel at once, five times a day, then something needs to change.
Most of the time you don’t take action to fix alerts, because they’re just that alert again. You saw the alert yesterday and will see it again tomorrow. It’s not critical, but you won’t delete it because occasionally it means something.
Don’t worry, you’re not alone. The good news is: you’re not the problem. The way you handle alerts was designed in the 90s for monolithic applications, not microservices on Kubernetes.
On Kubernetes you run tens of applications, use countless technologies, and are almost always missing the actual data that you need to fix an alert.
For example, let’s say an alert comes in for HighCPUThrottling, the most common Prometheus CPU throttling alert on Kubernetes.
Guess what? You can’t fix it because you’re missing data. Even the most complex monitoring in the world can’t help you, because the solution depends on a number of simple questions:
- Did you define CPU requests and/or limits for this application?
- How much CPU is the application using compared to its request/limit?
- What other applications are running on the same node?
- What are those applications’ requests and limits?
- Are you using an auto-scaling strategy?
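
To make those questions concrete, here is a rough sketch of how they might map onto PromQL queries. It is an illustration, not a recipe: the function names `enrichment_queries` and `node_neighbors_query` are made up for this post, and the metric names assume a typical setup that scrapes cAdvisor and kube-state-metrics v2, so adjust them to whatever your cluster actually exposes.

```python
def enrichment_queries(namespace: str, pod: str, window: str = "5m") -> dict:
    """Build PromQL strings for the questions above (hypothetical helper).

    Assumes cAdvisor metrics (container_cpu_*) and kube-state-metrics v2
    metrics (kube_pod_*, kube_horizontalpodautoscaler_*) are being scraped.
    """
    sel = f'namespace="{namespace}", pod="{pod}"'
    return {
        # Share of CFS periods in which the pod was throttled.
        "throttled_ratio": (
            f"sum(rate(container_cpu_cfs_throttled_periods_total{{{sel}}}[{window}]))"
            f" / sum(rate(container_cpu_cfs_periods_total{{{sel}}}[{window}]))"
        ),
        # Did you define CPU requests and/or limits? (an empty result means no)
        "cpu_request": f'sum(kube_pod_container_resource_requests{{{sel}, resource="cpu"}})',
        "cpu_limit": f'sum(kube_pod_container_resource_limits{{{sel}, resource="cpu"}})',
        # How much CPU is the pod using? Compare this to the two values above.
        "cpu_usage": f"sum(rate(container_cpu_usage_seconds_total{{{sel}}}[{window}]))",
        # Which node is the pod on? The answer is in the `node` label.
        "node": f"kube_pod_info{{{sel}}}",
        # Is there an autoscaler in this namespace?
        "hpa": f'kube_horizontalpodautoscaler_spec_max_replicas{{namespace="{namespace}"}}',
    }


def node_neighbors_query(node: str) -> str:
    """CPU requests of every pod scheduled on the given node (hypothetical helper)."""
    return (
        "sum by (namespace, pod) "
        f'(kube_pod_container_resource_requests{{node="{node}", resource="cpu"}})'
    )


if __name__ == "__main__":
    # Example values only; "shop", "cart-abc" and "node-1" are placeholders.
    for name, query in enrichment_queries("shop", "cart-abc").items():
        print(f"{name}: {query}")
    print("neighbors:", node_neighbors_query("node-1"))
```

Even in this simplified form, answering one alert takes half a dozen queries across two metric sources, and the node question needs a follow-up query once you know which node the pod landed on.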
