As a DevOps engineer or SRE, you likely monitor dozens of applications and services with the goal of keeping them up and running all the time. To do so effectively, you probably rely on alerts from Prometheus and similar systems to get notified about important issues. But all too often those alerts are noisy, and instead of trusting them to tell you when something important happens, you find yourself ignoring them. When that happens, you and your team can miss critical alerts.
If you’ve ever muted a Prometheus notification channel because it was too noisy, this post is for you.
The “too many alerts” problem is often called alert fatigue: when you are bombarded with hundreds of alerts, the noise becomes overwhelming, and you start ignoring alerts altogether.
Here are three practical steps to reduce alert fatigue and help your teams identify real issues faster.
Step 1: Enrich Alerts with Additional Context
By default, alerts from AlertManager can lack context. Here is one such alert:
An alert like this forces your team to gather data before they can find the real issue. That is time-consuming, and most alerts need this kind of follow-up.
To make the process faster, you can automatically add context about the alert inside the notification. This will help your team decide if the alert is significant or not. Here are a couple of example alerts with more context.
You can achieve this with AlertManager’s notification templates, by configuring them to include additional fields. Here’s an example that adds the alert’s labels and a CPU-usage value taken from an annotation on the alerting rule:
templates:
  - "/path/to/custom/template.tmpl"

receivers:
  - name: 'team-slack'
    slack_configs:
      - channel: '#alerts'
        # AlertManager templates can't run PromQL directly, so the CPU value below is
        # read from an annotation ("current_cpu") set on the alerting rule itself
        text: |
          Alert: {{ .CommonLabels.alertname }}
          Instance: {{ .CommonLabels.instance }}
          CPU Usage: {{ .CommonAnnotations.current_cpu }}
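The current_cpu annotation above has to be set by the alerting rule itself, since notification templates can’t run queries. Here’s a rough sketch of such a rule; the alert name, threshold, and annotation name are illustrative, so adapt them to your own rules:

groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          # $value holds the result of the expr query at firing time
          current_cpu: "{{ $value | humanizePercentage }}"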
Robusta alerts are configured out of the box to include labels, as well as other data like application logs, Kubernetes events, and resource graphs. Here’s an example of an OOMKilled alert with additional enrichment.
Using Robusta OSS, you can get enriched alerts like these without any extra configuration. Learn how to get started by connecting your AlertManager with Robusta here.
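In practice, connecting AlertManager comes down to adding a webhook receiver that forwards a copy of each alert to the Robusta runner. A minimal sketch, assuming the runner is installed in the default namespace (check the linked guide for the exact service URL and recommended settings):

receivers:
  - name: 'robusta'
    webhook_configs:
      # assumes the Robusta runner service lives in the default namespace
      - url: 'http://robusta-runner.default.svc.cluster.local/api/alerts'
        send_resolved: true

route:
  routes:
    # send a copy of every alert to Robusta, then keep evaluating other routes
    - receiver: 'robusta'
      continue: true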
Step 2: Group Similar Alerts
Now that our alerts are enriched with extra information, each individual alert is clearer. But what about the total number of alerts? How can we reduce the volume?
One easy solution is to group similar alerts and thus reduce the total number of notifications.
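AlertManager has a built-in form of this: its routing tree can batch alerts that share label values into a single notification. For example:

route:
  receiver: 'team-slack'
  # batch alerts that share these label values into one notification
  group_by: ['alertname', 'namespace']
  # wait 30s for related alerts before sending the first notification for a group
  group_wait: 30s
  # send updates for an existing group at most every 5 minutes
  group_interval: 5m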
Robusta lets you group alerts based on the cluster, namespace, alert name, and more. Here is an example notification, grouping 1985 notifications into a single message.
As you can see, with Robusta your teams can receive fewer notifications, without missing anything.
Configuring alert grouping with Robusta OSS is as simple as defining a few Helm values. Follow the guide here for more information.
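As a rough idea of what that looks like, the grouping settings live under the sink definition in your Helm values. The keys below are illustrative rather than exact, so follow the linked guide for the supported schema:

sinksConfig:
  - slack_sink:
      name: main_slack_sink
      slack_channel: alerts
      api_key: <your-slack-api-key>
      # illustrative grouping settings; see the Robusta guide for the exact options
      grouping:
        group_by:
          - cluster
          - namespace
        interval: 900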
Step 3: Dynamically Route Alerts
Another common mistake that leads to alert fatigue is sending teams alerts that are not relevant to their work.
For example, a support team shouldn't receive alerts when they are off-shift. And a team responsible for a Payment Processor service should not receive alerts related to the Login service maintained by a different team.
In both these cases, it's important to route alerts to the right people at the right time, so that each team gets fewer alerts in total, and every alert is relevant.
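If you're doing this with plain AlertManager, the same logic maps onto the routing tree plus time intervals. A sketch, with made-up team and service names:

route:
  receiver: 'default-receiver'
  routes:
    # payment alerts go only to the payments team
    - matchers:
        - service="payment-processor"
      receiver: 'payments-team'
    # login alerts go only to the identity team, and are muted off-shift
    - matchers:
        - service="login"
      receiver: 'identity-team'
      mute_time_intervals:
        - off-hours

time_intervals:
  - name: off-hours
    time_intervals:
      - times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '18:00'
            end_time: '24:00'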
Here’s a flow-chart showing the logic for an example routing strategy:
Using Robusta OSS, it is easy to configure routing rules like the above. For more details and practical examples, check out the Notification & Routing section in the docs.
Improving your own Prometheus Alerts with Robusta Open Source
Try improving your own alerts with the three steps covered in this post.
By spending just a few minutes on improving alerts, you can have a huge impact on the notifications received by dozens of people in your company every day. This effort will pay for itself within a short time.
What’s more, by reducing alert fatigue, you’ll reduce the chance of missing a critical notification about something that could bring production down.
Robusta Open Source is freely available under the MIT license. Check out the documentation or the GitHub repo. You can also join our Slack community and ask questions if you need help.