Complex rules/filters for Prometheus-Alertmanager Alerts
Situation: I have Prometheus and Alertmanager set up to monitor, among other things, the CPU temperature of various devices. Alertmanager sends alerts from production devices to PagerDuty.
The devices I'm monitoring come in different models with different operating specs. Normal CPU temp for models 1-5 is 50°C, while for model 6 it's 70°C. Currently the threshold for the CPU temp alerts is 60°C, so PagerDuty keeps getting alerts from model 6 devices that are operating at their normal temperature.
Is there a way to filter out CPU temp alerts from model 6 devices when the temp is below 80°C, while still getting CPU temp alerts for model 1-5 devices at 60°C?
Note: There are lots of other metrics being monitored, but for all of them other than CPU temp, all device models have the exact same thresholds.
Here is a snippet from my alertmanager.yml that sends prod alerts to PagerDuty:
- match:
    stack_name: prod
    severity: critical
  receiver: PagerDuty
Admittedly, I don't have a great deal of YAML experience, but this is what I'm hoping to do; I'm not sure of the correct syntax:
- match:
    stack_name: prod
    severity: critical
    alertname: !device_cpu_temperature
  receiver: PagerDuty
- match:
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: !*6X*
  receiver: PagerDuty
- match:
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: *6X*
    value: >80
  receiver: PagerDuty
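For reference, Alertmanager's `match` blocks don't support `!` negation or `*` globs (regular expressions go in a separate `match_re` block), and route matchers can only see labels, never the measured value, so `value: >80` can't be expressed at the routing layer at all. A hedged sketch of the closest routing-only workaround (the `uuid` regex and the pre-defined "null" receiver are assumptions), which would drop all model 6 CPU temp alerts regardless of temperature:

```yaml
routes:
  # Drop CPU temp alerts from model 6 devices at routing time.
  # The uuid pattern is an assumption; a "null" receiver with no
  # notification integrations must be defined under `receivers:`.
  - match:
      alertname: device_cpu_temperature
    match_re:
      uuid: .*6X.*
    receiver: "null"
  # Everything else goes to PagerDuty as before.
  - match:
      stack_name: prod
      severity: critical
    receiver: PagerDuty
```

Because routing cannot compare against the alert's value, a per-model threshold (page model 6 only above 80°C) really has to live in the Prometheus alerting rules themselves.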
Desired outcome: model 6 devices only page above 80°C, while models 1-5 still page at 60°C.

Or would it be better to have 2 different alert rules in Prometheus? Can certain rules be applied to only certain devices? If so, how?
The easier approach would be to create different alert rules in Prometheus. Alertmanager is only meant to send, group, and filter alerts, not to evaluate metrics. You can achieve this with two different alerts in your Prometheus configuration, filtering by hostname or any other label provided by the exporter.
The expression for servers 1-5 should be something like this:
- alert: ServiceProbeFailed
  expr: cpu_temperature{hostname!~".*server_6.*"} > 50
And the rule for server 6:
- alert: ServiceProbeFailed
  expr: cpu_temperature{hostname=~".*server_6.*"} > 70
The alerts have the same name, so Alertmanager will treat them as the same alert.
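Putting both rules into a complete Prometheus rule file, carrying the labels the Alertmanager route matches on, might look like the following sketch (the `for` duration, annotation text, and group name are assumptions, not from the original answer):

```yaml
groups:
  - name: cpu_temperature            # assumed group name
    rules:
      # Models 1-5: normal temp is ~50°C, page above the 60°C-style threshold.
      - alert: ServiceProbeFailed
        expr: cpu_temperature{hostname!~".*server_6.*"} > 50
        for: 5m                      # assumed debounce interval
        labels:
          stack_name: prod           # matched by the PagerDuty route
          severity: critical
        annotations:
          summary: "CPU temperature high on {{ $labels.hostname }}"
      # Model 6: normal temp is ~70°C, so it gets its own higher threshold.
      - alert: ServiceProbeFailed
        expr: cpu_temperature{hostname=~".*server_6.*"} > 70
        for: 5m
        labels:
          stack_name: prod
          severity: critical
        annotations:
          summary: "CPU temperature high on {{ $labels.hostname }}"
```

Because both rules share the alert name and labels, the existing route sends either one to PagerDuty unchanged, and only the Prometheus side knows about the per-model thresholds.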