
Complex rules/filters for Prometheus-Alertmanager Alerts

Situation: I have Prometheus and Alertmanager set up to monitor, among other things, the CPU temperature of various devices. Alertmanager sends alerts from production devices to PagerDuty.

The devices I'm monitoring come in different models with different operating specs. The normal CPU temp for models 1-5 is 50C, while for model 6 it's 70C. Currently the threshold for the CPU temp alerts is 60C, so PagerDuty keeps getting alerts from model 6 devices that are operating at their normal temperature.

Is there a way to filter out CPU temp alerts from model 6 devices only, when the temp is below 80C, and still get CPU temp alerts for model 1-5 devices at 60C?

Note: Lots of other metrics are being monitored, but for all of them other than CPU temp, all device models have exactly the same thresholds.

Here is a snippet from my alertmanager.yml that sends prod alerts to PagerDuty:

- match:
    stack_name: prod
    severity: critical
  receiver: PagerDuty

Admittedly, I don't have a great deal of YAML experience. This is what I'm hoping to do, but I'm not sure of the correct syntax:

- match:
    stack_name: prod
    severity: critical
    alertname: !device_cpu_temperature
  receiver: PagerDuty
- match:
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: !*6X*
  receiver: PagerDuty
- match: 
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: *6X*
    value: >80
  receiver: PagerDuty

Desired outcome:

  • All critical prod alerts except device_cpu_temperature are sent to PagerDuty.
  • Critical prod device_cpu_temperature alerts from devices whose model number isn't 6 are sent to PagerDuty as they are now (the uuid contains the model number followed by an 'X').
  • Critical prod device_cpu_temperature alerts from model 6 devices are sent to PagerDuty only if the CPU temp is above 80C.

Or would it be better to have 2 different alert rules in Prometheus? Can certain rules be applied to only certain devices? If so, how?

The easier approach is to create different alert rules in Prometheus.

Alertmanager is only meant to route, group, filter, and send alerts, not to evaluate metrics.
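To make that distinction concrete, here is a rough sketch of how far the routing in the question can go in valid Alertmanager syntax. It assumes Alertmanager 0.22 or later (for the `matchers` list) and that `uuid` arrives as a label on the alert; the `default` receiver is a hypothetical placeholder. Label matchers support negation and regular expressions, but there is no matcher that compares the alert's measured value.

```yaml
route:
  receiver: default              # hypothetical fallback receiver
  routes:
    # Critical prod alerts other than device_cpu_temperature.
    - matchers:
        - 'stack_name = "prod"'
        - 'severity = "critical"'
        - 'alertname != "device_cpu_temperature"'
      receiver: PagerDuty
    # device_cpu_temperature alerts from devices whose uuid does not contain "6X".
    # Regex matchers are anchored, hence the leading and trailing ".*".
    - matchers:
        - 'stack_name = "prod"'
        - 'severity = "critical"'
        - 'alertname = "device_cpu_temperature"'
        - 'uuid !~ ".*6X.*"'
      receiver: PagerDuty
    # There is no matcher form for "value > 80": routes only see labels,
    # so that comparison has to live in the Prometheus rule expression.

receivers:
  - name: default
  - name: PagerDuty              # pagerduty_configs omitted here
```

The third route from the question (model 6 above 80C) cannot be written this way, which is why the threshold belongs in the Prometheus rules.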

You can achieve this with two different alert rules in the Prometheus configuration, each filtering by hostname or any other label provided by the exporter.

The rule for servers 1-5 should be something like this (adjust the label and threshold to your setup):

 - alert: ServiceProbeFailed
   expr: cpu_temperature{hostname!~".*server_6.*"} > 50

And the rule for server 6:

 - alert: ServiceProbeFailed
   expr: cpu_temperature{hostname=~".*server_6.*"} > 70

Both rules use the same alert name, so any Alertmanager route that matches on alertname treats them the same way.
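Applied to the setup in the question, the two rules might look like the sketch below. Several things here are assumptions rather than facts from the original posts: the metric name `cpu_temperature` is borrowed from the answer's example, the `uuid` label and the "6X" pattern come from the question, the 60C and 80C thresholds are the ones the question asks for, and the group name, the `for: 5m` duration, and the `model_group` label are purely illustrative. It also assumes `stack_name` is already present on the series.

```yaml
groups:
  - name: device_cpu_temperature_rules   # hypothetical group name
    rules:
      # Models 1-5: every device whose uuid does NOT contain "6X".
      - alert: device_cpu_temperature
        expr: cpu_temperature{uuid!~".*6X.*"} > 60
        for: 5m                          # illustrative; tune or remove
        labels:
          severity: critical
          model_group: standard          # hypothetical label to tell the rules apart
        annotations:
          summary: "CPU temperature {{ $value }}C on {{ $labels.uuid }} (threshold 60C)"

      # Model 6: devices whose uuid contains "6X".
      - alert: device_cpu_temperature
        expr: cpu_temperature{uuid=~".*6X.*"} > 80
        for: 5m
        labels:
          severity: critical
          model_group: model6
        annotations:
          summary: "CPU temperature {{ $value }}C on {{ $labels.uuid }} (threshold 80C)"
```

With rules like these, the original `stack_name: prod` / `severity: critical` route in alertmanager.yml can stay as it is. Note that `.*6X.*` would also match a uuid containing something like "16X", so the pattern may need tightening if model numbers ever go past single digits. If a single rule is preferred, the two selectors can be combined in one expression with `or`, for example `cpu_temperature{uuid!~".*6X.*"} > 60 or cpu_temperature{uuid=~".*6X.*"} > 80`.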
