
Complex rules/filters for Prometheus-Alertmanager Alerts

Situation: I have Prometheus and Alertmanager set up to monitor, among other things, the CPU temperature of various devices. Alertmanager sends alerts from production devices to PagerDuty.

The devices I'm monitoring have different models with different operating specs. Normal CPU temp for models 1-5 is 50C, while for model 6 it's 70C. Currently the threshold for the CPU temp alerts is 60C, so PagerDuty keeps getting alerts from model 6 devices that are operating at their normal temperature.

Is there a way to filter out cpu temp alerts from only model 6 devices if the temp is below 80C and still get cpu temp alerts for model 1-5 devices at 60C?

Note: There are lots of other metrics that are being monitored, but for all of them other than CPU temp, all device models have the exact same thresholds.

Here is a snippet from my alertmanager.yml that sends prod alerts to PagerDuty

- match:
    stack_name: prod
    severity: critical
  receiver: PagerDuty

Admittedly, I don't have a great deal of YAML experience, but this is what I'm hoping to do; I'm not sure of the correct syntax:

- match:
    stack_name: prod
    severity: critical
    alertname: !device_cpu_temperature
  receiver: PagerDuty
- match:
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: !*6X*
  receiver: PagerDuty
- match: 
    stack_name: prod
    severity: critical
    alertname: device_cpu_temperature
    uuid: *6X*
    value: >80
  receiver: PagerDuty

Desired outcome:

  • All critical prod alerts except device_cpu_temperature are sent to PagerDuty
  • Critical prod device_cpu_temperature alerts are only sent to PagerDuty if the model number isn't 6 (uuid contains the model number followed by an 'X')
  • Critical prod device_cpu_temperature alerts from model 6 devices are sent to PagerDuty only if the cpu temp is above 80C.

Or would it be better to have 2 different alert rules in prometheus? Can certain rules be applied to only certain devices? If so, how?

The easier approach is to create different alert rules in Prometheus.

Alertmanager is only meant to route, group, silence, etc. alerts, not to evaluate metrics, so it cannot compare an alert's value against a threshold like >80 — its routes match on labels only.

You can achieve this with two different alerts in Prometheus configuration, filtering by hostname or any other label provided by the exporter.

The expression for servers 1-5 should be something like this:

 - alert: device_cpu_temperature
   expr: cpu_temperature{hostname!~".*server_6.*"} > 60

And the rule for server 6:

 - alert: device_cpu_temperature
   expr: cpu_temperature{hostname=~".*server_6.*"} > 80

Both alerts have the same name, so Alertmanager will treat them as the same alert.
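Putting the two rules together, a complete rule group might look like the sketch below. The stack_name and severity labels are assumptions chosen so the alerts match the PagerDuty route from the question; the metric name, the hostname pattern, the thresholds (60°C for models 1-5, 80°C for model 6, taken from the question), and the "for" duration are likewise illustrative:

```yaml
# prometheus_rules.yml -- a sketch; label names, metric name, thresholds,
# and the hostname pattern are assumptions based on the question above.
groups:
  - name: device_cpu_temperature
    rules:
      # Models 1-5: alert above 60C (everything except model 6 hosts)
      - alert: device_cpu_temperature
        expr: cpu_temperature{hostname!~".*server_6.*"} > 60
        for: 5m
        labels:
          stack_name: prod
          severity: critical
        annotations:
          summary: "High CPU temperature on {{ $labels.hostname }}"
      # Model 6: alert only above 80C
      - alert: device_cpu_temperature
        expr: cpu_temperature{hostname=~".*server_6.*"} > 80
        for: 5m
        labels:
          stack_name: prod
          severity: critical
        annotations:
          summary: "High CPU temperature on {{ $labels.hostname }}"
```

Because both rules carry stack_name: prod and severity: critical, the existing Alertmanager route sends them to PagerDuty unchanged, and no routing changes are needed.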
