How to snooze a Prometheus alert for a specific time

I have run into an issue with a Prometheus memory alert. When I take a backup of GitLab, memory usage goes up to 95%. I want to snooze the memory alert for a specific time.

E.g., if I take a backup at 2 AM, then I need to snooze the Prometheus memory alert during that time. Is it possible?

As Marcelo said, there is no way to schedule a silence, but if the backup runs at a regular interval (say every night from 2am to 3am), you can include that window in the alert expression. The condition hour() >= 2 <= 3 used below matches hours 2 and 3, that is 02:00 through 03:59.

- alert: OutOfMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 AND ON() absent(hour() >= 2 <= 3)
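
One way to check that the time condition behaves as intended is a promtool unit test. The sketch below assumes the OutOfMemory rule above is saved inside a rule group in a file named alerts.yml (a hypothetical file name), and relies on promtool starting its test clock at the Unix epoch (00:00 UTC), so an eval_time of 2h30m falls in hour 2. The instance label value is made up for illustration.

```yaml
# test.yml (hypothetical file name); run with: promtool test rules test.yml
rule_files:
  - alerts.yml              # assumed to contain the OutOfMemory rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 5% of memory available for the whole test, i.e. always below the 10% threshold
      - series: 'node_memory_MemAvailable_bytes{instance="host-a"}'
        values: '5+0x300'
      - series: 'node_memory_MemTotal_bytes{instance="host-a"}'
        values: '100+0x300'
    alert_rule_test:
      # 02:30 UTC is inside the backup window, so no alert is expected
      - eval_time: 2h30m
        alertname: OutOfMemory
        exp_alerts: []
      # 04:30 UTC is outside the window, so the alert is expected to fire
      - eval_time: 4h30m
        alertname: OutOfMemory
        exp_alerts:
          - exp_labels:
              instance: host-a
```

If the second case fails, the first thing to check is whether the rule in alerts.yml adds labels or annotations that are not listed under exp_alerts.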

Adding the time condition to every expression can rapidly become tedious if you want to silence many rules (or if you want more complex inhibition schedules). In that case, you can use the inhibition rules of Alertmanager in the following way.

The first step is to define an alert in Prometheus that fires during the time you want the inhibition to take place:

- alert: BackupHours
  expr: hour() >= 2 <= 3
  for: 1m
  labels:
    notification: none
  annotations:
    description: 'This alert fires during backup hours to inhibit others'

Remember to add a route in Alertmanager so that this alert does not send notifications:

routes:
  - match:
      notification: none
    receiver: do_nothing
receivers:
- name: do_nothing
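
For context, the fragment above sits under the top-level route of alertmanager.yml. A minimal sketch of the surrounding structure, assuming an existing default receiver called default (a hypothetical name; keep whatever receiver you already use):

```yaml
route:
  receiver: default          # hypothetical: your existing default receiver
  routes:
    # Alerts labelled notification=none are routed to a receiver with no integrations
    - match:
        notification: none
      receiver: do_nothing
receivers:
  - name: default
    # your existing notification settings (email, Slack, ...) go here
  - name: do_nothing         # no configs, so nothing is sent
```

Child routes are evaluated in order, so the notification: none route should come before any broader route that would also match the BackupHours alert.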

Then use inhibition rules to silence the target rules during that time:

inhibit_rules:
- source_match:
    alertname: BackupHours
  target_match:
    # here can be any other selection of alert
    alertname: OutOfMemory

Note that this only works out of the box for UTC. If you need to account for DST, it requires more boilerplate (with recording rules, for example).

As a side note, if you are monitoring your backup process, you may already have a metric that indicates the backup is under way. If so, you could use this metric to inhibit the other alerts, and you wouldn't need to maintain a schedule.
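
A minimal sketch of that metric-driven variant, assuming the backup job exposes a gauge named gitlab_backup_in_progress that is 1 while the backup runs. The metric name and the BackupInProgress alert name are hypothetical; substitute whatever your backup monitoring actually exports.

```yaml
# Prometheus rule file: an alert that is active only while the backup runs
groups:
  - name: backup-inhibition            # hypothetical group name
    rules:
      - alert: BackupInProgress
        # hypothetical gauge: 1 while the backup is running, 0 otherwise
        expr: gitlab_backup_in_progress == 1
        labels:
          notification: none           # reuses the do_nothing route shown earlier
        annotations:
          description: 'Fires while a backup is running, to inhibit other alerts'
---
# alertmanager.yml: mute OutOfMemory while BackupInProgress is firing
inhibit_rules:
  - source_match:
      alertname: BackupInProgress
    target_match:
      alertname: OutOfMemory
    # optional: only inhibit alerts from the same instance as the backup metric
    equal: ['instance']
```

With equal set, a memory alert on a host that is not running the backup still notifies, provided both alerts carry an instance label.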

No, it's not possible to have scheduled silences.

Some workarounds for your case:

1) Maybe you can change your Prometheus configuration and increase the "for" clause to give the backup more time to run without triggering the alert.

2) You can use the REST API to create/delete silences at the beginning/end of the backup.
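
As an illustration of the first workaround, a sketch of the memory alert with a longer "for" clause. The 90m value is an assumption about how long the backup takes; choose a duration comfortably longer than your own backup.

```yaml
- alert: OutOfMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
  # Assumption: the backup finishes in well under 90 minutes.
  # The condition must hold continuously for this long before the alert fires.
  for: 90m
```

The trade-off is that a genuine memory problem outside the backup window also takes 90 minutes to notify.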

See more info about this subject here.

You can compare current conditions with past values, so that the alert does not fire unless the metric is more than 2 times higher than it was at the same time on each of the past two days. A spike that recurs every day at the same time, such as a nightly backup, therefore does not trigger it.

      - alert: CPULoadAlert
        # Condition for alerting
        expr: >-
          node_load5 / node_load5 offset 1d > 2 and
          node_load5 / node_load5 offset 2d > 2 and
          node_load5 > 1
        for: 5m
        # Annotation - additional informational labels to store more information
        annotations:
          summary: 'Instance {{ $labels.instance }} got an unusual high load on CPU'
          description: '{{ $labels.instance }} of job {{ $labels.job }} got CPU spike over 2x compared to previous 2 days.'
        # Labels - additional labels to be attached to the alert
        labels:
          severity: 'warning'

I would like to comment on @Michael Doubez's answer, but I do not have enough points yet.

I am writing an exporter that signals that a maintenance window is active; that metric can then be used to inhibit alerts using an inhibit rule. You can define multiple maintenance windows with a good old-fashioned cron expression. See https://github.com/jzandbergen/maintenance-exporter
