
Alerts firing on Prometheus but not on Alertmanager

I can't seem to find out why Alertmanager is not getting alerts from Prometheus. I would appreciate swift assistance with this; I'm fairly new to using Prometheus and Alertmanager. I am using a webhook for MS Teams to push the notifications from Alertmanager.

alertmanager.yml

global:
  resolve_timeout: 5m


route:
  group_by: ['critical','severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert_channel'


receivers:
- name: 'alert_channel'
  webhook_configs:
  - url: 'http://localhost:2000/alert_channel'
    send_resolved: true

prometheus.yml (just a part of it)

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - alert_rules.yml

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'kafka'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'

    static_configs:
    - targets: ['localhost:8080']
      labels:
        service: 'Kafka'

alertmanager.service

[Unit]
Description=Prometheus Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/data/alertmanager \
  --web.listen-address=127.0.0.1:9093

Restart=always

[Install]
WantedBy=multi-user.target

alert_rules.yml

groups:
- name: alert_rules
  rules:
  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: "critical"
    annotations:
      summary: "Service {{ $labels.service }} down!"
      description: "{{ $labels.service }} of job {{ $labels.job }} has been down for more than 1 minute."


  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 25
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of memory (instance {{ $labels.instance }})"
      description: "Node memory is filling up (< 25% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"


  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"}  * 100) / node_filesystem_size_bytes{mountpoint="/"} < 40
    for: 1s
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 40% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Prometheus alerts (screenshot of the firing alerts):

But I don't see those alerts in Alertmanager (screenshot):

I'm out of ideas at this point. I need help, please; I've been on this since last week.

You have a mistake in your Alertmanager configuration. group_by expects a collection of label names, and from what I am seeing, critical is a label value, not a label name. So simply remove critical and you should be good to go.
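
For example, the route from your question with only real label names left in group_by (a minimal sketch; everything else unchanged):

route:
  group_by: ['severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert_channel'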

Also check out this blog post, which is quite helpful: https://www.robustperception.io/whats-the-difference-between-group_interval-group_wait-and-repeat_interval


Edit 1

If you want the receiver alert_channel to only receive alerts that have the severity critical, you have to create a sub-route with a match attribute.

Something along these lines:

route:
  group_by: ['...']  # good if the alert volume is very low
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  receiver: alert_channel  # the root route needs a default receiver
  routes:
    - match:
        severity: critical
      receiver: alert_channel
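
On Alertmanager 0.22 and later, the same sub-route can also be written with the newer matchers syntax (an equivalent sketch; it goes under route: just like the match version above):

routes:
  - matchers:
      - severity="critical"
    receiver: alert_channel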

Edit 2

If that does not work either, try this:

route:
  group_by: ['...']
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  receiver: alert_channel

This should work. Check your Prometheus logs and see if you find hints there.
