Alerts firing on Prometheus but not on Alertmanager
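I can't seem to figure out why Alertmanager is not receiving alerts from Prometheus, and I would appreciate prompt help with this. I'm fairly new to Prometheus and Alertmanager. I'm using a webhook for MS Teams to push the notifications from Alertmanager.
alertmanager.yml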
global:
  resolve_timeout: 5m
route:
  group_by: ['critical','severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'alert_channel'
receivers:
  - name: 'alert_channel'
    webhook_configs:
      - url: 'http://localhost:2000/alert_channel'
        send_resolved: true
prometheus.yml (only part of it)
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - alert_rules.yml
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'kafka'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'
    static_configs:
      - targets: ['localhost:8080']
        labels:
          service: 'Kafka'
Alertmanager service
[Unit]
Description=Prometheus Alert Manager
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/data/alertmanager \
--web.listen-address=127.0.0.1:9093
Restart=always
[Install]
WantedBy=multi-user.target
alert_rules.yml
groups:
  - name: alert_rules
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "Service {{ $labels.service }} down!"
          description: "{{ $labels.service }} of job {{ $labels.job }} has been down for more than 1 minute."
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 25% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 40
        for: 1s
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 40% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I'm out of ideas at this point. Please, I need help. I've been working on this since last week.
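Your Alertmanager configuration is wrong. group_by expects a list of label names, and I believe critical is a label value, not a label name. So just remove critical and you should be good to go.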
Also have a look at this blog post, it is very useful: https://www.robustperception.io/whats-the-difference-between-group_interval-group_wait-and-repeat_interval
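Applied to the alertmanager.yml from the question, the change might look roughly like the following. This is a sketch, not a tested configuration: grouping by alertname in addition to severity is my own assumption (any label names your alerts actually carry will do), and whether this change alone brings the notifications back is not certain. The comments on the three timers summarise what each setting controls.

```yaml
global:
  resolve_timeout: 5m
route:
  # group_by takes label NAMES; 'critical' is a value of the severity label, so it is dropped.
  # 'alertname' is an assumption added for illustration, it is not in the original config.
  group_by: ['alertname', 'severity']
  group_wait: 10s       # how long to wait before the first notification for a new group
  group_interval: 10s   # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 1h   # how long to wait before re-sending a notification that was already sent
  receiver: 'alert_channel'
receivers:
  - name: 'alert_channel'
    webhook_configs:
      - url: 'http://localhost:2000/alert_channel'
        send_resolved: true
```

Before restarting the service, the file can be validated with amtool check-config /etc/alertmanager/alertmanager.yml.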
Edit 1
If you want alert_channel to receive only alerts whose severity is critical, you have to create a route with a match attribute.
Something along these lines:
route:
  group_by: ['...'] # good if very low volume
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - match:
        severity: critical
      receiver: alert_channel
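For reference, here is a fuller sketch of such a routing tree. Two details matter: match takes a map of label name to value (not a list), and Alertmanager expects the root route to name a default receiver. The blackhole receiver is a hypothetical name of mine, not something from the question: a receiver with no configs drops whatever reaches it, so only critical alerts get through to alert_channel.

```yaml
route:
  group_by: ['...']        # special value: group by all labels, i.e. no aggregation
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'blackhole'    # default receiver (hypothetical name) for anything not matched below
  routes:
    - match:
        severity: critical # label name -> required value
      receiver: 'alert_channel'
receivers:
  - name: 'blackhole'      # no *_configs, so notifications routed here go nowhere
  - name: 'alert_channel'
    webhook_configs:
      - url: 'http://localhost:2000/alert_channel'
        send_resolved: true
```

Recent Alertmanager releases prefer the matchers list over match for the same condition, so check which form your version accepts.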
Edit 2
If that doesn't work, try the following:
route:
  group_by: ['...']
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  receiver: alert_channel
This should work. Check your Prometheus logs to see whether you can find any hints there.