
Prometheus Alertmanager doesn't send alerts (k8s)

I am using prometheus-operator v0.34.0 and Alertmanager 0.20, and it doesn't work: I can see the alert firing (in the Prometheus UI, on the Alerts tab), but I don't receive any email notification. Looking at the logs I see the following; any ideas? Please look at the warning lines, maybe that is the reason, but I don't know how to fix it...

This is the prometheus-operator Helm chart I use: https://github.com/helm/charts/tree/master/stable/prometheus-operator

level=info ts=2019-12-23T15:42:28.039Z caller=main.go:231 msg="Starting Alertmanager" version="(version=0.20.0, branch=HEAD, revision=f74be0400a6243d10bb53812d6fa408ad71ff32d)"
level=info ts=2019-12-23T15:42:28.039Z caller=main.go:232 build_context="(go=go1.13.5, user=root@00c3106655f8, date=20191211-14:13:14)"
level=warn ts=2019-12-23T15:42:28.109Z caller=cluster.go:228 component=cluster msg="failed to join cluster" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
level=info ts=2019-12-23T15:42:28.109Z caller=cluster.go:230 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-12-23T15:42:28.109Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
level=info ts=2019-12-23T15:42:28.109Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-12-23T15:42:28.131Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-12-23T15:42:28.132Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-12-23T15:42:28.134Z caller=main.go:416 component=configuration msg="skipping creation of receiver not referenced by any route" receiver=AlertMail
level=info ts=2019-12-23T15:42:28.134Z caller=main.go:416 component=configuration msg="skipping creation of receiver not referenced by any route" receiver=AlertMail2
level=info ts=2019-12-23T15:42:28.135Z caller=main.go:497 msg=Listening address=:9093
level=info ts=2019-12-23T15:42:30.110Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.00011151s
level=info ts=2019-12-23T15:42:38.110Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.000659096s

This is my config yaml:

global:
  imagePullSecrets: []


prometheus-operator:
  defaultRules:
  grafana:
    enabled: true
  prometheusOperator:
    tolerations:
      - key: "WorkGroup"
        operator: "Equal"
        value: "operator"
        effect: "NoSchedule"
      - key: "WorkGroup"
        operator: "Equal"
        value: "operator"
        effect: "NoExecute"
    tlsProxy:
      image:
        repository: squareup/ghostunnel
        tag: v1.4.1
        pullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 8000m
        memory: 2000Mi
      requests:
        cpu: 2000m
        memory: 2000Mi
    admissionWebhooks:
      patch:
        priorityClassName: "operator-critical"
        image:
          repository: jettech/kube-webhook-certgen
          tag: v1.0.0
          pullPolicy: IfNotPresent
    serviceAccount:
      name: prometheus-operator
    image:
      repository: quay.io/coreos/prometheus-operator
      tag: v0.34.0
      pullPolicy: IfNotPresent

  prometheus:
    prometheusSpec:
      replicas: 1
      serviceMonitorSelector:
        role: observeable
      tolerations:
        - key: "WorkGroup"
          operator: "Equal"
          value: "operator"
          effect: "NoSchedule"
        - key: "WorkGroup"
          operator: "Equal"
          value: "operator"
          effect: "NoExecute"
      ruleSelector:
        matchLabels:
          role: alert-rules
          prometheus: prometheus
      image:
        repository: quay.io/prometheus/prometheus
        tag: v2.13.1
  alertmanager:
    alertmanagerSpec:
      image:
        repository: quay.io/prometheus/alertmanager
        tag: v0.20.0
      resources:
        limits:
          cpu: 500m
          memory: 1000Mi
        requests:
          cpu: 500m
          memory: 1000Mi
    serviceAccount:
      name: prometheus
    config:
      global:
        resolve_timeout: 1m
        smtp_smarthost: 'smtp.gmail.com:587'
        smtp_from: 'alertmanager@vsx.com'
        smtp_auth_username: 'ds.monitoring.grafana@gmail.com'
        smtp_auth_password: 'mypass'
        smtp_require_tls: false
      route:
        group_by: ['alertname', 'cluster']
        group_wait: 45s
        group_interval: 5m
        repeat_interval: 1h
        receiver: default-receiver
        routes:
          - receiver: str
            match_re:
              cluster: "canary|canary2"

      receivers:
        - name: default-receiver
        - name: str
          email_configs:
          - to: 'rayndoll007@gmail.com'
            from: alertmanager@vsx.com
            smarthost: smtp.gmail.com:587
            auth_identity: ds.monitoring.grafana@gmail.com
            auth_username: ds.monitoring.grafana@gmail.com
            auth_password: mypass

        - name: 'AlertMail'
          email_configs:
            - to: 'rayndoll007@gmail.com'

https://codebeautify.org/yaml-validator/cb6a2781

The error says it failed to resolve the name. The pod named alertmanager-monitoring-prometheus-oper-alertmanager-0 is up and running, yet Alertmanager tries to look up alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc and fails; not sure why...
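One way to check this by hand (a sketch; the FQDN is copied from the warning above, and busybox:1.28 is just a convenient image that ships nslookup) is to resolve the name from a throwaway pod and to confirm that the headless service has endpoints behind it:

kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -n mon -- \
  nslookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc

kubectl get endpoints alertmanager-operated -n mon

If nslookup fails here as well, the problem is with the name itself (for example, the namespace part of the FQDN not matching the namespace the pods actually run in) rather than with Alertmanager.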

Update: here are the warning logs:

level=warn ts=2019-12-24T12:10:21.293Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094
level=warn ts=2019-12-24T12:10:21.323Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-1.alertmanager-operated.monitoring.svc:9094
level=warn ts=2019-12-24T12:10:21.326Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-2.alertmanager-operated.monitoring.svc:9094

And here is kubectl get svc -n mon:

alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   6m4s
monitoring-grafana                        ClusterIP   100.11.215.226   <none>        80/TCP                       6m13s
monitoring-kube-state-metrics             ClusterIP   100.22.248.232   <none>        8080/TCP                     6m13s
monitoring-prometheus-node-exporter       ClusterIP   100.33.130.77    <none>        9100/TCP                     6m13s
monitoring-prometheus-oper-alertmanager   ClusterIP   100.33.228.217   <none>        9093/TCP                     6m13s
monitoring-prometheus-oper-operator       ClusterIP   100.21.229.204   <none>        8080/TCP,443/TCP             6m13s
monitoring-prometheus-oper-prometheus     ClusterIP   100.22.93.151    <none>        9090/TCP                     6m13s
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     5m54s

Proper debugging steps that help in these scenarios:

  1. Enable Alertmanager debug logging: add the argument --log.level=debug
  2. Verify that the Alertmanager cluster forms correctly (check its /status endpoint and confirm that all peers are listed)
  3. Verify that Prometheus is sending alerts to all Alertmanager peers (check the Prometheus /status endpoint and confirm that all Alertmanager peers are listed)
  4. Test end to end: fire a test alert; you should see the alert in the Prometheus UI, then in the Alertmanager UI, and finally you should receive the notification. A sketch of these checks follows below.
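
A rough sketch of steps 1 to 4 with kubectl and curl (the service names and namespace are taken from the kubectl get svc output above; the Helm value path for step 1 assumes the chart passes alertmanagerSpec.logLevel through to the Alertmanager resource, and the cluster=canary label is chosen to match the match_re in the config above):

# 1. Enable debug logging (equivalent to --log.level=debug) via the values file,
#    assuming the chart exposes it as:
#    prometheus-operator:
#      alertmanager:
#        alertmanagerSpec:
#          logLevel: debug

# 2. Alertmanager cluster status: all peers should be listed
kubectl port-forward svc/monitoring-prometheus-oper-alertmanager 9093 -n mon &
curl -s http://localhost:9093/api/v2/status

# 3. Which Alertmanagers Prometheus is actually sending alerts to
kubectl port-forward svc/monitoring-prometheus-oper-prometheus 9090 -n mon &
curl -s http://localhost:9090/api/v1/alertmanagers

# 4. Fire a test alert directly against Alertmanager (this skips Prometheus,
#    but exercises routing and the email receiver end to end)
curl -s -XPOST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[
  {
    "labels": { "alertname": "TestAlert", "cluster": "canary" },
    "annotations": { "summary": "end-to-end test alert" }
  }
]'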
