How to run a pod based on a Prometheus alert

Question

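Is there a way to run a pod based on an alert fired by Prometheus? We have a scenario where we need to run a pod when a disk pressure threshold is crossed. I am able to create the alert, but I also need to run a pod. How can I achieve this?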

groups:
  - name: node_memory_MemAvailable_percent
    rules:
    - alert: node_memory_MemAvailable_percent_alert
      annotations:
        description: Memory on node {{ $labels.instance }} currently at {{ $value }}% 
          is under pressure
        summary: Memory usage is under pressure, system may become unstable.
      expr: |
        100 - ((node_memory_MemAvailable_bytes{job="node-exporter"} * 100) / node_memory_MemTotal_bytes{job="node-exporter"}) > 80
      for: 2m
      labels:
        severity: warning

Answer 1

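I think Alertmanager can help you here, using the webhook receiver (see the Alertmanager documentation).

That way, when the alert fires, Prometheus sends it to Alertmanager, and Alertmanager then makes a POST request to your custom webhook.

Of course, you need to implement a service that handles the alert and runs your action.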
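As a rough illustration of such a service, here is a minimal sketch using only the Python standard library. It assumes the standard Alertmanager webhook payload (a JSON object with an `alerts` list whose entries carry `status` and `labels`). The names `pods_for_alerts`, `AlertHandler`, and `serve`, the `busybox` image, the `df -h` command, and port 8080 are all hypothetical choices for this example. Submitting the manifest to the cluster is deliberately left as a placeholder.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def pods_for_alerts(payload, image="busybox", command=("df", "-h")):
    """Build one Pod manifest per firing alert in an Alertmanager webhook payload."""
    manifests = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore resolved alerts
        # Pod names must be DNS-compatible, so lowercase and replace underscores.
        name = alert.get("labels", {}).get("alertname", "alert").lower().replace("_", "-")
        manifests.append({
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {"generateName": f"{name}-"},
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "task", "image": image, "command": list(command)}],
            },
        })
    return manifests


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for manifest in pods_for_alerts(payload):
            # Placeholder: create the pod here, for example with a Kubernetes
            # client library or by piping the manifest to `kubectl create -f -`.
            print(json.dumps(manifest))
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the webhook receiver (blocks until interrupted)."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the receiver. On the Alertmanager side, a receiver with a `webhook_configs` entry whose `url` points at this service would deliver the alerts to it.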

Answer 2

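Your question mentions disk pressure, but the rule in your code measures available memory. If you want to scale replicas up and down based on memory, you can use a Horizontal Pod Autoscaler:

The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager's --horizontal-pod-autoscaler-sync-period flag (default 15 seconds).

During each period, the controller manager queries resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics) or the custom metrics API (for all other metrics).

You can create your own HPA based on memory usage. Here is an example: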
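The scaling decision made in each period can be sketched as follows. This follows the replica formula given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with the default tolerance of 0.1. The function name `desired_replicas` is hypothetical, and the sketch leaves out details such as min/max bounds, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_value / target_value
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, with 2 replicas, a current metric value of 200 and a target of 100, the ratio is 2.0 and the sketch returns 4 replicas.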

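# On Kubernetes 1.23+ use autoscaling/v2 (v2beta2 was removed in 1.26)
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: php-memory-scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        # averageValue requires type AverageValue;
        # type Utilization would need averageUtilization instead
        type: AverageValue
        averageValue: 10Mi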

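You can also create a custom Kubernetes HPA using custom metrics from Prometheus:

Autoscaling is an approach to automatically scale workloads up or down based on resource usage. The K8s Horizontal Pod Autoscaler:

is implemented as a control loop that periodically queries the Resource Metrics API (metrics.k8s.io) for core metrics such as CPU/memory, and the Custom Metrics API (custom.metrics.k8s.io or external.metrics.k8s.io) for application-specific metrics. These APIs are served by "adapter" API servers provided by metrics solution vendors. There are some known solutions, but none of these implementations are officially part of Kubernetes.

automatically scales the number of pods in a Deployment or ReplicaSet based on the observed metrics.

In what follows, we will focus on custom metrics, because the Custom Metrics API makes it possible for monitoring systems such as Prometheus to expose application-specific metrics to the HPA controller.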

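Another solution could be to use KEDA. Take a look at this guide. Here is an example YAML that scales an nginx deployment on a Prometheus query (waiting nginx connections, with a threshold of 500):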

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
 name: nginx-scale
 namespace: keda-hpa
spec:
 scaleTargetRef:
   kind: Deployment
   name: nginx-server
 minReplicaCount: 1
 maxReplicaCount: 5
 cooldownPeriod: 30
 pollingInterval: 1
 triggers:
 - type: prometheus
   metadata:
     serverAddress: https://prometheus_server/prometheus
     metricName: nginx_connections_waiting_keda
     query: |
       sum(nginx_connections_waiting{job="nginx"})
     threshold: "500"
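To make the role of `query` and `threshold` more concrete, here is a hedged sketch of the arithmetic involved. It assumes the trigger value is treated as an average-value target, so the wanted replica count is roughly ceil(metric / threshold), clamped to the min/max replica counts. Actual KEDA behavior depends on its version and on the generated HPA's settings. The function names are hypothetical. The response shape is that of the Prometheus HTTP API `/api/v1/query` endpoint for an instant vector.

```python
import math


def scalar_from_prometheus(response):
    """Sum the sample values of a Prometheus /api/v1/query instant-vector response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    # Each sample is {"metric": {...}, "value": [timestamp, "string value"]}.
    return sum(float(sample["value"][1]) for sample in response["data"]["result"])


def replicas_for(metric_value, threshold, min_replicas, max_replicas):
    """Approximate replica count for a threshold-based trigger."""
    wanted = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, wanted))
```

With the values from the example above (threshold 500, 1 to 5 replicas), a query result of 1200 waiting connections would map to 3 replicas under this approximation.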

Answer 3

Yes, a webhook works, but we implemented the service using am-executor as the custom service. From am-executor's custom script, we ran the required job from an ADO pipeline.

Answer 4

You can do this with an open source project called Robusta. (Disclaimer: I am the maintainer.)

First, define which Prometheus alert you want to trigger on:

customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: DiskSpaceAlertName
  actions:
  - disk_watcher: {}

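Second, we need to write the actual action that runs when the trigger fires (called disk_watcher above). You can skip this step if someone has already written an action that fits your needs, as there are already 50+ built-in actions.

In this case there is no built-in action, so we need to write one in Python. (I'd be happy to add a built-in one, though :)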

from robusta.api import *  # assumption: the usual import for Robusta playbook actions

@action
def disk_watcher(event: DeploymentEvent):
    deployment = event.get_deployment()

    # read / modify the resources here
    print(deployment.spec.template.spec.containers[0].resources)
    deployment.update()

    # fetch the relevant pod
    pod = RobustaPod.find_pod(deployment.metadata.name, deployment.metadata.namespace)

    # see what is using up disk space
    output = pod.exec("df -h")

    # create another pod
    other_output = RobustaPod.exec_in_debugger_pod("my-new-pod", pod.spec.nodeName, "cmd to run", "my-image")

    # send details to slack or any other destination
    event.add_enrichment([
        MarkdownBlock("the output from df is attached"),
        FileBlock("df.txt", output.encode()),
        FileBlock("other.txt", other_output.encode())
    ])

4 answers

Answer 1
1 vote, 2021-11-16 13:04:58

Answer 2
0 votes, 2021-11-16 13:41:27

Answer 3
0 votes, accepted, 2021-12-05 05:08:54

Answer 4
0 votes, 2022-01-02 08:27:55