How to run a pod based on a Prometheus alert

Question

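Is there a way to run a pod based on an alert fired by Prometheus? We have a scenario where we need to run a pod when a disk pressure threshold is crossed. I am able to create the alert, but I also need to run a pod. How can I achieve this?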

groups:
  - name: node_memory_MemAvailable_percent
    rules:
    - alert: node_memory_MemAvailable_percent_alert
      annotations:
        description: Memory on node {{ $labels.instance }} currently at {{ $value }}% 
          is under pressure
        summary: Memory usage is under pressure, system may become unstable.
      expr: |
        100 - ((node_memory_MemAvailable_bytes{job="node-exporter"} * 100) / node_memory_MemTotal_bytes{job="node-exporter"}) > 80
      for: 2m
      labels:
        severity: warning

Answer 1

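I think Alertmanager can help you here, using the webhook receiver (documentation).

That way, when the alert fires, Prometheus sends it to Alertmanager, and Alertmanager then performs a POST to a custom webhook.

Of course, you need to implement a service that handles the alert and runs your action.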
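
As a rough illustration of such a service, here is a minimal sketch using only the Python standard library. It assumes the Alertmanager webhook payload is a JSON object with an `alerts` list whose entries carry `status` and `labels`. The namespace, image, command, port, and the `submit` placeholder are all hypothetical; a real handler would send the manifest to the Kubernetes API (for example with the official Python client) instead of printing it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical settings, chosen only for illustration.
NAMESPACE = "default"
IMAGE = "busybox:1.36"
COMMAND = ["sh", "-c", "df -h"]


def pods_for_alerts(payload: dict) -> list:
    """Build one Pod manifest per firing alert in a webhook payload."""
    pods = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore resolved alerts
        alertname = alert.get("labels", {}).get("alertname", "alert")
        # Pod names must be lowercase and may not contain underscores.
        prefix = alertname.lower().replace("_", "-")
        pods.append({
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {"generateName": prefix + "-", "namespace": NAMESPACE},
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "task", "image": IMAGE, "command": COMMAND}
                ],
            },
        })
    return pods


def submit(pod: dict) -> None:
    """Placeholder: print the manifest instead of calling the Kubernetes API."""
    print(json.dumps(pod))


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body that Alertmanager POSTs to the webhook.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for pod in pods_for_alerts(payload):
            submit(pod)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the webhook receiver; blocks until interrupted."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the receiver. The Alertmanager receiver would then need a `webhook_configs` entry whose `url` points at wherever this service is reachable.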

Answer 2

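In general, your question mentions disk pressure, but the code you posted measures available memory. If you want to scale replicas up and down based on memory, you can implement a Horizontal Pod Autoscaler:

The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager's --horizontal-pod-autoscaler-sync-period flag (with a default value of 15 seconds).

During each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics) or the custom metrics API (for all other metrics).

You can create your own HPA based on memory usage. Here is an example: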

apiVersion: autoscaling/v2beta2  # use autoscaling/v2 on Kubernetes 1.23 and later
kind: HorizontalPodAutoscaler
metadata:
  name: php-memory-scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue  # averageValue requires type AverageValue, not Utilization
        averageValue: 10Mi
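
To make the control loop more concrete, the sketch below reproduces the replica calculation described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric value / target value), skipped when the ratio is within a tolerance of 1. The function name and its defaults are assumptions for illustration, not the controller's actual code; the 0.1 tolerance is the documented default, and the 1 to 10 bounds mirror the manifest above.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling decision for a single metric."""
    ratio = current_value / target_value
    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds set by minReplicas and maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


# Three pods averaging 15Mi against a 10Mi per-pod target.
print(desired_replicas(3, 15, 10))
```

With three pods averaging 15Mi against a 10Mi target, the ratio is 1.5, so the sketch asks for ceil(4.5) = 5 replicas.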

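You can also create a custom Kubernetes HPA with custom metrics from Prometheus:

Autoscaling is an approach to automatically scale workloads up or down based on resource usage. The K8s Horizontal Pod Autoscaler:

is implemented as a control loop that periodically queries the Resource Metrics API (metrics.k8s.io) for core metrics such as CPU and memory, and the Custom Metrics API (external.metrics.k8s.io or custom.metrics.k8s.io) for application-specific metrics. The custom and external metrics APIs are served by "adapter" API servers provided by metrics solution vendors. There are some known solutions, but none of those implementations are officially part of Kubernetes.

automatically scales the number of pods in a deployment or replica set based on the observed metrics.

In what follows, we'll focus on custom metrics, because the Custom Metrics API makes it possible for monitoring systems like Prometheus to expose application-specific metrics to the HPA controller.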

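Another solution might be to use KEDA. Take a look at this guide. Here is an example YAML that scales an nginx Deployment from a Prometheus query; as written, the query sums waiting connections and the threshold is 500: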

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
 name: nginx-scale
 namespace: keda-hpa
spec:
 scaleTargetRef:
   kind: Deployment
   name: nginx-server
 minReplicaCount: 1
 maxReplicaCount: 5
 cooldownPeriod: 30
 pollingInterval: 1
 triggers:
 - type: prometheus
   metadata:
     serverAddress: https://prometheus_server/prometheus
     metricName: nginx_connections_waiting_keda
     query: |
       sum(nginx_connections_waiting{job="nginx"})
     threshold: "500"
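
As a rough mental model of how the trigger above turns a query result into a replica count, the sketch below parses a Prometheus instant-query response (the `data.result[].value` shape returned by `/api/v1/query`) and divides the sample by the threshold. This is an assumption-laden approximation, not KEDA's code: it assumes the threshold acts as a per-replica target, and the function names are hypothetical.

```python
import math


def sample_value(response: dict) -> float:
    """Return the first sample of an instant-query vector result, or 0.0."""
    result = response["data"]["result"]
    if not result:
        return 0.0
    _timestamp, value = result[0]["value"]  # Prometheus encodes the value as a string
    return float(value)


def replicas_for(value, threshold, min_replicas, max_replicas):
    """Replica count if each replica should handle `threshold` units of the metric."""
    wanted = math.ceil(value / threshold)
    return max(min_replicas, min(max_replicas, wanted))


# A response in the shape returned by the Prometheus HTTP API.
response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1637067898.0, "1200"]}],
    },
}
print(replicas_for(sample_value(response), 500, 1, 5))
```

Under these assumptions, 1200 waiting connections with a threshold of 500 yields ceil(2.4) = 3 replicas, bounded by minReplicaCount and maxReplicaCount.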

Answer 3

Yes, we have the webhook, but we implemented the service with am-executor acting as the custom service. From an am-executor custom script, we run the required job from the ADO pipeline.

Answer 4

You can do this with an open source project called Robusta. (Disclaimer: I am the maintainer.)

First, define which Prometheus alert you want to trigger on:

customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: DiskSpaceAlertName
  actions:
  - disk_watcher: {}

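Second, we need to write the actual action that runs when triggered. (This is called disk_watcher above.) You can skip this step if someone has already written an action that fits your needs, as there are already more than 50 built-in actions.

In this case there is no built-in action, so we need to write one in Python. (I would be happy to add a built-in one, though :)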

from robusta.api import *  # provides action, DeploymentEvent, RobustaPod, MarkdownBlock, FileBlock

@action
def disk_watcher(event: DeploymentEvent):
    deployment = event.get_deployment()

    # read / modify the resources here
    print(deployment.spec.template.spec.containers[0].resources)
    deployment.update()

    # fetch the relevant pod
    pod = RobustaPod.find_pod(deployment.metadata.name, deployment.metadata.namespace)

    # see what is using up disk space
    output = pod.exec("df -h")

    # create another pod
    other_output = RobustaPod.exec_in_debugger_pod("my-new-pod", pod.spec.nodeName, "cmd to run", "my-image")

    # send details to slack or any other destination
    event.add_enrichment([
        MarkdownBlock("the output from df is attached"),
        FileBlock("df.txt", output.encode()),
        FileBlock("other.txt", other_output.encode())
    ])

4 answers

Answer 1
1 2021-11-16 13:04:58

Answer 2
0 2021-11-16 13:41:27

Answer 3
0 accepted 2021-12-05 05:08:54

Answer 4
0 2022-01-02 08:27:55
