
How to run a pod based on a Prometheus alert

Is there any way to run a pod based on an alert fired from Prometheus? We have a scenario where we need to execute a pod when a disk pressure threshold is crossed. I am able to create the alert, but I also need to execute a pod. How can I achieve that?

groups:
  - name: node_memory_MemAvailable_percent
    rules:
    - alert: node_memory_MemAvailable_percent_alert
      annotations:
        description: Memory on node {{ $labels.instance }} currently at {{ $value }}% 
          is under pressure
        summary: Memory usage is under pressure, system may become unstable.
      expr: |
        100 - ((node_memory_MemAvailable_bytes{job="node-exporter"} * 100) / node_memory_MemTotal_bytes{job="node-exporter"}) > 80
      for: 2m
      labels:
        severity: warning

I think Alertmanager can help you here, using the webhook receiver ( documentation ).

This way, when the alert fires, Prometheus sends it to Alertmanager, and Alertmanager then does a POST to a custom webhook.
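As a rough sketch, the Alertmanager configuration for that flow could look like the following. The receiver name and service URL are placeholders for your own handler:

```yaml
# Sketch of an Alertmanager route + webhook receiver (Alertmanager >= 0.22
# matchers syntax). The URL points at a hypothetical in-cluster service.
route:
  receiver: default
  routes:
    - matchers:
        - alertname = node_memory_MemAvailable_percent_alert
      receiver: pod-runner
receivers:
  - name: default
  - name: pod-runner
    webhook_configs:
      - url: http://pod-runner-service.default.svc:8080/alert
        send_resolved: false
```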

Of course, you need to implement a service that handles the alert and runs your action.
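A minimal sketch of such a service, using only the Python standard library, might look like this. The port, alert name, and `kubectl run` command are illustrative assumptions; in a real service you would likely use the official kubernetes Python client instead of shelling out:

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

def firing_alert_names(payload):
    """Extract the names of alerts with status 'firing' from an
    Alertmanager webhook payload."""
    return [
        a["labels"].get("alertname", "")
        for a in payload.get("alerts", [])
        if a.get("status") == "firing"
    ]

def run_cleanup_pod():
    # Illustrative: launch a one-off pod with kubectl.
    subprocess.run(
        ["kubectl", "run", "disk-cleanup", "--image=busybox",
         "--restart=Never", "--", "sh", "-c", "echo cleaning up"],
        check=True,
    )

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if "node_memory_MemAvailable_percent_alert" in firing_alert_names(payload):
            run_cleanup_pod()
        self.send_response(200)
        self.end_headers()

# To start the service:
#   HTTPServer(("", 8080), AlertHandler).serve_forever()
```

Alertmanager retries failed webhook deliveries, so returning a non-200 status on error gives you retry behavior for free.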

Note that your question mentions disk pressure, but the rule in your code measures available memory. If you want to scale your replicas up and down based on memory, you can use the Horizontal Pod Autoscaler :

The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager's --horizontal-pod-autoscaler-sync-period flag (with a default value of 15 seconds).

During each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics) or the custom metrics API (for all other metrics).

You can create your own HPA based on memory utilization . Here is an example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-memory-scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 10Mi

You can also create a custom Kubernetes HPA with custom metrics from Prometheus :

Autoscaling is an approach to automatically scale workloads up or down based on resource usage. The K8s Horizontal Pod Autoscaler :

  • is implemented as a control loop that periodically queries the Resource Metrics API for core metrics like CPU/memory (through the metrics.k8s.io API), and the Custom Metrics API for application-specific metrics (the external.metrics.k8s.io or custom.metrics.k8s.io APIs; these are provided by "adapter" API servers offered by metrics solution vendors. There are some known solutions, but none of those implementations are officially part of Kubernetes)
  • automatically scales the number of pods in a deployment or replica set based on the observed metrics.

In what follows we'll focus on custom metrics, because the Custom Metrics API makes it possible for monitoring systems like Prometheus to expose application-specific metrics to the HPA controller.
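For illustration, an HPA consuming a Prometheus-backed custom metric could look like the sketch below. It assumes you have deployed an adapter (such as prometheus-adapter) that exposes a metric named http_requests_per_second through custom.metrics.k8s.io; the deployment and metric names are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```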

Another solution might be to use KEDA . Look at this guide . Here is an example YAML that scales an nginx deployment when the number of waiting connections crosses a threshold of 500:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nginx-scale
  namespace: keda-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: nginx-server
  minReplicaCount: 1
  maxReplicaCount: 5
  cooldownPeriod: 30
  pollingInterval: 1
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://prometheus_server/prometheus
      metricName: nginx_connections_waiting_keda
      query: |
        sum(nginx_connections_waiting{job="nginx"})
      threshold: "500"

Yes, we have the webhook, but we implemented the service using am-executor as a custom service with custom scripts from am-executor, and we have run the required jobs from an ADO (Azure DevOps) pipeline.

You can do this with an open source project called Robusta . (Disclaimer: I'm the maintainer.)

First, define which Prometheus alert you want to trigger on:

customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: DiskSpaceAlertName
  actions:
  - disk_watcher: {}

Second, we need to write the actual action that runs when triggered (called disk_watcher above). You can skip this step if someone has already written an action for your need, as there are 50+ builtin actions already.

In this case there is no built-in action, so we need to write one in Python. (I would be happy to add a builtin one, though. :)

from robusta.api import *

@action
def disk_watcher(event: DeploymentEvent):
    deployment = event.get_deployment()

    # read / modify the deployment's resources here
    print(deployment.spec.template.spec.containers[0].resources)
    deployment.update()

    # fetch the relevant pod
    pod = RobustaPod.find_pod(deployment.metadata.name, deployment.metadata.namespace)

    # see what is using up disk space
    output = pod.exec("df -h")

    # create another pod
    other_output = RobustaPod.exec_in_debugger_pod("my-new-pod", pod.spec.nodeName, "cmd to run", "my-image")

    # send details to Slack or any other destination
    event.add_enrichment([
        MarkdownBlock("the output from df is attached"),
        FileBlock("df.txt", output.encode()),
        FileBlock("other.txt", other_output.encode()),
    ])
