Is there a way to monitor kube cron jobs using Prometheus?

Is there a way to monitor a kube cronjob?

I have a kube cronjob which runs every 10 minutes on my cluster. Is there a way to collect metrics every time my cronjob fails due to some error, or to be notified when my cronjob has not completed after a certain period of time?

I'm using these rules with kube-state-metrics:

groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() -kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h

  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job completion is taking more than 1h to complete
        cronjob {{$labels.namespace}}/{{$labels.job}}
      summary: Job {{$labels.job}} didn't finish to complete after 1h

  - alert: JobFailed
    expr: kube_job_status_failed  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
      summary: Job failed

The tricky part here is that the cronjobs themselves have no useful status; you have to match them to the jobs they create. I've written up an article on how to achieve this:

https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511

The article goes into a bit of detail as to how things work, but the alert config is as follows:

groups:
- name: kube-cron
  rules:
  - record: job_cronjob:kube_job_status_start_time:max
    expr: |
      label_replace(
        label_replace(
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (exported_job, label_cronjob)
          == ON(label_cronjob) GROUP_LEFT()
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (label_cronjob),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")

  - record: job_cronjob:kube_job_status_failed:sum
    expr: |
      clamp_max(
        job_cronjob:kube_job_status_start_time:max,
      1)
      * ON(job) GROUP_LEFT()
      label_replace(
        label_replace(
          (kube_job_status_failed != 0),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")


  - alert: CronJobStatusFailed
    expr: |
      job_cronjob:kube_job_status_failed:sum
      * ON(cronjob) GROUP_RIGHT()
      kube_cronjob_labels
      > 0
    for: 1m
    annotations:
      description: '{{ $labels.cronjob }} last run has failed {{$value }} times.'

The jobTemplate must include a label called cronjob that matches the name of the cronjob object.
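As a rough illustration of that requirement, here is a minimal CronJob manifest sketch. The name `my-cronjob`, the image, and the command are hypothetical placeholders; `batch/v1` assumes a reasonably recent cluster (older clusters used `batch/v1beta1`). Note also that newer kube-state-metrics releases may need the label allow-listed before it shows up on kube_job_labels as label_cronjob.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob              # hypothetical name
spec:
  schedule: "*/10 * * * *"      # every 10 minutes, as in the question
  jobTemplate:
    metadata:
      labels:
        # Must match metadata.name above so the rules can join Jobs to their CronJob.
        cronjob: my-cronjob
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task                               # hypothetical container
            image: busybox:1.36
            command: ["sh", "-c", "echo running"]
```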

The way to monitor cronjobs with Prometheus is to have them push a metric indicating the last time they succeeded to the Pushgateway. You can then alert if the cronjob hasn't succeeded recently enough.
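A minimal sketch of the alerting side of this approach, assuming the job pushes a gauge holding the Unix timestamp of its last successful run. The metric name `cronjob_last_success_timestamp_seconds` and the 30-minute threshold are hypothetical; pick whatever fits your job and schedule.

```yaml
groups:
- name: cronjob-pushgateway
  rules:
  # Fires when the last pushed success timestamp is more than 30 minutes old.
  # The Pushgateway keeps serving the last pushed value, so the age keeps growing
  # for as long as the job does not succeed again.
  - alert: CronJobNoRecentSuccess
    expr: time() - cronjob_last_success_timestamp_seconds > 1800
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: CronJob has not succeeded in the last 30 minutes
  # Fires when the metric has never been pushed (or was deleted from the Pushgateway).
  - alert: CronJobSuccessMetricMissing
    expr: absent(cronjob_last_success_timestamp_seconds)
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: No last-success metric has been pushed
```

The threshold should be comfortably larger than the schedule interval, otherwise the alert fires between two normal runs.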

You can get the info you want from here.

CronJobs create Jobs on a schedule, so you can simply look at kube_job_status_failed for the jobs that are created. One caveat is that the job name has an epoch time at the end.

To ensure alerts resolve themselves, I'm using the following query in my alerting rule:

increase(kube_job_status_failed{job=~"mytestjob-.*"}[5m]) > 1

My cron schedule is `*/5 * * * *`, and I set `backoffLimit: 2` to limit the number of failures per run.
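For reference, a sketch of where those two settings live in the manifest. The name `mytestjob` is inferred from the `mytestjob-.*` matcher in the query; the image and command are hypothetical, and `batch/v1` assumes a reasonably recent cluster.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mytestjob               # Jobs are created as mytestjob-<suffix>
spec:
  schedule: "*/5 * * * *"       # every 5 minutes
  jobTemplate:
    spec:
      backoffLimit: 2           # give up on a run after 2 retries
      template:
        spec:
          # With Never, each failed attempt ends up as a separate failed pod.
          restartPolicy: Never
          containers:
          - name: task                              # hypothetical container
            image: busybox:1.36
            command: ["sh", "-c", "exit 1"]         # always fails, to exercise the alert
```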

The kube-state-metrics exporter also includes various CronJob-related metrics: https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/cronjob-metrics.md, but unfortunately it does not seem to include CronJob success/failure.

All answers so far are unaware of namespaces and depend on custom labeling in the Job.

The latter can be fixed because kube-state-metrics version 1.6.0 introduced a new metric, kube_job_owner, which solves the problem of matching Jobs and CronJobs.

NOTE: In kube-state-metrics 1.4.0 the job label was renamed to job_name to avoid a label collision with Prometheus.

clamp_max(
  max by (namespace, owner_name, job_name) (
    max by (namespace, owner_name, job_name) (
      kube_job_status_start_time
      *
      on (job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
    )
    ==
    on (namespace, owner_name) group_left max by (namespace, owner_name) (
      kube_job_status_start_time
      *
      on (job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
    )
  ),
  1
)
*
on (namespace, job_name) group_left kube_job_status_failed

The output can be further improved by renaming the owner_name label to cronjob, by surrounding the expression with:

max without (owner_name) (
  label_replace(
    <expression from above>
  ,
  "cronjob", "$1", "owner_name", "(.+)"
  )
)

(The label_replace() function adds a new cronjob label, while max() drops the owner_name label.)
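Put together, the two snippets could be wired into a rules file roughly as follows. This is only a sketch: the group name, the recording rule name `cronjob:kube_job_status_failed:last_run`, the alert name, and the `for`/severity settings are all hypothetical, while the expression itself is the one above with the label_replace() wrapper substituted in.

```yaml
groups:
- name: kube-cron-owner
  rules:
  # Number of failures of the most recent Job of each CronJob (0 if it has none).
  - record: cronjob:kube_job_status_failed:last_run
    expr: |
      max without (owner_name) (
        label_replace(
          clamp_max(
            max by (namespace, owner_name, job_name) (
              max by (namespace, owner_name, job_name) (
                kube_job_status_start_time
                *
                on (job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
              )
              ==
              on (namespace, owner_name) group_left max by (namespace, owner_name) (
                kube_job_status_start_time
                *
                on (job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
              )
            ),
            1
          )
          *
          on (namespace, job_name) group_left kube_job_status_failed
        ,
        "cronjob", "$1", "owner_name", "(.+)"
        )
      )

  - alert: CronJobLastRunFailed
    expr: cronjob:kube_job_status_failed:last_run > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      description: 'CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} last run has failed {{ $value }} times.'
```

Keeping the join in a recording rule means the alert expression stays short and the same series can be reused on dashboards.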

I was able to simplify the query from this Medium post (label_replace was not working for me for some reason): https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511

My cron query looks like this (we have "component" labels on all cronjobs instead of "cronjob", but you can use your favorite label):

clamp_max(max(
    kube_job_status_start_time
    * ON(job) GROUP_RIGHT()
    kube_job_labels{label_component!=""}
  ) BY (job, label_component)
  == ON(label_component) GROUP_LEFT()
  max(
    kube_job_status_start_time
    * ON(job) GROUP_RIGHT()
    kube_job_labels{label_component!=""}
) BY (label_component), 1) * ON(job) GROUP_LEFT() 
kube_job_status_failed

Plug this into the Prometheus expression browser to make sure you get results (1 means the cron failed the last time, 0 means it succeeded or hasn't run yet).

For alerting, add != 0, and the query will return ANY cronjob that failed.
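As a sketch, the alerting variant could look like this in a rules file. The group name, alert name, `for` duration, and severity are hypothetical; the expression is the query above with != 0 appended. Because comparison operators bind more loosely than *, the filter applies to the whole product.

```yaml
groups:
- name: kube-cron-component
  rules:
  - alert: CronJobLastRunFailed
    expr: |
      clamp_max(max(
          kube_job_status_start_time
          * ON(job) GROUP_RIGHT()
          kube_job_labels{label_component!=""}
        ) BY (job, label_component)
        == ON(label_component) GROUP_LEFT()
        max(
          kube_job_status_start_time
          * ON(job) GROUP_RIGHT()
          kube_job_labels{label_component!=""}
      ) BY (label_component), 1) * ON(job) GROUP_LEFT()
      kube_job_status_failed
      != 0
    for: 1m
    labels:
      severity: warning
    annotations:
      description: 'Cronjob {{ $labels.label_component }} failed on its last run.'
```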

