Is there a way to monitor kube cron jobs using Prometheus?
Is there a way to monitor a kube cronjob?
I have a kube cronjob that runs every 10 minutes on my cluster. Is there a way to collect metrics every time the cronjob fails due to some error, or to be notified when the cronjob has not completed after a certain period of time?
I'm using these rules with kube-state-metrics:
    # Note: from kube-state-metrics 1.4.0 on, the Job name label is job_name rather than job.
    groups:
    - name: job.rules
      rules:
      - alert: CronJobRunning
        expr: time() - kube_cronjob_next_schedule_time > 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete
          summary: CronJob didn't finish after 1h
      - alert: JobCompletion
        expr: kube_job_spec_completions - kube_job_status_succeeded > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          description: Job {{ $labels.namespace }}/{{ $labels.job }} is taking more than 1h to complete
          summary: Job {{ $labels.job }} didn't finish after 1h
      - alert: JobFailed
        expr: kube_job_status_failed > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          description: Job {{ $labels.namespace }}/{{ $labels.job }} failed to complete
          summary: Job failed
The tricky part here is that CronJobs themselves have no useful status; you have to match them to the Jobs they create. I've written up an article on how to achieve this:
https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511
The article goes into a bit of detail as to how things work, but the alert config is as follows:
    groups:
    - name: kube-cron
      rules:
      - record: job_cronjob:kube_job_status_start_time:max
        expr: |
          label_replace(
            label_replace(
              max(
                kube_job_status_start_time
                * ON(exported_job) GROUP_RIGHT()
                kube_job_labels{label_cronjob!=""}
              ) BY (exported_job, label_cronjob)
              == ON(label_cronjob) GROUP_LEFT()
              max(
                kube_job_status_start_time
                * ON(exported_job) GROUP_RIGHT()
                kube_job_labels{label_cronjob!=""}
              ) BY (label_cronjob),
              "job", "$1", "exported_job", "(.+)"),
            "cronjob", "$1", "label_cronjob", "(.+)")
      - record: job_cronjob:kube_job_status_failed:sum
        expr: |
          clamp_max(
            job_cronjob:kube_job_status_start_time:max,
            1)
          * ON(job) GROUP_LEFT()
          label_replace(
            label_replace(
              (kube_job_status_failed != 0),
              "job", "$1", "exported_job", "(.+)"),
            "cronjob", "$1", "label_cronjob", "(.+)")
      - alert: CronJobStatusFailed
        expr: |
          job_cronjob:kube_job_status_failed:sum
          * ON(cronjob) GROUP_RIGHT()
          kube_cronjob_labels
          > 0
        for: 1m
        annotations:
          description: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
The jobTemplate must include a label called cronjob that matches the name of the CronJob object.
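As a minimal sketch of that requirement, a CronJob manifest could carry the label as shown here. The name my-cronjob, the image, and the command are placeholders, and batch/v1 assumes a cluster recent enough to serve CronJob under that API version (older clusters used batch/v1beta1). It also assumes kube-state-metrics is configured to expose this Job label as label_cronjob on kube_job_labels, which is what the rules above select on.

```yaml
# Sketch only: names, image, and command are placeholders.
apiVersion: batch/v1          # assumes a cluster that serves CronJob at batch/v1
kind: CronJob
metadata:
  name: my-cronjob            # hypothetical CronJob name
spec:
  schedule: "*/10 * * * *"    # every 10 minutes, as in the question
  jobTemplate:
    metadata:
      labels:
        cronjob: my-cronjob   # must equal metadata.name above
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo running"]
```

The label goes on jobTemplate.metadata, not on the CronJob's own metadata, because the rules join against the labels of the Jobs that get created.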
The way to monitor cronjobs with Prometheus is to have them push a metric indicating the last time they succeeded to the Pushgateway. You can then alert if the cronjob hasn't succeeded recently enough.
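A rough sketch of the alerting side, assuming the job pushes a gauge holding the Unix time of its last successful run. The metric name my_cronjob_last_success_timestamp_seconds, the alert name, and the 30-minute threshold are illustrative assumptions, not something the answer specifies.

```yaml
groups:
- name: cronjob-pushgateway
  rules:
  - alert: CronJobNotSucceededRecently   # hypothetical alert name
    # Fires when the last pushed success timestamp is older than 30 minutes
    # (three missed runs for a job scheduled every 10 minutes).
    expr: time() - my_cronjob_last_success_timestamp_seconds > 1800
    labels:
      severity: warning
    annotations:
      summary: CronJob has not succeeded in the last 30 minutes
```

If the job has never pushed anything, the series does not exist and this expression returns nothing, so a separate absent() check on the same metric may be worth adding.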
You can get the info you want from here.
CronJobs create Jobs on a schedule, so you can simply look at kube_job_status_failed for the Jobs that are created. One caveat is that the Job name has an epoch time at the end.
To ensure alerts resolve themselves, I'm using the following query as the alert expression:

    increase(kube_job_status_failed{job=~"mytestjob-.*"}[5m]) > 1

My cron schedule is `*/5 * * * *`, and I set `backoffLimit: 2` to limit the number of failures per run.
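For reference, those two settings sit at different levels of the CronJob spec. A minimal sketch, assuming a CronJob named mytestjob (chosen to match the mytestjob-.* selector above), a cluster that serves CronJob at batch/v1, and a placeholder container that always fails:

```yaml
apiVersion: batch/v1            # assumes CronJob is served at batch/v1
kind: CronJob
metadata:
  name: mytestjob               # Jobs are created as mytestjob-<suffix>
spec:
  schedule: "*/5 * * * *"       # every 5 minutes
  jobTemplate:
    spec:
      backoffLimit: 2           # retries allowed before the Job is marked failed
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task          # placeholder container
            image: busybox
            command: ["sh", "-c", "exit 1"]
```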
The kube-state-metrics exporter also includes various CronJob-related metrics: https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/cronjob-metrics.md, but unfortunately it does not seem to include CronJob success/failure.
All answers so far are unaware of namespaces and depend on custom labeling of the Job.
The latter can be fixed, because kube-state-metrics version 1.6.0 introduced a new metric, kube_job_owner, which solves the problem of matching Jobs and CronJobs.
NOTE: In kube-state-metrics 1.4.0 the job label was renamed to job_name to avoid a label collision with Prometheus.
    clamp_max(
      max by (namespace, owner_name, job_name) (
        max by (namespace, owner_name, job_name) (
          kube_job_status_start_time
          *
          on (namespace, job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
        )
        ==
        on (namespace, owner_name) group_left max by (namespace, owner_name) (
          kube_job_status_start_time
          *
          on (namespace, job_name) group_left(owner_name) max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})
        )
      ),
      1
    )
    *
    on (namespace, job_name) group_left kube_job_status_failed
The output can be further improved by renaming the owner_name label to cronjob, by surrounding the expression with

    max without (owner_name) (
      label_replace(
        <expression from above>
        ,
        "cronjob", "$1", "owner_name", "(.+)"
      )
    )

(the label_replace() function adds a new cronjob label, while max() drops the owner_name label)
I was able to simplify the query from this Medium post (label_replace was not working for me for some reason): https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511
My cron query looks like this (we have "component" labels on all cronjobs instead of "cronjob", but you can use your favorite label):
    clamp_max(
      max(
        kube_job_status_start_time
        * ON(job) GROUP_RIGHT()
        kube_job_labels{label_component!=""}
      ) BY (job, label_component)
      == ON(label_component) GROUP_LEFT()
      max(
        kube_job_status_start_time
        * ON(job) GROUP_RIGHT()
        kube_job_labels{label_component!=""}
      ) BY (label_component),
      1
    ) * ON(job) GROUP_LEFT()
    kube_job_status_failed
Plug this into the Prometheus expression browser to make sure you get results (1 means the cron failed the last time, 0 means it succeeded or hasn't run yet).
For alerting, add != 0, and the query will return any cronjob that failed.
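As a sketch of how that could look in a rule file, assuming the same label_component label and job label naming as the query above. The group name, alert name, for duration, severity, and annotation text are illustrative choices, not taken from the post.

```yaml
groups:
- name: cronjob-failures               # hypothetical group name
  rules:
  - alert: CronJobLastRunFailed        # hypothetical alert name
    # Same query as above with "!= 0" appended, so only failed runs remain.
    expr: |
      clamp_max(
        max(
          kube_job_status_start_time
          * ON(job) GROUP_RIGHT()
          kube_job_labels{label_component!=""}
        ) BY (job, label_component)
        == ON(label_component) GROUP_LEFT()
        max(
          kube_job_status_start_time
          * ON(job) GROUP_RIGHT()
          kube_job_labels{label_component!=""}
        ) BY (label_component),
        1
      ) * ON(job) GROUP_LEFT()
      kube_job_status_failed
      != 0
    for: 1m
    labels:
      severity: warning
    annotations:
      description: 'Last run of {{ $labels.label_component }} failed (job {{ $labels.job }}).'
```

Because the expression only matches the most recently started Job per label value, the alert should clear on its own once a later run succeeds.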