
Why does Prometheus resolve unresolved alerts?

I have job failure alerts in Prometheus. An alert resolves itself exactly 2 hours after it fires, even though the underlying failure has not actually been resolved. Why does Prometheus resolve it? Note that this only happens with this job alert.

Job Alert:

  - alert: Failed Job Status
    expr: increase(kube_job_status_failed[30m]) > 0
    for: 1m
    labels:
      severity: deploy_slack
    annotations:
      identifier: '{{ $labels.namespace }} {{ $labels.job_name }}'
      description: '{{ $labels.namespace }} - {{ $labels.job_name }} Failed'

An example of the alert:

At 3:01 pm
[FIRING:1] Failed Job Status @ <environment-name> <job-name>
<environment-name> - <job-name> Failed

At 5:01 pm
[RESOLVED]
Alerts Resolved:
- <environment-name> - <job-name>: <environment-name> - <job-name> Failed

Here are the related pods; as can be seen, nothing appears to have been resolved.

[script here]

Thanks for your help in advance!

kube_job_status_failed is a gauge representing the number of failed job pods at a given time. The expression increase(kube_job_status_failed[30m]) > 0 asks the question: "have there been new failures in the last 30 minutes?" If there haven't, it won't be true, even if old failures remain in the Kubernetes API.

A refinement of this approach is sum(rate(kube_job_status_failed[5m])) by (namespace, job_name) > 0, plus an Alertmanager configuration that does not send resolved notices for this alert. The reason is that a job pod failure is an event that can't be reversed: the job could be retried, but the pod can't be un-failed, so resolution only means that the alert has "aged out" or that the pods have been deleted.
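As a rough sketch of the "no resolved notices" part, an Alertmanager routing tree could send this one alert to a receiver with send_resolved turned off. The receiver names and the channel below are hypothetical, the matchers syntax assumes a reasonably recent Alertmanager (0.22 or later), and a Slack webhook URL is assumed to be set elsewhere (for example via global.slack_api_url):

```yaml
route:
  receiver: deploy-slack                    # hypothetical default receiver
  routes:
    # Route only the job-failure alert to a receiver that stays silent on resolve.
    - matchers:
        - alertname = "Failed Job Status"
      receiver: deploy-slack-no-resolved    # hypothetical name

receivers:
  - name: deploy-slack
    slack_configs:
      - channel: '#deploys'                 # hypothetical channel
        send_resolved: true
  - name: deploy-slack-no-resolved
    slack_configs:
      - channel: '#deploys'                 # hypothetical channel
        send_resolved: false                # suppress the [RESOLVED] message
```

With a setup along these lines, the firing notification is still delivered, but the alert "aging out" of the rate window no longer produces a misleading resolved message.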

An expression that looks at the current number of failures recorded in the API server is sum(kube_job_status_failed) by (namespace, job_name) > 0. An alert based on this can still be "resolved", but only when the Job objects are removed from the API (which doesn't necessarily mean that a process has succeeded).
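For illustration, the rule from the question could be rewritten around that expression roughly as follows. This is a sketch only: the group name is made up, and the labels and annotations are carried over from the original rule unchanged.

```yaml
groups:
  - name: job-failures                      # hypothetical group name
    rules:
      - alert: Failed Job Status
        # Fires for as long as the Job object still reports failed pods,
        # instead of only while new failures fall inside a time window.
        expr: sum by (namespace, job_name) (kube_job_status_failed) > 0
        for: 1m
        labels:
          severity: deploy_slack
        annotations:
          identifier: '{{ $labels.namespace }} {{ $labels.job_name }}'
          description: '{{ $labels.namespace }} - {{ $labels.job_name }} Failed'
```

Because the sum keeps only namespace and job_name, those are the only labels available to the annotation templates, which matches what the original rule already uses.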
