
Don't show data from redeployed pod in Grafana using PromQL

Original question, 2019-08-09 10:16:54. Tags: grafana, prometheus, promql

I have a PromQL query that looks at the max latency per quantile and displays the data in Grafana, but it shows data from a pod that was redeployed and no longer exists. The pod is younger than the staleness period of 15 days.

Here's the query: max(latency{quantile="..."})

The max latency found is from the time the pod was throttling. Shortly afterwards it was redeployed and went back to normal, and now I want to look only at the max latency of what is currently live.

All the info I have found so far about staleness says the filtering should happen behind the scenes, but that doesn't seem to be happening in the current setup, and I cannot figure out what I should change.

When I manually add the specific instance ID to the query, it works well, but the ID will change once the pod gets redeployed: max(latency{quantile="...", exported_instance="ID"})

Here is a long list of similar questions I found; some are not answered, and some are not asking for the same thing. The ideas I did find that are somewhat relevant, but don't solve the problem in a sustainable way, are:

Suggestions from the links below that were not helpful

Change the staleness period: won't work because it affects the whole system
Restart Prometheus: won't work because it can't be done every time a pod is redeployed
List each graph per machine: won't work with a max query

Links to similar questions

"How do I deal with old collected metrics in Prometheus?": switch prom->elk, log-based monitoring
"Get data from prometheus only from last scrape iteration": staleness is a relevant concept; in Singlestat it shows how to use only the current value
"Grafana dashboard showing deleted information from prometheus": default retention is 15 days; hide machines with a checkbox
"How can I delete old Jobs from Prometheus?": manual query/restart
"grafana variable still catch old metrics info": update Prometheus targets
"Clear old data in Grafana": delete with Prometheus settings
https://community.grafana.com/t/prometheus-push-gateway/18835: not answered
https://www.robustperception.io/staleness-and-promql: explains how the new staleness works, without examples

The end goal

The goal is to display the max latency across all sources that are live now, dropping data from sources that no longer exist.

1 Answer

You can use the automatically generated metric named up to isolate the metrics you need from the others. The up metric makes it easy to determine which metric sources are offline.

up{job="", instance=""}: 1 if the instance is healthy, i.e. reachable, or 0 if the scrape failed.
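
As a minimal sketch, the up metric might be combined with the query from the question roughly as follows. It assumes that the latency series and the up series of the same target share an instance label, which the question does not confirm. The question filters on exported_instance, which suggests the latency series may arrive through an intermediary. In that case instance would identify the intermediary rather than the pod, and this join would not separate live pods from redeployed ones.

```promql
# Assumption: latency and up carry the same `instance` label for a given target.
# `up == 1` keeps only targets whose most recent scrape succeeded.
# `and on (instance)` keeps only latency series that have a matching live target.
max(
  latency{quantile="..."}
  and on (instance)
  (up == 1)
)
```

Querying up == 0 on its own lists the sources whose scrapes are currently failing. If a target has been removed from the scrape configuration altogether, its up series disappears instead of reporting 0, and the and operator then drops the corresponding latency series as well.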