
Prometheus Uptime or SLA percentage over sliding window in Grafana

I want to create a Grafana 'singlestat' panel that shows the uptime or SLA 'percentage', based on the presence or absence of test failure metrics.

I already have the appropriate metric, e2e_tests_failure_count, for different test frameworks. This means that the following query returns the sum of observed test failures:

sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"})

I already managed to create a graph that is "1" if everything is OK and "0" if there are any test failures:

1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1)

[screenshot of the resulting 0/1 graph]

I now want to have a single percentage value that shows the "uptime" (= amount of time the environment was 'healthy') over a period of time, e.g. the last 5 days. Something like "99.5%" or, more appropriate for the screenshot, "65%".

I tried something like this:

(1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"service-cvi-e2e-tests|service-svhb-e2e-tests|service-svh-roundtrip-e2e-tests",kubernetes_namespace="platform-edge"}), 1))[5d]

but this only results in parser errors. Googling didn't really get me any further, so I'm hoping I can find help here :)

Just figured this out and I believe it is producing correct results. You have to use recording rules because you cannot create a range vector from the instant vector result of a function in a single query, as you have already discovered (you get a parse error). So we record the function result (which will be an instant vector) as a new time series and use that as the metric name in a different query, where you can then add the [5d] to select a range.

We run our tests multiple times per minute against all our services, and each service ("service" is a label where each service's name is the label value) has a different number of tests associated with it, but if any of the tests for a given service fails, we consider that a "down moment". (The number of test failures for a given service is captured in the metrics with the status="failure" label value.) We clamp the number of failures to 1 so we only have zeroes and ones for our values and can therefore convert a "failure values time series" into a "success values time series" instead, using an inequality operator and the bool modifier. (See this post for a discussion about the use of bool.) So the result of the first recorded metric is 1 for every service where all its tests succeeded during that scrape interval, and 0 where there was at least one test failure for that service.
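For instance, a sample value of 3 for the failure metric becomes 1 after clamp_max(..., 1), and 1 != bool 1 evaluates to 0 (at least one test failed); a sample value of 0 stays 0, and 0 != bool 1 evaluates to 1 (all tests passed).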

If the number of failures for a service is > 0 for all the values returned for any given minute, we consider that service to be "down" for that minute. (So if we have both a failure and a success in a given minute, that does not count as downtime.) That is why we have the second recorded metric to produce the actual "up for this minute" boolean values. The second recorded metric builds on the first, which is OK since the Prometheus documentation says the recorded metrics are run in series within each group.

So "Uptime" for any given duration is the sum of "up for this minute" values (ie 1 for each minute up) divided by the total number of minutes in the duration, whatever that duration happens to be. 因此,任何给定持续时间的“正常运行时间”是“此分钟的上升”值(即每分钟上升1)的总和除以持续时间中的总分钟数,无论该持续时间恰好是什么。

Since we have defined a recorded metric named "minute_up_bool", we can then create an uptime graph over whatever range we want. (BTW, recorded metrics are only generated for times after you first define them, so you won't get yesterday's time series data included in a recorded metric you define today.) Here's a query you can put in Grafana to show uptime % over a moving window of the last 5 days:

sum_over_time(minute_up_bool[5d]) * 100 / (5 * 24 * 60)
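
For example, 5 days contain 5 * 24 * 60 = 7200 minutes, so if minute_up_bool was 1 for 7164 of those minutes, the panel shows 7164 * 100 / 7200 = 99.5%.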

So this is our recording rule configuration:

groups:
- name: uptime
  interval: 1m
  # Each rule here builds on the previous one.
  rules:
  # Get test results as pass/fail => 1/0
  # (label_replace() removes confusing status="failure" label value)
  - record: test_success_bool
    expr: label_replace(clamp_max(test_statuses_total{status="failure"}, 1), "status", "", "", "") != bool 1
  # Get the uptime as 1 minute range where the sum of successes is not zero
  - record: minute_up_bool
    expr: clamp_max(sum_over_time(test_success_bool[1m]), 1)
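
As an aside, a rule file like this can be checked for syntax errors with promtool before loading it (the file name uptime_rules.yml here is just an assumed example):

promtool check rules uptime_rules.yml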

You have to use recording rules because you cannot create a range vector from the instant vector result of a function in a single query

Actually you can, by using a subquery:

(...some complicated instant subexpression...)[5d:1m]

This gives the same results as if you'd used a recording rule with a 1-minute evaluation interval. The recording rule is still beneficial though, as it avoids recomputing the subexpression every time.
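
As a sketch, assuming the metric and labels from the question and a Prometheus version with subquery support (2.7 or later), the whole 5-day uptime percentage could then be computed in a single query, without any recording rules:

avg_over_time((1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1))[5d:1m]) * 100

avg_over_time over the 0/1 samples of the subquery is the fraction of 1-minute steps that were healthy, so multiplying by 100 gives the uptime percentage.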
