
Prometheus Uptime or SLA percentage over sliding window in Grafana

I want to create a Grafana 'singlestat' panel that shows the uptime or SLA 'percentage', based on the presence or absence of test failure metrics.

I already have the appropriate metric, e2e_tests_failure_count, for different test frameworks. This means that the following query returns the sum of observed test failures:

sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"})

I already managed to create a graph that is "1" if everything is OK and "0" if there are any test failures:

1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1)

[screenshot of the resulting 0/1 graph]

I now want to have a single percentage value that shows the "uptime" (= amount of time the environment was 'healthy') over a period of time, e.g. the last 5 days. Something like "99.5%" or, more appropriate for the screenshot, "65%".

I tried something like this:

(1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"service-cvi-e2e-tests|service-svhb-e2e-tests|service-svh-roundtrip-e2e-tests",kubernetes_namespace="platform-edge"}), 1))[5d]

but this only results in parser errors. Googling didn't really get me any further, so I'm hoping I can find help here :)

Just figured this out and I believe it is producing correct results. You have to use recording rules because you cannot create a range vector from the instant vector result of a function in a single query, as you have already discovered (you get a parse error). So we record the function result (which will be an instant vector) as a new time series and use that as the metric name in a different query, where you can then add the [5d] to select a range.

We run our tests multiple times per minute against all our services, and each service ("service" is a label where each service's name is the label value) has a different number of tests associated with it, but if any of the tests for a given service fails, we consider that a "down moment". (The number of test failures for a given service is captured in the metrics with the status="failure" label value.) We clamp the number of failures to 1 so we only have zeroes and ones for our values and can therefore convert a "failure values time series" into a "success values time series" instead, using an inequality operator and the bool modifier. (See this post for a discussion about the use of bool.) So the result of the first recorded metric is 1 for every service where all its tests succeeded during that scrape interval, and 0 where there was at least one test failure for that service.
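For instance, a sample value of 3 for the failure metric becomes 1 after clamp_max(..., 1), and 1 != bool 1 evaluates to 0 (at least one test failed); a sample value of 0 stays 0, and 0 != bool 1 evaluates to 1 (all tests passed).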

If the number of failures for a service is > 0 for all the values returned for any given minute, we consider that service to be "down" for that minute. (So if we have both a failure and a success in a given minute, that does not count as downtime.) That is why we have the second recorded metric to produce the actual "up for this minute" boolean values. The second recorded metric builds on the first, which is OK since the Prometheus documentation says the recorded metrics are run in series within each group.

So "Uptime" for any given duration is the sum of "up for this minute" values (ie 1 for each minute up) divided by the total number of minutes in the duration, whatever that duration happens to be. 因此,任何给定持续时间的“正常运行时间”是“此分钟的上升”值(即每分钟上升1)的总和除以持续时间中的总分钟数,无论该持续时间恰好是什么。

Since we have defined a recorded metric named "minute_up_bool", we can then create an uptime graph over whatever range we want. (BTW, recorded metrics are only generated for times after you first define them, so you won't get yesterday's time series data included in a recorded metric you define today.) Here's a query you can put in Grafana to show uptime % over a moving window of the last 5 days:

sum_over_time(minute_up_bool[5d]) * 100 / (5 * 24 * 60)
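
For example, 5 days contain 5 * 24 * 60 = 7200 minutes, so if minute_up_bool was 1 for 7164 of those minutes, the panel shows 7164 * 100 / 7200 = 99.5%.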

So this is our recording rule configuration:

groups:
- name: uptime
  interval: 1m
  # Each rule here builds on the previous one.
  rules:
  # Get test results as pass/fail => 1/0
  # (label_replace() removes confusing status="failure" label value)
  - record: test_success_bool
    expr: label_replace(clamp_max(test_statuses_total{status="failure"}, 1), "status", "", "", "") != bool 1
  # Get the uptime as 1 minute range where the sum of successes is not zero
  - record: minute_up_bool
    expr: clamp_max(sum_over_time(test_success_bool[1m]), 1)
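
As an aside, a rule file like this can be checked for syntax errors with promtool before loading it (the file name uptime_rules.yml here is just an assumed example):

promtool check rules uptime_rules.yml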

You have to use recording rules because you cannot create a range vector from the instant vector result of a function in a single query

Actually you can, by using a subquery:

(...some complicated instant subexpression...)[5d:1m]

This gives the same results as if you'd used a recording rule with a 1-minute evaluation interval. The recording rule is still beneficial though, as it avoids recomputing the subexpression every time.
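
As a sketch, assuming the metric and labels from the question and a Prometheus version with subquery support (2.7 or later), the whole 5-day uptime percentage could then be computed in a single query, without any recording rules:

avg_over_time((1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1))[5d:1m]) * 100

avg_over_time over the 0/1 samples of the subquery is the fraction of 1-minute steps that were healthy, so multiplying by 100 gives the uptime percentage.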
