Prometheus query for a cache hit rate graph

Question

I'm using a Caffeine cache with a Spring Boot application. All metrics are enabled, so I have them in Prometheus and Grafana.

Based on the cache_gets_total metric, I want to build a hit rate graph.

I've tried to get the cache hits:

delta(cache_gets_total{result="hit",name="myCache"}[1m])

and all gets from the cache:

sum(delta(cache_gets_total{name="myCache"}[1m]))

Both queries work correctly and return values. But when I try to compute the hit ratio, I get no data points. The query I've tried:

delta(cache_gets_total{result="hit",name="myCache"}[1m]) / sum(delta(cache_gets_total{name="myCache"}[1m]))

Why doesn't this query work, and how can I get a hit rate graph from the information that Spring Boot and Caffeine expose?

Answer 1

First of all, it is recommended to use increase() instead of delta() for calculating the increase of a counter over the specified lookbehind window. The increase() function properly handles counter resets to zero, which may happen on service restart, while delta() returns incorrect results if the lookbehind window covers a counter reset.
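The difference is easiest to see on a window that contains a restart. The following is a toy Python sketch, not Prometheus code: the sample values are made up, and both functions ignore the extrapolation to the window boundaries that the real delta() and increase() perform.

```python
def naive_delta(samples):
    """Last value minus first value, ignoring any counter reset in between."""
    return samples[-1] - samples[0]


def reset_aware_increase(samples):
    """Sum the per-step growth, treating any drop as a counter reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the whole current
        # value counts as new growth.
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical samples of cache_gets_total; the service restarts after the
# third sample, so the counter drops from 120 to 5.
samples = [100.0, 110.0, 120.0, 5.0, 15.0]

print(naive_delta(samples))           # -85.0
print(reset_aware_increase(samples))  # 35.0
```

The naive difference goes negative across the restart, while the reset-aware version counts the 20 gets before the restart and the 15 after it.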

Next, Prometheus searches for pairs of time series with identical sets of labels when performing the / operation. It then applies the operation to each pair individually. Time series returned from increase(cache_gets_total{result="hit",name="myCache"}[1m]) have at least two labels, result="hit" and name="myCache", while the time series returned from sum(increase(cache_gets_total{name="myCache"}[1m])) has no labels, because sum() removes all the labels after aggregation. No pair of series has identical label sets, so the division returns no data points.
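The matching rule can be modeled in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: label sets are represented as frozensets of (label, value) pairs, the numbers are invented, and the function name is hypothetical.

```python
def divide_one_to_one(left, right):
    """Divide series pairwise, pairing only series with identical label sets."""
    result = {}
    for labels, value in left.items():
        if labels in right:  # a pair exists only when the label sets are equal
            result[labels] = value / right[labels]
    return result


# Made-up values for the two sides of the failing query.
hits = {frozenset({("name", "myCache"), ("result", "hit")}): 80.0}
total = {frozenset(): 100.0}  # sum() without by() leaves an empty label set

print(divide_one_to_one(hits, total))  # {}
```

The left side carries two labels and the right side none, so no pair is found and the result is empty, which is what shows up as "no data points" in the graph.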

Prometheus provides a solution to this issue: the on() and group_left() modifiers. The on() modifier limits the set of labels used when searching for time series pairs with identical label sets, while the group_left() modifier allows matching multiple time series on the left side of / with a single time series on the right side. See the Prometheus docs on vector matching. So the following query should return the cache hit rate:

increase(cache_gets_total{result="hit",name="myCache"}[1m])
  / on() group_left()
sum(increase(cache_gets_total{name="myCache"}[1m]))
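As a rough mental model of what on() group_left() changes, the Python sketch below matches series on a restricted label set and keeps the left-hand labels in the result. It is an assumption-laden toy: the function name and data are hypothetical, and unlike Prometheus it does not raise an error when several right-hand series share the same match key.

```python
def divide_on_group_left(left, right, on=()):
    """Many-to-one division: match on the `on` labels only, keep left labels."""
    def key(labels):
        # Reduce a label set to just the labels listed in `on`.
        return frozenset((k, v) for k, v in labels if k in on)

    right_by_key = {key(labels): value for labels, value in right.items()}
    return {
        labels: value / right_by_key[key(labels)]
        for labels, value in left.items()
        if key(labels) in right_by_key
    }


# Made-up values: 80 hits out of 100 gets in the window.
hits = {frozenset({("name", "myCache"), ("result", "hit")}): 80.0}
total = {frozenset(): 100.0}

# on=() means "match on no labels", so every left series pairs with the total.
ratio = divide_on_group_left(hits, total, on=())
print(list(ratio.values()))  # [0.8]
```

With an empty on() list both sides reduce to the same empty key, so the hit series finds its partner and the result keeps the name and result labels from the left side.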

Alternative solutions exist:

Remove all the labels from increase(cache_gets_total{result="hit",name="myCache"}[1m]) with the sum() function:

sum(increase(cache_gets_total{result="hit",name="myCache"}[1m]))
  /
sum(increase(cache_gets_total{name="myCache"}[1m]))

Wrap the right part of the query in the scalar() function. This enables the vector op scalar matching rules described in the Prometheus docs on arithmetic binary operators:

increase(cache_gets_total{result="hit",name="myCache"}[1m])
  /
scalar(sum(increase(cache_gets_total{name="myCache"}[1m])))

It is also possible to get the cache hit rate for all caches with a single query via the sum(...) by (name) pattern:

sum(increase(cache_gets_total{result="hit"}[1m])) by (name)
  /
sum(increase(cache_gets_total[1m])) by (name)
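Because both sides are aggregated by (name), each side ends up with exactly one series per cache and the label sets line up. A small Python sketch of that grouping, with hypothetical cache names and invented increase values, might look like this:

```python
from collections import defaultdict


def sum_by_name(series):
    """Toy equivalent of sum(...) by (name): add up values per cache name."""
    totals = defaultdict(float)
    for labels, value in series.items():
        totals[dict(labels)["name"]] += value
    return dict(totals)


# Hypothetical per-series increases over the window for two caches.
increases = {
    frozenset({("name", "users"), ("result", "hit")}): 90.0,
    frozenset({("name", "users"), ("result", "miss")}): 10.0,
    frozenset({("name", "orders"), ("result", "hit")}): 30.0,
    frozenset({("name", "orders"), ("result", "miss")}): 30.0,
}

# Left side of the query: only the series with result="hit".
hits = {labels: v for labels, v in increases.items() if ("result", "hit") in labels}

hit_sum = sum_by_name(hits)
all_sum = sum_by_name(increases)

# After grouping, both sides are keyed by name alone, so they pair up directly.
hit_rate = {name: hit_sum[name] / all_sum[name] for name in hit_sum if name in all_sum}
print(hit_rate["users"], hit_rate["orders"])  # 0.9 0.5
```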

Answer 2

Run both queries ("cache hits" and "all gets") individually in Prometheus and compare the label sets in the results. For the / operation to work, both sides must have exactly the same labels (and values). Usually some aggregation is required to drop unwanted dimensions/labels. For example, if each query already returns a single value, just wrap both in sum() before dividing.

2 answers

Answer 1
1 2022-04-28 17:47:14

Answer 2
0 Accepted 2019-07-23 10:52:03
