
Why does Prometheus use rate to calculate the request duration?

Since request_duration is just a counter, why do we need rate to calculate the duration? That doesn't seem meaningful.

histogram_quantile(0.99, sum by (le) (rate(server_request_duration_seconds_bucket[1m])))

For example, this query is taken from: https://robert-scherbarth.medium.com/measure-request-duration-with-prometheus-and-golang-adc6f4ca05fe

As stated in the Prometheus documentation:

rate(v range-vector) calculates the per-second average rate of increase of the time series in the range vector.

[...]

rate should only be used with counters. It is best suited for alerting, and for graphing of slow-moving counters.

rate is only useful for getting the "pace" of a counter, i.e. how fast it is growing.

Example use case: get the requests-per-second rate from the incoming request counter.
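To make the "pace of a counter" idea concrete, here is a minimal Python sketch of a per-second rate computed from raw counter samples. The function name `simple_rate` and the sample data are hypothetical, and the sketch deliberately skips the extrapolation to the window boundaries that Prometheus applies, so it is an approximation of the idea rather than the real implementation.

```python
def simple_rate(samples):
    """Per-second average increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Simplified: no extrapolation to the edges of the range window.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset (e.g. the process restarted),
        # so the new value itself is the increase since the reset.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# A request counter scraped every 15 seconds (made-up numbers).
samples = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
print(simple_rate(samples))  # 2.0 requests per second
```

The raw counter value (220) says little on its own; the rate (2 requests per second) is what describes current traffic.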

server_request_duration_seconds is a histogram. It consists of multiple buckets named server_request_duration_seconds_bucket (the _bucket suffix is added to the original histogram name), with the upper boundary of each bucket encoded in the le label. Each bucket is a counter that counts the number of samples with values up to le. For example, server_request_duration_seconds_bucket{le="0.5"} counts the number of requests that took up to 0.5 seconds.
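The cumulative nature of the buckets can be illustrated with a small Python sketch. The bucket boundaries and the `observe` helper are assumptions made for illustration; a real client library does this bookkeeping for you.

```python
import math


def observe(buckets, duration):
    """Record one request duration in a cumulative histogram.

    buckets maps each upper bound (le) to a counter. Every bucket whose
    upper bound is >= the duration is incremented, so the counts are
    cumulative and the +Inf bucket counts every request.
    """
    for le in buckets:
        if duration <= le:
            buckets[le] += 1


# Hypothetical bucket boundaries, in seconds.
buckets = {0.1: 0, 0.5: 0, 1.0: 0, math.inf: 0}
for duration in [0.05, 0.3, 0.4, 0.8, 2.0]:
    observe(buckets, duration)

print(buckets)  # {0.1: 1, 0.5: 3, 1.0: 4, inf: 5}
```

Because each bucket only ever grows, the bucket series are counters, which is why rate applies to them.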

rate(server_request_duration_seconds_bucket[1m]) calculates the average per-second increase over the last minute, individually for each bucket of the server_request_duration_seconds histogram. In other words, the result of rate(...) is the distribution of per-bucket increase rates over the last minute. The same histogram can be exposed by multiple instances (aka replicas or shards) of a single service. So, if you want to calculate the aggregate quantile over all these instances, you need to wrap rate() in sum(...) by (le) before passing it to histogram_quantile.
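As a rough illustration of what sum(...) by (le) does to the per-instance rates, the Python sketch below groups series by their le label and adds up the values, dropping every other label. The series data and the `sum_by_le` name are made up for the example.

```python
from collections import defaultdict


def sum_by_le(series):
    """Aggregate (labels, value) pairs, keeping only the le label."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels["le"]] += value
    return dict(totals)


# Per-bucket rates reported by two hypothetical instances.
series = [
    ({"instance": "a", "le": "0.5"}, 3.0),
    ({"instance": "b", "le": "0.5"}, 1.0),
    ({"instance": "a", "le": "+Inf"}, 4.0),
    ({"instance": "b", "le": "+Inf"}, 2.0),
]

print(sum_by_le(series))  # {'0.5': 4.0, '+Inf': 6.0}
```

The result is a single combined histogram, which is what histogram_quantile needs in order to estimate a service-wide quantile.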

The result of histogram_quantile(0.99, sum(rate(server_request_duration_seconds_bucket[1m])) by (le)) is an estimate of the 99th percentile of the server_request_duration_seconds histogram over the last minute, i.e. the duration in seconds within which 99% of the requests registered over the last minute completed.

Note that it is OK to use increase instead of rate when calculating histogram_quantile. This shouldn't change the result, since increase returns the same distribution shape across buckets as rate (the values differ only by a constant factor, the length of the time window).
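The quantile estimate, and the claim that increase and rate lead to the same answer, can be sketched in Python. This is a simplified version of the linear-interpolation approach and leaves out edge cases that Prometheus handles; the function name mirrors the PromQL function only for readability, and the bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (le, value) pairs, where value is the rate or
    increase of that bucket. Assumes the +Inf bucket and at least one
    finite bucket are present.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket covers every request
    if total <= 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative value reaches the rank.
    idx = next(i for i, (_, value) in enumerate(buckets) if value >= rank)
    upper, value = buckets[idx]
    if math.isinf(upper):
        # The quantile lies beyond the last finite boundary.
        return buckets[-2][0]
    lower, below = buckets[idx - 1] if idx > 0 else (0.0, 0.0)
    # Assume requests are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (value - below)


# Made-up increases over a 60-second window, and the matching rates.
increases = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
rates = [(le, value / 60) for le, value in increases]

print(histogram_quantile(0.99, increases))  # about 0.95
print(histogram_quantile(0.99, rates))      # about 0.95 as well
```

Only the ratios between buckets enter the calculation, so dividing every bucket by the window length leaves the estimate unchanged.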

PS: The rate and increase functions in Prometheus may return unexpected results because of extrapolation (see this issue). This may make the results of histogram_quantile less accurate. If you run into this, try VictoriaMetrics, a Prometheus-like monitoring system that supports PromQL functionality via MetricsQL. Unlike Prometheus, it doesn't use extrapolation when calculating increase and rate, so it is free from extrapolation-related issues. The Prometheus developers are going to fix these issues too (see this design doc).

PPS I'm the core developer of VictoriaMetrics.
