Graphite and statsd, averaging percentile, stddev incorrect

Since statsd calculates statistics for each flush interval (default 10 secs), it seems incorrect for Graphite to simply average these when looking at a longer time window. For example, statsd sends the 90th percentile for 6 flush intervals. If I'm looking at the data in 1 minute buckets, Graphite averages these. It's not accurate to just take the average of 6 ten-second percentiles to create the 90th percentile of the minute.
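To make the discrepancy concrete, here is a minimal sketch with made-up latency samples (the data, the `percentile` helper, and the nearest-rank definition are all assumptions for illustration, not what statsd or Graphite do internally). It compares the average of six per-interval 90th percentiles with the 90th percentile computed over the whole minute:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Six hypothetical 10-second flush intervals of latencies (ms).
# The last interval is much busier and slower than the other five.
intervals = [
    [10, 11, 12, 13, 14],
    [10, 11, 12, 13, 14],
    [10, 11, 12, 13, 14],
    [10, 11, 12, 13, 14],
    [10, 11, 12, 13, 14],
    [100 + i for i in range(50)],
]

# What an "average" roll-up would store for the minute.
per_interval_p90 = [percentile(iv, 90) for iv in intervals]
averaged_p90 = sum(per_interval_p90) / len(per_interval_p90)

# What the 90th percentile of the minute actually is.
all_samples = [v for iv in intervals for v in iv]
true_p90 = percentile(all_samples, 90)

print(f"average of per-interval p90s: {averaged_p90:.1f}")  # 35.7
print(f"p90 over the whole minute:    {true_p90}")          # 142
```

The gap is large here because the busy interval holds most of the samples but counts as only one sixth of the average.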

This is a problem with the other statistics too: mean, median, stddev. For min/max/count it's easy to set up the Graphite storage-aggregation to aggregate correctly. But for the other statistics, averaging isn't correct.

How are people handling this?

You can't. Extracting the percentiles is inherently a lossy operation that cannot be reversed.

The arithmetic mean for the minute, however, can be recovered exactly: sum the values across all 6 intervals and divide by the sum of the counts for those intervals. That restores the accurate mean for the entire minute, though it is not exactly straightforward to set up.
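A minimal sketch of that calculation, assuming each flush interval has a sum and a count available (the `combined_mean` helper and the numbers are hypothetical; if only the mean and count are stored, the sum can be rebuilt as mean times count):

```python
def combined_mean(flushes):
    """flushes: list of (sum, count) pairs, one per flush interval."""
    total_sum = sum(s for s, _ in flushes)
    total_count = sum(c for _, c in flushes)
    if total_count == 0:
        return None  # no samples in the window
    return total_sum / total_count

# Hypothetical (sum, count) for six 10-second intervals.
flushes = [(60, 5), (60, 5), (60, 5), (60, 5), (60, 5), (6225, 50)]

# Naive approach: average the six per-interval means.
mean_of_means = sum(s / c for s, c in flushes) / len(flushes)

# Count-weighted approach: exact mean for the whole minute.
exact_mean = combined_mean(flushes)

print(mean_of_means)  # 30.75
print(exact_mean)     # 87.0
```

The two results only agree when every interval has the same count.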

I've been thinking about the issue too.

Let's take the example of an ICMP check where you are measuring packet loss to a service. You are submitting the min, max, avg, and 90th percentile of your check every 10 seconds.

Here are my thoughts:

  1. This problem doesn't apply to non-sampled values (i.e., if there's only one value per 10 seconds).

  2. If you're sending some sort of summarized measurement for each time period (i.e., min, max, percentiles), whether through statsd or from the check directly, things get complicated.

    • min and max are easy. You can roll them up directly (as you point out).
    • count is also a special case that is handled, as you note.
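As a rough illustration of why these three roll up cleanly, the sketch below combines per-interval values the same way a min, max, or sum aggregation would (the `roll_up` helper and the dictionary layout are hypothetical, not a Graphite API):

```python
def roll_up(intervals):
    """Combine per-interval summaries into one summary for the longer window."""
    return {
        "min": min(iv["min"] for iv in intervals),      # min of mins is the true min
        "max": max(iv["max"] for iv in intervals),      # max of maxes is the true max
        "count": sum(iv["count"] for iv in intervals),  # counts simply add up
    }

# Hypothetical summaries for three 10-second intervals.
intervals = [
    {"min": 10, "max": 14, "count": 5},
    {"min": 9, "max": 30, "count": 7},
    {"min": 100, "max": 149, "count": 50},
]

print(roll_up(intervals))  # {'min': 9, 'max': 149, 'count': 62}
```

No information is lost for these three, because each can be recomputed from the per-interval values alone.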

But when it comes to percentiles, things get really messy.

I think that being able to roll-up/flush with a computed percentile would greatly alleviate the problem.

I'm not sure this is technically a Graphite problem, but I feel that everyone who uses Graphite to "visualize" percentile data has got to be running into this, yet I haven't been able to find much information about it online.

For now, if you want accurate visualization of percentile data over arbitrary time periods, including rolled-up ones, you're going to have to use something like Elasticsearch and go right to the source data (in this case, the result of every ping that you used to derive your statistics).
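Whatever store holds the raw data, the computation itself is simple once every sample is available. A minimal sketch in plain Python, assuming the raw results are `(unix_timestamp, value)` pairs (the `percentile_by_window` helper and the nearest-rank definition are illustrative assumptions, not an Elasticsearch API):

```python
import math
from collections import defaultdict

def percentile_by_window(samples, window_secs, pct):
    """Group raw (timestamp, value) samples into windows and compute a
    nearest-rank percentile per window. Returns {window_start: percentile}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_secs].append(value)

    result = {}
    for start, values in sorted(buckets.items()):
        values.sort()
        rank = max(1, math.ceil(pct * len(values) / 100))
        result[start] = values[rank - 1]
    return result

# Hypothetical ping results: (seconds since some epoch, round-trip time in ms).
samples = [(0, 1), (10, 2), (20, 3), (59, 10), (60, 5), (61, 7)]

print(percentile_by_window(samples, 60, 90))  # {0: 10, 60: 7}
```

Because the window size is just a parameter, the same raw data can answer for 1-minute, 1-hour, or any other bucket without the roll-up error described above.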
