DropWizard Metrics Meters vs Timers

I am learning the DropWizard Metrics library (formerly Coda Hale metrics) and I am confused as to when I should be using Meters vs Timers. According to the docs:

Meter: A meter measures the rate at which a set of events occur

and:

Timer: A timer is basically a histogram of the duration of a type of event and a meter of the rate of its occurrence

Based on these definitions, I can't discern the difference between these. What's confusing me is that Timer is not used the way I would have expected it to be used. To me, Timer is just that: a timer; it should measure the time difference between a start() and stop(). But it appears that Timers also capture the rates at which events occur, which feels like they are stepping on Meters' toes.

If I could see an example of what each component outputs, that might help me understand when and where to use either of these.

You're confused in part because a DW Metrics Timer IS, among other things, a DW Metrics Meter.

A Meter is exclusively concerned with rates, measured in Hz (events per second). Each Meter results in 4 distinct rates being published, alongside a running count of events:

  • a mean (average) rate since the Meter was created
  • 1, 5 and 15-minute moving average rates (exponentially weighted, so recent activity counts for more)
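To give a feel for how the 1, 5 and 15-minute rates behave, here is a minimal sketch of an exponentially weighted moving average. It is not the library's code: EwmaRateSketch and its methods are hypothetical names, and the 5-second tick and the smoothing factor 1 - exp(-5 / 60 / minutes) are assumptions based on my understanding of how the library computes these rates.

```java
/** Hypothetical, simplified moving-average rate; not the DW Metrics class. */
final class EwmaRateSketch {
    private static final double TICK_SECONDS = 5.0; // assumed tick interval
    private final double alpha;                     // smoothing factor per tick
    private double rate;                            // events per second
    private boolean initialized;

    EwmaRateSketch(double windowMinutes) {
        // Assumed formula: a longer window gives a smaller alpha, so slower reaction.
        this.alpha = 1.0 - Math.exp(-TICK_SECONDS / 60.0 / windowMinutes);
    }

    /** Folds the events seen during one 5-second tick into the moving rate. */
    void tick(long eventsInTick) {
        double instantRate = eventsInTick / TICK_SECONDS;
        if (initialized) {
            rate += alpha * (instantRate - rate);
        } else {
            rate = instantRate; // the first tick seeds the average
            initialized = true;
        }
    }

    double rate() {
        return rate;
    }

    public static void main(String[] args) {
        EwmaRateSketch oneMinute = new EwmaRateSketch(1.0);
        for (int i = 0; i < 60; i++) oneMinute.tick(500); // 5 minutes at 100 events/s
        for (int i = 0; i < 12; i++) oneMinute.tick(250); // 1 minute at 50 events/s
        System.out.println("1-minute rate after the drop: " + oneMinute.rate());
    }
}
```

After one minute at the lower load, the moving rate in this sketch sits between the old and new rates rather than exactly at the new one, which is the main practical difference from a strict 60-second window.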

You use a Meter by calling mark() wherever the events you care about occur. Each call adds to a running count (mark() adds 1, mark(n) adds n), and DW Metrics uses the elapsed wall-clock time to calculate the rate at which that count is increasing:

Meter getRequests = registry.meter("some-operation.operations");
int numberOfOps = doSomeNumberOfOperations(); // takes 10 seconds, returns 333
getRequests.mark(numberOfOps); // adds 333 events to the meter's count

We would expect the mean rate to be about 33.3 Hz, as 333 events were marked and roughly 10 seconds passed between creating the meter and the call to mark().
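The arithmetic behind that 33.3 Hz figure can be sketched without the library. The class below is a hypothetical stand-in (MeanRateSketch is not part of DW Metrics); it assumes the mean rate is simply the total marked count divided by the seconds elapsed since the meter was created, and it takes timestamps as arguments so the result does not depend on the real clock.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a meter's mean rate; not the DW Metrics implementation. */
final class MeanRateSketch {
    private final AtomicLong count = new AtomicLong();
    private final long startNanos;

    MeanRateSketch(long startNanos) {
        this.startNanos = startNanos;
    }

    /** Records n events by adding n to the running count. */
    void mark(long n) {
        count.addAndGet(n);
    }

    long count() {
        return count.get();
    }

    /** Events per second between creation and nowNanos (0 if nothing to report). */
    double meanRate(long nowNanos) {
        long elapsedNanos = nowNanos - startNanos;
        if (elapsedNanos <= 0 || count.get() == 0) {
            return 0.0;
        }
        return count.get() / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        MeanRateSketch meter = new MeanRateSketch(0L);
        meter.mark(333);                         // 333 operations completed
        long tenSecondsLater = 10_000_000_000L;  // 10 s expressed in nanoseconds
        System.out.println("count = " + meter.count());
        System.out.println("mean rate (Hz) = " + meter.meanRate(tenSecondsLater));
    }
}
```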

A Timer calculates the 4 rates above (considering each Timer.Context to be one event) and adds to them a number of additional metrics:

  • a count of the number of events
  • min, mean and max durations seen since the start of Metrics
  • standard deviation
  • a "histogram," recording the distribution of durations at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles

There are something like 15 total metrics reported for each Timer.

In short: Timers report a LOT of metrics, and they can be tricky to understand, but once you do they're quite a powerful way to spot spiky behavior.


Fact is, just collecting the time spent between two points isn't a terribly useful metric. Consider: you have a block of code like this:

Timer timer = registry.timer("costly-operation.service-time");
Timer.Context context = timer.time();
costlyOperation(); // service time 10 ms
context.stop();

Let's assume that costlyOperation() has a constant cost, constant load, and operates on a single thread. Inside a 1 minute reporting period, we should expect to time this operation 6000 times. Obviously, we will not be reporting the actual service time over the wire 6000 times -- instead, we need some way to summarize all those operations to fit our desired reporting window. DW Metrics' Timer does this for us, automatically, once a minute (our reporting period). After 5 minutes, our metrics registry would be reporting:

  • a mean rate of 100 (events per second)
  • a 1-minute mean rate of 100
  • a 5-minute mean rate of 100
  • a count of 30000 (total events seen)
  • a max duration of 10 (ms)
  • a min duration of 10
  • a mean duration of 10
  • a 50th percentile (p50) value of 10
  • a 99.9th percentile (p999) value of 10
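The count and rate in that list are plain back-of-the-envelope arithmetic, which a short sketch can reproduce. SteadyLoadSketch and its methods are hypothetical helpers written for this answer, and they assume a single thread calling the operation back to back with no idle time.

```java
/** Hypothetical helper reproducing the steady-load numbers; not part of DW Metrics. */
final class SteadyLoadSketch {
    /** Calls completed in a period by one thread running back to back. */
    static long callsPerPeriod(long periodMillis, long serviceTimeMillis) {
        return periodMillis / serviceTimeMillis;
    }

    /** Events per second, given a count and the elapsed time in milliseconds. */
    static double ratePerSecond(long count, long elapsedMillis) {
        return count * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        long perMinute = callsPerPeriod(60_000, 10);   // 10 ms per call
        long afterFiveMinutes = 5 * perMinute;
        System.out.println("calls per minute: " + perMinute);
        System.out.println("count after 5 minutes: " + afterFiveMinutes);
        System.out.println("rate (events/s): " + ratePerSecond(afterFiveMinutes, 5 * 60_000));
    }
}
```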

Now, let's consider that we enter a period where occasionally our operation goes completely off the rails and blocks for an extended period:

Timer timer = registry.timer("costly-operation.service-time");
Timer.Context context = timer.time();
costlyOperation(); // takes 10 ms usually, but once every 1000 times spikes to 1000 ms
context.stop();

Over a 1 minute collection period, we would now see fewer than 6000 executions, as every 1000th execution takes longer. Each batch of 1000 calls now takes 999 × 10 ms + 1000 ms = 10,990 ms, so a minute fits about 60,000 / 10.99 ≈ 5460 calls. After the first minute of this (6 minutes total system time), and treating the 1- and 5-minute rates as simple averages over their windows, we would now see approximately:

  • a mean rate of 98.5 (events per second)
  • a 1-minute mean rate of 91
  • a 5-minute mean rate of 98.2
  • a count of 35460 (total events seen)
  • a max duration of 1000 (ms)
  • a min duration of 10
  • a mean duration of 10.15
  • a 50th percentile (p50) value of 10
  • a 99.9th percentile (p999) value of 1000

If you graph this, you'd see that most requests (the p50, p75, p99, etc.) were completing in 10 ms, but one request out of 1000 (the p999) was taking 1 s. This would also be seen as a slight reduction in the mean rate (about 2%) and a sizable reduction in the 1-minute rate (nearly 9%).

If you only look at mean values over time (either rate or duration), you'll never spot these spikes -- they get dragged into the background noise when averaged with a lot of successful operations. Similarly, just knowing the max isn't helpful, because it doesn't tell you how frequently the max occurs. This is why histograms are a powerful tool for tracking performance, and why DW Metrics' Timer publishes both a rate AND a histogram.
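To make that point concrete, here is a small sketch that summarizes a batch of durations the way a histogram would. It is an illustration only: SpikeSummarySketch is a hypothetical helper, it uses a simple nearest-rank percentile (the library's own estimator may differ), and the sample assumes two slow calls per thousand, slightly more than in the scenario above, so that the 99.9th percentile lands clearly on the slow calls rather than on the boundary.

```java
import java.util.Arrays;

/** Hypothetical helper using nearest-rank percentiles; not the DW Metrics estimator. */
final class SpikeSummarySketch {
    /** Nearest-rank percentile; perMille is in tenths of a percent (999 means p99.9). */
    static long percentile(long[] sorted, int perMille) {
        // Integer ceiling of perMille * n / 1000, to avoid floating-point edge cases.
        int rank = (int) (((long) perMille * sorted.length + 999) / 1000);
        return sorted[Math.max(1, rank) - 1];
    }

    static double mean(long[] values) {
        return Arrays.stream(values).average().orElse(0.0);
    }

    /** 1000 durations in ms: 998 fast calls (10 ms) and 2 slow calls (1000 ms), sorted. */
    static long[] sample() {
        long[] durations = new long[1000];
        Arrays.fill(durations, 10L);
        durations[400] = 1000L;
        durations[900] = 1000L;
        Arrays.sort(durations);
        return durations;
    }

    public static void main(String[] args) {
        long[] sorted = sample();
        System.out.println("mean = " + mean(sorted));
        System.out.println("p50  = " + percentile(sorted, 500));
        System.out.println("p99  = " + percentile(sorted, 990));
        System.out.println("p999 = " + percentile(sorted, 999));
        System.out.println("max  = " + sorted[sorted.length - 1]);
    }
}
```

In this sample the mean barely moves away from 10 ms and the p50 and p99 do not move at all, while the p999 jumps to the slow value, which is exactly the signal the averages hide.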
