
How to detect anomalous resource consumption reliably?

Asked 2008-12-23 13:35:09. Tags: algorithm, sysadmin, false-positive, resource-monitor

This question is about a whole class of similar problems, but I'll ask it as a concrete example.

I have a server with a file system whose contents fluctuate. I need to monitor the available space on this file system to ensure that it doesn't fill up. For the sake of argument, let's suppose that if it fills up, the server goes down.

It doesn't really matter what the resource is; it might, for example, be a queue of "work".

During "normal" operation, the available space varies within "normal" limits, but there may be pathologies:

Some other (possibly external) component that adds work may run out of control.
Some component that removes work seizes up, but remains undetected.

The statistical characteristics of the process are basically unknown.

What I'm looking for is an algorithm that takes, as input, timed periodic measurements of the available space (alternative suggestions for input are welcome), and produces, as output, an alarm when things are "abnormal" and the file system is "likely to fill up". It is obviously important to avoid false negatives, but almost as important to avoid false positives, so as not to numb the brain of the sysadmin who gets the alarm.

I appreciate that there are alternative solutions, like throwing more storage space at the underlying problem, but I have actually experienced instances where 1000 times the storage wasn't enough.

Algorithms which consider stored historical measurements are fine, although on-the-fly algorithms which minimise the amount of historic data are preferred.

I have accepted Frank's answer, and am now going back to the drawing-board to study his references in depth.

There are, I think, three cases of interest, in no particular order:

The "Harrods' Sale has just started" scenario: a peak of activity that at one-second resolution is "off the dial", but doesn't represent a real danger of resource depletion;
The "Global Warming" scenario: needing to plan for (relatively) stable growth; and
The "Google is sending me an unsolicited copy of The Index" scenario: this will deplete all my resources in relatively short order unless I do something to stop it.

It's the last one that's (I think) most interesting, and challenging, from a sysadmin's point of view.

1 Answer

If it is actually related to a queue of work, then queueing theory may be the best route to an answer.
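As a rough illustration only (a fluid approximation, not a full queueing model; the symbols $\lambda$, $\mu$ and $S$ are introduced here and do not come from the question): suppose work arrives at an average rate $\lambda$ and is removed at an average rate $\mu$, both in the same units of space per unit time, and $S$ is the space currently free. The backlog can only stay bounded while the utilisation $\rho$ is below 1. Once $\lambda > \mu$, the backlog grows at about $\lambda - \mu$ on average, which gives a crude estimate of the time left:

```latex
\rho = \frac{\lambda}{\mu}, \qquad
T_{\text{fill}} \approx \frac{S}{\lambda - \mu} \quad (\lambda > \mu)
```

Both pathologies in the question fit this picture: a runaway producer raises $\lambda$, and a consumer that has seized up pushes $\mu$ towards zero.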

For the general case you could perhaps attempt a (multiple?) linear regression on the historical data, to detect whether there is a statistically significant rising trend in resource usage that is likely to lead to problems if it continues. With this technique you may also be able to predict how long the trend must continue before it causes problems: just set a threshold for 'problem' and use the slope of the trend to determine how long it will take to reach it. You would have to play around with this, and with the variables you collect, to see whether there is any statistically significant relationship to discover in the first place.
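A minimal sketch of the regression idea, assuming samples of used space (capacity minus the available space that the question measures) taken at known times. The function names, the sample data and the cut-off `min_t=3.0` on the slope's t-statistic are all assumptions made for illustration; a real deployment would need to tune the cut-off and the amount of history it fits.

```python
import math

def fit_trend(times, used):
    """Ordinary least-squares line: used = intercept + slope * time.

    Returns (slope, intercept, t_stat), where t_stat is the slope divided
    by its standard error (a large positive value suggests a real rise).
    """
    n = len(times)
    if n < 3:
        raise ValueError("need at least 3 samples")
    mean_t = sum(times) / n
    mean_u = sum(used) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, used))
    slope = sxy / sxx
    intercept = mean_u - slope * mean_t
    rss = sum((u - (intercept + slope * t)) ** 2 for t, u in zip(times, used))
    stderr = math.sqrt(rss / (n - 2) / sxx)
    if stderr > 0:
        t_stat = slope / stderr
    elif slope == 0:
        t_stat = 0.0                              # perfectly flat data
    else:
        t_stat = math.copysign(math.inf, slope)   # perfect straight line
    return slope, intercept, t_stat

def time_to_threshold(times, used, threshold, min_t=3.0):
    """Time from the last sample until the fitted line reaches `threshold`.

    Returns None when there is no rising trend that clears the (assumed)
    significance cut-off `min_t`.
    """
    slope, intercept, t_stat = fit_trend(times, used)
    if slope <= 0 or t_stat < min_t:
        return None
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - times[-1])

# Hypothetical data: seconds since start, and megabytes used.
times = [0, 60, 120, 180, 240]
rising = [100, 230, 330, 470, 580]
print(time_to_threshold(times, rising, threshold=1000))  # 209.0
```

The returned value is in the same units as `times`, so here the fitted line reaches the 1000 MB 'problem' threshold about 209 seconds after the last sample. A plain t-statistic cut-off is used instead of a p-value only to keep the sketch free of third-party dependencies.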

Although it covers a completely different topic (global warming), I've found tamino's blog (tamino.wordpress.com) to be a very good resource on statistical analysis of data that is full of knowns and unknowns. For example, see this post.

edit: as per my comment, I think the problem is somewhat analogous to the global warming (GW) problem. You have short-term bursts of activity which average out to zero, with the long-term trends you are interested in superimposed on them. Also, there is probably more than one long-term trend, and it changes from time to time. Tamino describes a technique which may be suitable for this, but unfortunately I cannot find the post I'm thinking of. It involves sliding regressions along the data (imagine multiple lines fitted to noisy data), and letting the data pick the inflection points. If you could do this, then you could perhaps identify a significant change in the trend. Unfortunately, it may only be identifiable after the fact, as you may need to accumulate a lot of data to get significance. But it might still be in time to head off resource depletion. At least it may give you a robust way to determine what kind of safety margin and resources in reserve you need in future.
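One simple way to let the data pick an inflection point is sketched below. It is not necessarily the technique from Tamino's post, which I could not locate; it is a brute-force two-segment fit that tries every split of the series, fits a separate least-squares line to each side, and keeps the split with the lowest total residual error. The function names, `min_len` and the sample data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit; returns (slope, intercept, residual sum of squares)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, rss

def best_breakpoint(times, used, min_len=5):
    """Find the split into two straight-line segments with the lowest error.

    Returns (index, slope_before, slope_after), where index is the position
    of the first sample in the second segment. Each segment must contain at
    least `min_len` samples.
    """
    n = len(times)
    if n < 2 * min_len:
        raise ValueError("not enough samples for two segments")
    best = None
    for k in range(min_len, n - min_len + 1):
        s1, _, rss1 = fit_line(times[:k], used[:k])
        s2, _, rss2 = fit_line(times[k:], used[k:])
        if best is None or rss1 + rss2 < best[0]:
            best = (rss1 + rss2, k, s1, s2)
    _, k, s1, s2 = best
    return k, s1, s2

# Hypothetical data: flat usage, then steady growth starting between
# sample 9 and sample 10.
times = list(range(20))
used = [50.0 if t < 10 else 50.0 + 5.0 * (t - 9.5) for t in times]
print(best_breakpoint(times, used))
```

A large difference between the two slopes, with the later one positive, is the kind of change in trend that might justify an alarm. This sketch handles a single breakpoint and costs quadratic time in the number of samples, so it would need a bounded history window to run on the fly, and the slope difference would still need a significance check like the one discussed for the single regression.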