
How can I calculate the median and standard deviation of a stream of numbers in Perl?

Original post: 2009-09-29 07:47:05. Tags: perl, statistics, logging, median

In our logfiles we store the response time of each request. What's the most efficient way to calculate the median response time and figures such as "75/90/95% of requests were served in less than time N"? (I guess a variation of my question is: what's the best way to calculate the median and standard deviation of a stream of numbers?)

The best I came up with was just reading all the numbers, sorting them and then picking out the values at the relevant positions, but that seems really goofy. Isn't there a smarter way?

We use Perl, but solutions for any language might be helpful.
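For reference, the read, sort and pick approach described above can be sketched as follows. This is a minimal illustration in Python rather than Perl (any language was said to be fine), using the nearest-rank definition of a percentile. The sample response times and the names `percentile` and `report` are made up for the example.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

# Hypothetical response times, e.g. parsed from a logfile.
times = sorted([120, 85, 300, 95, 110, 2500, 130, 90, 105, 100])

# One sort, then each percentile is a single index lookup.
report = {p: percentile(times, p) for p in (50, 75, 90, 95)}
print(report)
```

The cost is one O(n log n) sort plus O(n) memory, which is often acceptable for a single logfile and only becomes a problem when the data no longer fits in memory.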

4 answers

See the article Calculating Percentiles in Memory-bound Applications. It explains how to calculate the median and other percentiles efficiently.
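The article itself is not reproduced here. One common memory-bounded technique, which may or may not match what the article describes, is to count samples into fixed-width buckets and read approximate percentiles off the counts. A rough Python sketch, where the bucket width `WIDTH_MS`, the sample data and the function name `bucket_percentile` are all assumptions made for illustration:

```python
from collections import Counter

def bucket_percentile(counts, total, p, width):
    """Return the upper edge of the bucket that contains the p-th percentile.
    The result is accurate only to within one bucket width."""
    target = p * total / 100
    seen = 0
    for bucket in sorted(counts):
        seen += counts[bucket]
        if seen >= target:
            return (bucket + 1) * width
    raise ValueError("no samples")

WIDTH_MS = 10          # assumed bucket width
counts = Counter()     # bucket index -> number of samples in that bucket
total = 0

# Hypothetical stream of response times; each value is seen once and discarded.
for t in [120, 85, 300, 95, 110, 2500, 130, 90, 105, 100]:
    counts[t // WIDTH_MS] += 1
    total += 1

print(bucket_percentile(counts, total, 50, WIDTH_MS))
print(bucket_percentile(counts, total, 90, WIDTH_MS))
```

Memory use grows with the number of distinct buckets rather than the number of requests, at the price of answers that are only accurate to the bucket width.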

Also, here's an article on calculating standard deviation (variance) as you go: Accurately computing running variance.
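A well-known single-pass method for this is Welford's algorithm, which updates the mean and the sum of squared deviations one sample at a time. The sketch below is an illustration in Python and does not claim to reproduce the linked article. The class name `RunningStats` and the sample values are invented for the example.

```python
import math

class RunningStats:
    """Running mean and variance via Welford's update, without storing samples."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old mean (via delta) and the new mean, which keeps the
        # update numerically stable.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance (divides by n - 1); 0.0 with fewer than two samples."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def stddev(self):
        return math.sqrt(self.variance())

stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:   # hypothetical samples
    stats.add(x)
print(stats.mean, stats.stddev())
```

This avoids the cancellation problems of the naive "sum of squares minus square of the sum" formula and needs only three numbers of state. Note that it covers mean and standard deviation only; the median cannot be maintained exactly this way.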

You can have a look at quickselect:

http://en.wikipedia.org/wiki/Selection_algorithm

Or at the Wirth algorithm: http://www.mail-archive.com/numpy-discussion@scipy.org/msg20059.html

A benchmark of median algorithms can be found here: http://ndevilla.free.fr/median/median/index.html
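To give an idea of what quickselect does, here is a short Python sketch. It is a simple list-partitioning variant written for clarity, not the in-place version from the pages linked above, and the sample data is invented.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    while True:
        pivot = random.choice(items)
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(lower):
            items = lower                      # answer is among the smaller values
        elif k < len(lower) + len(equal):
            return pivot                       # answer is the pivot itself
        else:
            k -= len(lower) + len(equal)       # skip everything <= pivot
            items = [x for x in items if x > pivot]

# Hypothetical response times (odd count, so the median is a single element).
times = [120, 85, 300, 95, 110, 2500, 130, 90, 105]
median = quickselect(times, len(times) // 2)
print(median)
```

The expected running time is linear in the number of samples, compared with O(n log n) for a full sort, but all samples still have to be held in memory.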

Have a look at PDL, the Perl Data Language.

Also see these previous SO questions about mean/std dev: