How to optimise an Exponential Moving Average algorithm in PHP?
I'm trying to retrieve the last EMA of a large dataset (15,000+ values). It is a very resource-hungry algorithm since each value depends on the previous one. Here is my code:
$k = 2 / ($range + 1);
$lastEMA = 0;
for ($i = 0; $i < $size_data; ++$i) {
    $lastEMA = $lastEMA + $k * ($data[$i] - $lastEMA);
}
What I already did:

- Isolated $k so it is not computed 10,000+ times
- Used for() instead of foreach()
This allowed me to reduce execution time from 2000 ms to about 500 ms for 15,000 values!
What didn't work:
Writing and running the same algorithm in C# over 2,000,000 values takes only 13 ms! So obviously, using a compiled, lower-level language seems to help ;P
Where should I go from here? The code will ultimately run on Ubuntu, so which language should I choose? Will PHP be able to call the script and pass such a huge argument to it?
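On the "huge argument" concern: one hedged way around it is to stream the values to the external process over stdin instead of building a giant argument list (argv size is limited by the OS). A minimal sketch, where awk stands in for a compiled helper (a real C/C# binary would read stdin the same way) and the function name `emaViaPipe` is hypothetical:

```php
<?php
// Sketch: stream a large dataset to an external process over stdin instead of
// passing it as one huge argument (argv size is limited by the OS).
// awk stands in here for a compiled helper; a real C/C# binary would read
// stdin the same way. "emaViaPipe" is a hypothetical name for illustration.
function emaViaPipe(array $data, int $range): float
{
    $k = 2 / ($range + 1);
    // Plain EMA recurrence, one input value per line; e starts at 0 in awk
    $cmd = "awk -v k=$k '{e = e + k * (\$1 - e)} END {printf \"%.10f\", e}'";
    $proc = proc_open($cmd, [0 => ['pipe', 'r'], 1 => ['pipe', 'w']], $pipes);
    foreach ($data as $value) {
        fwrite($pipes[0], $value . "\n");
    }
    fclose($pipes[0]);          // EOF tells the child to finish
    $out = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return (float) $out;
}
```

Streaming keeps memory flat on the PHP side and sidesteps any argument-length limit entirely.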
Clearly, implementing this as an extension gives you a significant boost. Additionally, the calculation itself can be improved, and that gain carries over to whichever language you choose.
It is easy to see that lastEMA can be calculated as follows:
$lastEMA = 0;
$k = 2 / ($range + 1);
for ($i = 0; $i < $size_data; ++$i) {
    $lastEMA = (1 - $k) * $lastEMA + $k * $data[$i];
}
This can be rewritten as follows in order to move as much as possible out of the loop:
$lastEMA = 0;
$k = 2 / ($range + 1);
$k1m = 1 - $k;
for ($i = 0; $i < $size_data; ++$i) {
    $lastEMA = $k1m * $lastEMA + $data[$i];
}
$lastEMA = $lastEMA * $k;
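As a quick sanity check, the factored version produces the same result as the plain recurrence; a small sketch (function names are illustrative):

```php
<?php
// Sanity check: the factored-out-$k loop matches the plain EMA recurrence.
// Both start from $lastEMA = 0, as in the snippets above.
function emaPlain(array $data, int $range): float
{
    $k = 2 / ($range + 1);
    $lastEMA = 0.0;
    foreach ($data as $v) {
        $lastEMA = (1 - $k) * $lastEMA + $k * $v;
    }
    return $lastEMA;
}

function emaFactored(array $data, int $range): float
{
    $k = 2 / ($range + 1);
    $k1m = 1 - $k;
    $lastEMA = 0.0;
    foreach ($data as $v) {
        // One multiplication saved per iteration; $k is applied once at the end
        $lastEMA = $k1m * $lastEMA + $v;
    }
    return $lastEMA * $k;
}
```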
To explain the extraction of $k: in the previous formulation, it is as if all the original raw data were multiplied by $k, so in practice you can multiply the end result instead.
Note that, rewritten this way, you have 2 operations inside the loop instead of 3 (to be precise, inside the loop there are also the $i increment, the $i comparison with $size_data, and the $lastEMA assignment), so you can expect an additional speedup in the range of 16% to 33%.
Furthermore, there are other improvements that can be considered, at least in some circumstances:
The first values are multiplied several times by $k1m = 1 - $k, so their contribution may be small or may even fall below the floating point precision (or the acceptable error).
This idea is particularly helpful if you can assume that older data are of the same order of magnitude as the newer, because if you consider only the last $n values, the error you make is
$err = $EMA_of_discarded_data * (1-$k) ^ $n
So if the order of magnitude is broadly the same, we can say that the relative error is
$rel_err = $err / $lastEMA = $EMA_of_discarded_data * (1-$k) ^ $n / $lastEMA

which is almost equal to simply (1-$k) ^ $n
under the assumption that $lastEMA is almost equal to $EMA_of_discarded_data. If that assumption cannot be made, things are less easy, but since the advantage can be significant, it can be worthwhile to go on:
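To make the truncation idea concrete, here is a rough sketch (the function name is hypothetical) that processes only the last $n values. For example, with $range = 10, i.e. $k = 2/11, about 69 values are enough to push (1-$k) ^ $n below 1e-6:

```php
<?php
// Sketch: truncated EMA over only the last $n values, versus the full history.
// Under the "same order of magnitude" assumption, the relative error shrinks
// like (1 - $k) ^ $n, so a few dozen values often suffice.
function emaTruncated(array $data, int $range, int $n): float
{
    $k = 2 / ($range + 1);
    $lastEMA = 0.0;
    $start = max(0, count($data) - $n);   // discard everything before $start
    for ($i = $start; $i < count($data); ++$i) {
        $lastEMA = (1 - $k) * $lastEMA + $k * $data[$i];
    }
    return $lastEMA;
}
```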
The calculation can be rewritten in a form where it is a simple addition of independent terms:
$lastEMA = 0;
$k = 2 / ($range + 1);
$k1m = 1 - $k;
for ($i = 0; $i < $size_data; ++$i) {
    // pow(), not ^: in PHP, ^ is bitwise XOR, not exponentiation
    $lastEMA += pow($k1m, $size_data - 1 - $i) * $data[$i];
}
$lastEMA = $lastEMA * $k;
So if the implementing language supports parallelization, the dataset can be divided into 4 (or 8, or n... basically the number of available CPU cores) chunks, the sum of terms on each chunk can be computed in parallel, and the individual results summed up at the end.
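A sequential sketch of that chunking (names are illustrative; real parallelism would need e.g. the pcntl or parallel extension): each chunk computes its own locally weighted partial sum independently, which is then shifted to its global position by pow($k1m, $n - $end) before the partial results are combined:

```php
<?php
// Sketch: split the sum-of-independent-terms form into chunks. Each chunk's
// inner loop is independent of the others, so it could run on a separate core;
// the partial sums are then re-weighted and combined at the end.
function emaChunked(array $data, int $range, int $chunks): float
{
    $k = 2 / ($range + 1);
    $k1m = 1 - $k;
    $n = count($data);
    $size = (int) ceil($n / $chunks);
    $total = 0.0;
    for ($start = 0; $start < $n; $start += $size) {
        $end = min($start + $size, $n);
        // Local partial sum, weighted relative to the chunk's last element
        $partial = 0.0;
        for ($i = $start; $i < $end; ++$i) {
            $partial = $k1m * $partial + $data[$i];
        }
        // Shift the chunk's weights to their global position
        $total += pow($k1m, $n - $end) * $partial;
    }
    return $total * $k;
}
```

With one chunk this degenerates to the factored loop above, and the result is independent of the chunk count.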
I won't go into more detail on this, since this reply is already terribly long and I think the concept has been expressed.
Building your own extension definitely improves performance. Here's a good tutorial from the Zend website.
Some performance figures: Hardware: Ubuntu 14.04, PHP 5.5.9, 1-core Intel CPU @ 3.3GHz, 128MB RAM (it's a VPS).
But I'm memory-limited at this point, using 70MB. I will fix that and update the numbers accordingly.