
How to optimise an Exponential Moving Average algorithm in PHP?

I'm trying to retrieve the last EMA of a large dataset (15,000+ values). It is a very resource-hungry algorithm, since each value depends on the previous one. Here is my code:

$lastEMA = 0;
$k = 2/($range+1);
for ($i = 0; $i < $size_data; ++$i) {
    $lastEMA = $lastEMA + $k * ($data[$i] - $lastEMA);
}

What I already did:

  1. Isolate $k so it is not computed 10,000+ times
  2. Keep only the latest computed EMA, instead of keeping all of them in an array
  3. Use for() instead of foreach()
  4. The $data[] array doesn't have keys; it's a basic array

This allowed me to reduce execution time from 2000 ms to about 500 ms for 15,000 values!
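Putting those optimisations together, the loop can be wrapped as a self-contained function. This is a sketch: the function name lastEMA() and the seed value of 0 are illustrative choices, not from the original code.

```php
<?php
// Sketch of the optimised loop as a self-contained function.
// The name lastEMA() and the 0.0 seed are illustrative choices.
function lastEMA(array $data, int $range): float
{
    $k = 2 / ($range + 1);   // smoothing factor, computed once
    $lastEMA = 0.0;          // seed; early iterations wash this out
    $size_data = count($data);
    for ($i = 0; $i < $size_data; ++$i) {
        $lastEMA = $lastEMA + $k * ($data[$i] - $lastEMA);
    }
    return $lastEMA;
}
```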

What didn't work:

  1. Using SplFixedArray(): this shaved only ~10 ms when executing 1,000,000 values
  2. Using the PHP Trader extension: this returns an array containing all the EMAs instead of just the latest, and it's slower

Writing the same algorithm in C# and running it over 2,000,000 values takes only 13 ms! So obviously, using a compiled, lower-level language seems to help ;P

Where should I go from here? The code will ultimately run on Ubuntu, so which language should I choose? Will PHP be able to call the script and pass it such a huge argument?

Clearly, implementing this as an extension gives you a significant boost. Additionally, the calculation itself can be improved, and that gain carries over to whichever language you choose.

It is easy to see that lastEMA can be calculated as follows:

$lastEMA = 0;
$k = 2/($range+1);
for ($i = 0; $i < $size_data; ++$i) {
    $lastEMA = (1-$k) * $lastEMA + $k * $data[$i];
}

This can be rewritten as follows, moving as much as possible out of the loop:

$lastEMA = 0;
$k = 2/($range+1);
$k1m = 1 - $k;
for ($i = 0; $i < $size_data; ++$i) {
    $lastEMA = $k1m * $lastEMA + $data[$i];
}
$lastEMA = $lastEMA * $k;

To explain the extraction of $k: in the previous formulation, it is as if every original raw value were multiplied by $k, so you can instead multiply the end result by $k once.

Note that, rewritten this way, you have 2 operations inside the loop instead of 3 (to be precise, inside the loop there are also the $i increment, the comparison of $i with $size_data, and the assignment to $lastEMA), so you can expect an additional speedup somewhere between 16% and 33%.
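As a quick sanity check (a sketch with made-up sample data), the two formulations agree up to floating point noise:

```php
<?php
// Verify on sample data that factoring $k out of the loop does not
// change the result beyond floating point noise.
$range = 10;
$data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0];
$k = 2 / ($range + 1);

// Form 1: $k applied inside the loop.
$emaA = 0.0;
foreach ($data as $v) {
    $emaA = (1 - $k) * $emaA + $k * $v;
}

// Form 2: $k factored out and applied once at the end.
$k1m = 1 - $k;
$emaB = 0.0;
foreach ($data as $v) {
    $emaB = $k1m * $emaB + $v;
}
$emaB *= $k;

var_dump(abs($emaA - $emaB) < 1e-12); // prints bool(true)
```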

Further improvements can be considered, at least in some circumstances:

Consider only the last values

The first values are multiplied many times by $k1m = 1 - $k, so their contribution may be small, or even fall below the floating point precision (or the acceptable error).

This idea is particularly helpful if you can assume that older data are of the same order of magnitude as the newer, because if you consider only the last $n values, the error you make is

$err = $EMA_of_discarded_data * (1-$k) ^ $n

So, if the order of magnitude is broadly the same, the relative error is

$rel_err = $err / $lastEMA = $EMA_of_discarded_data * (1-$k) ^ $n / $lastEMA

which is almost equal to simply (1-$k) ^ $n.

Under the assumption that $lastEMA is almost equal to $EMA_of_discarded_data:

  • Let's say you can accept a relative error $rel_err.
    • You can safely consider only the last $n values, where (1 - $k)^$n < $rel_err.
    • This means you can pre-calculate (before the loop) $n = log($rel_err) / log(1 - $k) and compute everything considering only the last $n values.
    • If the dataset is very big, this can give a sensible speedup.
  • Consider that for 64-bit floating point numbers you have a relative precision (related to the mantissa) of 2^-53 (about 1.1e-16; only 2^-24 = 5.96e-8 for 32-bit floating point numbers), so you cannot obtain a relative error better than this.
    • So basically you never gain anything by calculating more than $n = log(1.1e-16) / log(1 - $k) values.
    • To give an example: if $range = 2000, then $n = log(1.1e-16) / log(1 - 2/2001) = 36'746.
      • It is interesting to know that any extra calculations would get lost in the rounding ==> they are useless ==> better not to do them.
  • Now, an example for the case where you can accept a relative error larger than the floating point precision: with $rel_err = 1 ppm = 1e-6 = 0.0001% = 6 significant decimal digits, you have $n = log(1e-6) / log(1 - 2/2001) = 13'815.
    • I think that is quite a small number compared to your sample counts, so in those cases the speedup could be evident (I'm assuming that $range = 2000 is meaningful or high for your application, but that I cannot know).
  • Just a few other numbers, because I do not know what your typical figures are:
    • $rel_err = 1e-3; $range = 2000 => $n = 6'907
    • $rel_err = 1e-3; $range = 200 => $n = 691
    • $rel_err = 1e-3; $range = 20 => $n = 69
    • $rel_err = 1e-6; $range = 2000 => $n = 13'815
    • $rel_err = 1e-6; $range = 200 => $n = 1'381
    • $rel_err = 1e-6; $range = 20 => $n = 138
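This truncation can be sketched as follows. The function name and the ceil() rounding are our choices; the sketch assumes, per the discussion above, that the values have similar magnitude.

```php
<?php
// Sketch: compute the last EMA using only the most recent $n values,
// where $n is chosen so the discarded tail contributes less than the
// accepted relative error. Assumes data of similar magnitude.
function lastEMATruncated(array $data, int $range, float $rel_err): float
{
    $k = 2 / ($range + 1);
    $k1m = 1 - $k;
    // Smallest $n with (1 - $k)^$n < $rel_err:
    $n = (int) ceil(log($rel_err) / log($k1m));
    $size = count($data);
    $start = max(0, $size - $n);   // everything before $start is discarded
    $lastEMA = 0.0;
    for ($i = $start; $i < $size; ++$i) {
        $lastEMA = $k1m * $lastEMA + $data[$i];
    }
    return $lastEMA * $k;
}
```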

If the assumption that $lastEMA is almost equal to $EMA_of_discarded_data cannot be made, things are less easy, but since the advantage can be significant, it can be worthwhile to go on:

  • We need to re-consider the full formula: $rel_err = $EMA_of_discarded_data * (1-$k) ^ $n / $lastEMA
  • So $n = log($rel_err * $lastEMA / $EMA_of_discarded_data) / log(1-$k) = (log($rel_err) + log($lastEMA / $EMA_of_discarded_data)) / log(1-$k)
  • The central point is to estimate $lastEMA / $EMA_of_discarded_data (without actually calculating $lastEMA or $EMA_of_discarded_data, of course)
    • One case is when we know a priori that, for example, $EMA_of_discarded_data / $lastEMA < M (for example M = 1000 or M = 1e6)
      • In that case, $n < log($rel_err / M) / log(1-$k)
    • If you cannot give any such M
      • You have to find a good way to over-estimate $EMA_of_discarded_data / $lastEMA
      • One quick way could be to take M = max(data) / min(data)
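A sketch of that estimate; valuesNeeded() is a hypothetical helper name, and it assumes strictly positive data so that max/min is a valid ratio:

```php
<?php
// Sketch: bound the number of recent values needed when old and new
// data may differ in magnitude, over-estimating the ratio with
// M = max(data) / min(data). Assumes strictly positive data.
function valuesNeeded(array $data, int $range, float $rel_err): int
{
    $k = 2 / ($range + 1);
    // Over-estimate of $EMA_of_discarded_data / $lastEMA:
    $M = max($data) / min($data);
    return (int) ceil(log($rel_err / $M) / log(1 - $k));
}
```

A tighter (smaller) M directly shrinks the number of values that must be kept.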

Parallelization

The calculation can be re-written in a form where it is a simple addition of independent terms:

$lastEMA = 0;
$k = 2/($range+1);
$k1m = 1 - $k;
for ($i = 0; $i < $size_data; ++$i) {
    $lastEMA += pow($k1m, $size_data - 1 - $i) * $data[$i];
}
$lastEMA = $lastEMA * $k;

So, if the implementing language supports parallelization, the dataset can be divided into 4 (or 8, or n... basically the number of available CPU cores) chunks, the sum of terms can be computed on each chunk in parallel, and the individual results summed up at the end.
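A sequential sketch of that chunked computation: the array_chunk() splitting and the combination step are our choices, and an actual parallel run would need something like pcntl_fork() or a job queue, which is outside the scope here.

```php
<?php
// Sketch: compute the EMA as independent per-chunk partial sums that
// are scaled and combined at the end. The chunks run sequentially
// here, but each iteration of the outer loop is independent work.
function emaFromChunks(array $data, int $range, int $numChunks): float
{
    $k = 2 / ($range + 1);
    $k1m = 1 - $k;
    $size = count($data);
    $chunkSize = (int) ceil($size / $numChunks);
    $total = 0.0;
    foreach (array_chunk($data, $chunkSize) as $c => $chunk) {
        // Horner-style partial sum inside the chunk.
        $partial = 0.0;
        foreach ($chunk as $v) {
            $partial = $k1m * $partial + $v;
        }
        // Every value after this chunk multiplies it by (1 - $k) once.
        $end = min($size, ($c + 1) * $chunkSize);
        $total += $partial * pow($k1m, $size - $end);
    }
    return $total * $k;
}
```

The result is identical (up to floating point noise) regardless of the number of chunks, which is what makes the per-chunk work parallelizable.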

I won't go into detail on this, since this reply is already terribly long and I think the concept has been expressed.

Building your own extension definitely improves performance. Here's a good tutorial from the Zend website.

Some performance figures. Hardware: Ubuntu 14.04, PHP 5.5.9, 1-core Intel CPU @ 3.3 GHz, 128 MB RAM (it's a VPS).


  • Before (PHP only, 16,000 values): 500 ms
  • C extension (16,000 values): 0.3 ms
  • C extension (100,000 values): 3.7 ms
  • C extension (500,000 values): 28.0 ms

But I'm memory-limited at this point, using 70 MB. I will fix that and update the numbers accordingly.
