arrayfun can be significantly slower than an explicit loop in MATLAB. Why?

Question

Consider the following simple speed test for arrayfun:

T = 4000;
N = 500;
x = randn(T, N);
Func1 = @(a) (3*a^2 + 2*a - 1);

tic
Soln1 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln1(t, n) = Func1(x(t, n));
    end
end
toc

tic
Soln2 = arrayfun(Func1, x);
toc

On my machine (Matlab 2011b on Linux Mint 12), the output of this test is:

Elapsed time is 1.020689 seconds.
Elapsed time is 9.248388 seconds.

What the?!? arrayfun, while admittedly a cleaner-looking solution, is an order of magnitude slower. What is going on here?

Further, I ran a similar test for cellfun and found it to be about 3 times slower than an explicit loop. Again, this result is the opposite of what I expected.

My question is: Why are arrayfun and cellfun so much slower? And given this, are there any good reasons to use them (other than to make the code look good)?

Note: I'm talking about the standard version of arrayfun here, NOT the GPU version from the Parallel Computing Toolbox.

EDIT: Just to be clear, I'm aware that Func1 above can be vectorized, as pointed out by Oli. I only chose it because it yields a simple speed test for the purposes of the actual question.
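The vectorized form mentioned in the edit above is not shown in the thread as reproduced here. A minimal sketch of what it presumably looks like, assuming the same T, N and x as in the test (the variable name SolnVec is made up for illustration):

```matlab
% Vectorized evaluation of the same polynomial, element by element.
T = 4000;
N = 500;
x = randn(T, N);

tic
SolnVec = 3*x.^2 + 2*x - 1;   % .^ is the element-wise power
toc

% Sanity check against the scalar function handle on one entry.
Func1 = @(a) (3*a^2 + 2*a - 1);
assert(abs(SolnVec(1, 1) - Func1(x(1, 1))) < 1e-12);
```

The only change to the expression is `.^` in place of `^`, so that the power is applied per element rather than as a matrix power.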

EDIT: Following the suggestion of grungetta, I re-did the test with feature accel off. The results are:

Elapsed time is 28.183422 seconds.
Elapsed time is 23.525251 seconds.

In other words, it would appear that a big part of the difference is that the JIT accelerator does a much better job of speeding up the explicit for loop than it does arrayfun. This seems odd to me, since arrayfun actually provides more information, i.e., its use reveals that the order of the calls to Func1 does not matter. Also, I noted that whether the JIT accelerator is switched on or off, my system only ever uses one CPU...

Answer 1

You can get the idea by running other versions of your code. Consider explicitly writing out the computations instead of using a function in your loop:

tic
Soln3 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln3(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Time to compute on my computer:

Soln1  1.158446 seconds.
Soln2  10.392475 seconds.
Soln3  0.239023 seconds.
Oli    0.010672 seconds.

Now, while the fully 'vectorized' solution is clearly the fastest, you can see that defining a function to be called for every x entry is a huge overhead. Just explicitly writing out the computation got us a factor 5 speedup. I guess this shows that MATLAB's JIT compiler does not support inlining functions. According to an answer by gnovice on that topic, it is actually better to write a normal function rather than an anonymous one. Try it.
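One way to try the suggestion above is sketched below. It is only an illustration: the function names compare_call_styles and poly1 are hypothetical and not from the original post, and the code is assumed to be saved as compare_call_styles.m. No timings are claimed, since they depend on the MATLAB release and machine.

```matlab
function compare_call_styles()
% Hypothetical helper: times the same double loop with an anonymous
% function handle and with a named local function.
T = 4000;
N = 500;
x = randn(T, N);
FuncAnon = @(a) (3*a^2 + 2*a - 1);

tic
SolnAnon = ones(T, N);
for n = 1:N
    for t = 1:T
        SolnAnon(t, n) = FuncAnon(x(t, n));
    end
end
fprintf('anonymous handle: %f seconds\n', toc);

tic
SolnNamed = ones(T, N);
for n = 1:N
    for t = 1:T
        SolnNamed(t, n) = poly1(x(t, n));
    end
end
fprintf('named function:   %f seconds\n', toc);

% Both variants must produce the same values.
assert(max(abs(SolnAnon(:) - SolnNamed(:))) < 1e-12);
end

function y = poly1(a)
% Same polynomial as Func1, written as an ordinary (local) function.
y = 3*a^2 + 2*a - 1;
end
```

Running `compare_call_styles` prints one timing per call style, so the per-call overhead of the two can be compared directly on your own setup.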

Next step - remove (vectorize) the inner loop:

tic
Soln4 = ones(T, N);
for t = 1:T
    Soln4(t, :) = 3*x(t, :).^2 + 2*x(t, :) - 1;
end
toc

Soln4  0.053926 seconds.

Another factor 5 speedup: there is something in those statements saying you should avoid loops in MATLAB... Or is there really? Have a look at this, then:

tic
Soln5 = ones(T, N);
for n = 1:N
    Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
end
toc

Soln5   0.013875 seconds.

Much closer to the 'fully' vectorized version. Matlab stores matrices column-wise. You should always (when possible) structure your computations to be vectorized 'column-wise'.
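The column-wise storage mentioned above can be checked on a tiny matrix. The following is a small illustrative sketch (the matrix A and the variable names are made up for the example):

```matlab
% Small matrix so the layout is easy to inspect.
A = [1 2 3;
     4 5 6];             % 2 rows, 3 columns

% A(:) lists the elements in storage order: down each column first.
storageOrder = A(:).';   % 1 4 2 5 3 6

% The linear index of A(t, n) in a T-by-N matrix is t + (n - 1)*T.
[T, N] = size(A);
t = 2; n = 3;
linIdx = sub2ind([T, N], t, n);
assert(linIdx == t + (n - 1)*T);
assert(A(linIdx) == A(t, n));
```

Because consecutive elements of a column are adjacent in memory, `x(:, n)` touches one contiguous block, whereas `x(t, :)` strides through memory in steps of T elements.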

We can go back to Soln3 now. The loop order there is 'row-wise'. Let's change it:

tic
Soln6 = ones(T, N);
for n = 1:N
    for t = 1:T
        Soln6(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Soln6  0.201661 seconds.

Better, but still very bad. Single loop - good. Double loop - bad. I guess MATLAB did some decent work on improving the performance of loops, but the loop overhead is still there. If the loop body did heavier work, you would not notice it. But since this computation is memory-bandwidth bound, you do see the loop overhead. And you will see the overhead of calling Func1 there even more clearly.

So what's up with arrayfun? No function inlining there either, so a lot of overhead. But why so much worse than a double nested loop? Actually, the topic of using cellfun/arrayfun has been extensively discussed many times. These functions are simply slow; you cannot use them for such fine-grained computations. You can use them for code brevity and fancy conversions between cells and arrays. But the function needs to be heavier than what you wrote:

tic
Soln7 = arrayfun(@(a)(3*x(:,a).^2 + 2*x(:,a) - 1), 1:N, 'UniformOutput', false);
toc

Soln7  0.016786 seconds.

Note that Soln7 is a cell now... sometimes that is useful. Code performance is quite good now, and if you need a cell as output, you do not need to convert your matrix after you have used the fully vectorized solution.
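If a plain matrix is wanted after all, the cell produced by the column-wise arrayfun call can be concatenated back. A minimal sketch, assuming the same T, N and x as before (Soln7Mat and SolnVec are names introduced here for illustration):

```matlab
T = 4000;
N = 500;
x = randn(T, N);

% One call per column; each cell holds a T-by-1 result.
Soln7 = arrayfun(@(a) (3*x(:, a).^2 + 2*x(:, a) - 1), 1:N, ...
                 'UniformOutput', false);

% Concatenate the 1-by-N cell of columns back into a T-by-N matrix.
Soln7Mat = [Soln7{:}];

% Compare with the fully vectorized result.
SolnVec = 3*x.^2 + 2*x - 1;
assert(isequal(size(Soln7Mat), [T, N]));
assert(max(abs(Soln7Mat(:) - SolnVec(:))) < 1e-12);
```

The conversion copies the data once more, so it only makes sense when the cell form is not what the rest of the code needs.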

So why is arrayfun slower than a simple loop structure? Unfortunately, it is impossible for us to say for sure, since there is no source code available. You can only guess that since arrayfun is a general-purpose function, which handles all kinds of different data structures and arguments, it is not necessarily very fast in simple cases that you can directly express as loop nests. Where the overhead comes from, we cannot know. Could the overhead be avoided by a better implementation? Maybe not. But unfortunately the only thing we can do is study the performance to identify the cases in which it works well and those in which it doesn't.

Update: Since the execution time of this test is short, I have now added a loop around the tests to get reliable results:

for i=1:1000
   % compute
end

Some timings are given below:

Soln5   8.192912 seconds.
Soln7  13.419675 seconds.
Oli     8.089113 seconds.

You see that arrayfun is still bad, but at least not three orders of magnitude worse than the vectorized solution. On the other hand, a single loop with column-wise computations is as fast as the fully vectorized version... That was all done on a single CPU. Results for Soln5 and Soln7 do not change if I switch to 2 cores. In Soln5 I would have to use a parfor to get it parallelized. Forget about speedup... Soln7 does not run in parallel because arrayfun does not run in parallel. Oli's vectorized version, on the other hand, does get faster with 2 cores:

Oli  5.508085 seconds.
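The repeated-timing loop described in the update could be sketched as follows. This is an illustration only: nRep is an assumed, smaller repetition count than the 1000 used above, and the printed values will differ by machine and MATLAB release. Newer MATLAB releases also provide a timeit function that automates this kind of measurement.

```matlab
T = 4000;
N = 500;
x = randn(T, N);
nRep = 20;   % assumed repetition count; the answer used 1000

% Column-wise single loop (Soln5), repeated nRep times.
tStart = tic;
for i = 1:nRep
    Soln5 = ones(T, N);
    for n = 1:N
        Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
    end
end
fprintf('Soln5:      %f seconds per run\n', toc(tStart) / nRep);

% Fully vectorized version, repeated nRep times.
tStart = tic;
for i = 1:nRep
    SolnVec = 3*x.^2 + 2*x - 1;
end
fprintf('Vectorized: %f seconds per run\n', toc(tStart) / nRep);
```

Averaging over several runs smooths out one-off costs such as the first-call compilation and memory allocation, which dominate when a single run takes only a few milliseconds.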

Answer 2

That's because

x = randn(T, N);

is not of gpuArray type.

All you need to do is:

x = randn(T, N,'gpuArray');

Answer 1: score 101, accepted, 2012-09-21 08:33:59

Answer 2: score -7, 2014-08-12 10:39:41
