Why is numpy slower than a for loop
I have a function using some for loops and I wanted to improve the speed using numpy. But this seems not to do the trick, as the numpy version appears to be about 2 times slower. Here is the code:
import numpy as np
import itertools
import timeit

def func():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    for i in range(n_sample):
        prod = 1
        for k in range(dim):
            sub = np.abs(sample[i, k] - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod

    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * np.abs(sample[i, k] - 0.5)
            b = 0.5 * np.abs(sample[j, k] - 0.5)
            c = 0.5 * np.abs(sample[i, k] - sample[j, k])
            prod *= 1 + a + b - c
        disc2 += prod

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
def func_numpy():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5) - 0.5 * np.abs(sample - 0.5) ** 2, axis=1))

    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2

print('Normal function time: ', timeit.repeat('func()', number=20, repeat=5, setup="from __main__ import func"))
print('numpy function time: ', timeit.repeat('func_numpy()', number=20, repeat=5, setup="from __main__ import func_numpy"))
The timing output is:
Normal function time: [2.831496894999873, 2.832342429959681, 2.8009242500411347, 2.8075121529982425, 2.824807019031141]
numpy function time: [5.154757721000351, 5.2011515340418555, 5.148996959964279, 5.095560318033677, 5.125199959962629]
What am I missing here? I know that the bottleneck is the itertools part, because I have a 100x100x2 loop instead of a 100x2 loop. Do you see another way to do that?
With NumPy, one must look to vectorize things, and we could certainly do so here.

Taking a closer look at the loop portion, we are iterating along the first axis of the input data sample twice with this loop:
for i, j in itertools.product(range(n_sample), range(n_sample)):
We could convert these iterations into vectorized operations, once we let broadcasting handle those.
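As a quick illustration of the broadcasting step (the array values here are made up): subtracting a column view si[:, None] of shape (N, 1) from the flat si of shape (N,) broadcasts to an (N, N) matrix of every pairwise difference, which is exactly what the i, j double loop produces one element at a time.

```python
import numpy as np

si = np.array([0.1, 0.4, 0.9])   # one column of the sample, shape (3,)
pairwise = si[:, None] - si      # (3, 1) - (3,) broadcasts to (3, 3)
# pairwise[i, j] == si[i] - si[j] for every pair (i, j)
print(pairwise.shape)            # -> (3, 3)
```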
Now, to have a fully vectorized solution, we would need a lot more memory, specifically (N, N, M), where (N, M) is the shape of the input data.
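For comparison, here is a sketch of that memory-hungry fully vectorized route (not part of the original answer): build the entire (N, N, M) array of terms at once and reduce over the last axis.

```python
import numpy as np

sample = np.random.random_sample((100, 2))                 # (N, M)
a = 0.5 * np.abs(sample - 0.5)                             # (N, M)
# sample[:, None, :] - sample[None, :, :] broadcasts to (N, N, M)
c = 0.5 * np.abs(sample[:, None, :] - sample[None, :, :])
disc2 = np.prod(1 + a[:, None, :] + a[None, :, :] - c, axis=2).sum()
```

This gives the same disc2 as the double loop, but it allocates N * N * M floats at once, which is why the approach below loops over the small M axis instead.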
Another noticeable aspect here is that at each iteration we aren't doing a lot of work, as we are performing an operation on each row, and each row contains only 2 elements for the given sample. So the idea that comes out is that we could run a loop along M, such that at each iteration we compute the prod and accumulate. Thus, for the given sample, it's just two loop iterations.
Getting out of the loop, we would have the accumulated prod, which just needs a summation for disc2 as the final output.
Here's an implementation to fulfil the above-mentioned ideas -
prod_arr = 1
for i in range(sample.shape[1]):
    si = sample[:, i]
    prod_arr *= 1 + 0.5 * np.abs(si[:, None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                0.5 * np.abs(si[:, None] - si)
disc2 = prod_arr.sum()
Runtime test
The stripped-down version of the loopy portion from the original approach and the modified version are listed below:
def org_app(sample):
    disc2 = 0
    n_sample = len(sample)
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * \
                 np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
    return disc2

def mod_app(sample):
    prod_arr = 1
    for i in range(sample.shape[1]):
        si = sample[:, i]
        prod_arr *= 1 + 0.5 * np.abs(si[:, None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                    0.5 * np.abs(si[:, None] - si)
    disc2 = prod_arr.sum()
    return disc2
Timings and verification -
In [10]: sample = np.random.random_sample((100, 2))
In [11]: org_app(sample)
Out[11]: 11934.878683659041
In [12]: mod_app(sample)
Out[12]: 11934.878683659068
In [14]: %timeit org_app(sample)
10 loops, best of 3: 84.4 ms per loop
In [15]: %timeit mod_app(sample)
10000 loops, best of 3: 94.6 µs per loop
About 900x speedup! Well, this should be motivating enough to look to vectorize things whenever possible.
As I mentioned in the comments, your solutions are not really optimal, and it doesn't really make sense to compare non-ideal approaches.
For one thing, iterating over or indexing single elements of a NumPy array is really slow. I recently answered a question including a lot of details (if you're interested you might have a look at it: "convert np array to a set takes too long"). So the Python approach could be faster simply by converting the array to a list:
def func():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    n_sample = len(sample)
    dim = sample.shape[1]
    sample = sample.tolist()  # converted to list

    for i in range(n_sample):
        prod = 1
        for item in sample[i]:
            sub = abs(item - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod

    disc2 = 0
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * abs(sample[i][k] - 0.5)
            b = 0.5 * abs(sample[j][k] - 0.5)
            c = 0.5 * abs(sample[i][k] - sample[j][k])
            prod *= 1 + a + b - c
        disc2 += prod

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
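The indexing overhead is easy to measure in isolation (exact timings vary by machine, so none are claimed here): every arr[i] on a NumPy array constructs a new NumPy scalar object, while lst[i] on a list just returns an existing Python float.

```python
import numpy as np
import timeit

arr = np.random.random_sample(100)
lst = arr.tolist()

# Same loop over 100 elements, once against the array and once against the list.
t_arr = timeit.timeit(lambda: [arr[i] for i in range(100)], number=2000)
t_lst = timeit.timeit(lambda: [lst[i] for i in range(100)], number=2000)
print(f"array indexing: {t_arr:.4f}s, list indexing: {t_lst:.4f}s")
```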
I also replaced the np.abs calls with the built-in abs, which has lower overhead, and changed some other parts. In the end this is more than 10-20 times faster than your original "normal" approach.
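The abs vs np.abs point can be checked the same way: on a single Python float, np.abs pays ufunc dispatch overhead that the built-in avoids (a sketch; the exact ratio depends on your NumPy version).

```python
import numpy as np
import timeit

x = 0.3
t_builtin = timeit.timeit(lambda: abs(x - 0.5), number=100_000)
t_numpy = timeit.timeit(lambda: np.abs(x - 0.5), number=100_000)
print(f"abs: {t_builtin:.4f}s, np.abs: {t_numpy:.4f}s")
```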
I didn't have time to check the NumPy approach yet, and @Divakar already included a really good and optimized approach. Comparing the two approaches:
def func_numpy():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    disc1 = np.sum(np.prod(1 +
                           0.5 * np.abs(sample - 0.5) -
                           0.5 * np.abs(sample - 0.5) ** 2,
                           axis=1))

    prod_arr = 1
    for i in range(sample.shape[1]):
        s0 = sample[:, i]
        prod_arr *= (1 +
                     0.5 * np.abs(s0[:, None] - 0.5) +
                     0.5 * np.abs(s0 - 0.5) -
                     0.5 * np.abs(s0[:, None] - s0))
    disc2 = prod_arr.sum()

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2

print('Normal function time: ',
      timeit.repeat('func()', number=20, repeat=3, setup="from __main__ import func"))
# Normal function time: [1.4846746248249474, 1.5018398493266432, 1.5476674017127152]
print('numpy function time: ',
      timeit.repeat('func_numpy()', number=20, repeat=3, setup="from __main__ import func_numpy"))
# numpy function time: [0.020140038561976326, 0.016502230831292763, 0.016452520269695015]
So an optimized NumPy approach can definitely beat an "optimized" Python approach; it's almost 100 times faster. In case you want it even faster, you could use numba on a slightly modified version of the pure Python code:
import numba as nb

@nb.njit
def func_numba():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    for i in range(n_sample):
        prod = 1
        for item in sample[i]:
            sub = abs(item - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod

    disc2 = 0
    for i in range(n_sample):
        for j in range(n_sample):
            prod = 1
            for k in range(dim):
                a = 0.5 * abs(sample[i, k] - 0.5)
                b = 0.5 * abs(sample[j, k] - 0.5)
                c = 0.5 * abs(sample[i, k] - sample[j, k])
                prod *= 1 + a + b - c
            disc2 += prod

    return (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2

func_numba()  # call once so compilation isn't included in the timing

print('numba function time: ',
      timeit.repeat('func_numba()', number=20, repeat=3, setup="from __main__ import func_numba"))
# numba function time: [0.003022848984983284, 0.0030429566279508435, 0.004060626777572907]
That's almost a factor of 8-10 faster than the NumPy approach.