Why is numpy slower than a for loop
I have a function using some for loops and I wanted to improve the speed using numpy. But this seems not to do the trick, as the numpy version appears to be about 2 times slower. Here is the code:
import numpy as np
import itertools
import timeit

def func():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    for i in range(n_sample):
        prod = 1
        for k in range(dim):
            sub = np.abs(sample[i, k] - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod

    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * np.abs(sample[i, k] - 0.5)
            b = 0.5 * np.abs(sample[j, k] - 0.5)
            c = 0.5 * np.abs(sample[i, k] - sample[j, k])
            prod *= 1 + a + b - c
        disc2 += prod

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
def func_numpy():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5) - 0.5 * np.abs(sample - 0.5) ** 2, axis=1))

    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2

print('Normal function time: ', timeit.repeat('func()', number=20, repeat=5, setup="from __main__ import func"))
print('numpy function time: ', timeit.repeat('func_numpy()', number=20, repeat=5, setup="from __main__ import func_numpy"))
The timing output is:
Normal function time: [2.831496894999873, 2.832342429959681, 2.8009242500411347, 2.8075121529982425, 2.824807019031141]
numpy function time: [5.154757721000351, 5.2011515340418555, 5.148996959964279, 5.095560318033677, 5.125199959962629]
What am I missing here? I know that the bottleneck is the itertools part, because I have a 100x100x2 loop instead of a 100x2 loop. Do you see another way to do that?
With NumPy, one must look to vectorize things, and we could certainly do so here.

Taking a closer look at the loop portion, we are iterating along the first axis of the input data sample twice with this loop:
for i, j in itertools.product(range(n_sample), range(n_sample)):
We could convert these iterations into vectorized operations, once we let broadcasting handle those.
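As a quick illustration of the broadcasting step (the array values here are made up): subtracting a column view si[:, None] of shape (N, 1) from the flat si of shape (N,) broadcasts to an (N, N) matrix of every pairwise difference, which is exactly what the i, j double loop produces one element at a time.

```python
import numpy as np

si = np.array([0.1, 0.4, 0.9])   # one column of the sample, shape (3,)
pairwise = si[:, None] - si      # (3, 1) - (3,) broadcasts to (3, 3)
# pairwise[i, j] == si[i] - si[j] for every pair (i, j)
print(pairwise.shape)            # -> (3, 3)
```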
Now, to have a fully vectorized solution, we would need a lot more memory, specifically (N, N, M), where (N, M) is the shape of the input data.
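For comparison, here is a sketch of that memory-hungry fully vectorized route (not part of the original answer): build the entire (N, N, M) array of terms at once and reduce over the last axis.

```python
import numpy as np

sample = np.random.random_sample((100, 2))                 # (N, M)
a = 0.5 * np.abs(sample - 0.5)                             # (N, M)
# sample[:, None, :] - sample[None, :, :] broadcasts to (N, N, M)
c = 0.5 * np.abs(sample[:, None, :] - sample[None, :, :])
disc2 = np.prod(1 + a[:, None, :] + a[None, :, :] - c, axis=2).sum()
```

This gives the same disc2 as the double loop, but it allocates N * N * M floats at once, which is why the approach below loops over the small M axis instead.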
Another noticeable aspect here is that at each iteration we aren't doing a lot of work, as we are performing an operation on each row, and each row contains only 2 elements for the given sample. So the idea that comes out is that we could run a loop along M, such that at each iteration we compute the prod and accumulate. Thus, for the given sample, it's just two loop iterations.
Getting out of the loop, we would have the accumulated prod, which just needs a summation for disc2 as the final output.
Here's an implementation to fulfil the above-mentioned ideas -
prod_arr = 1
for i in range(sample.shape[1]):
    si = sample[:, i]
    prod_arr *= 1 + 0.5 * np.abs(si[:, None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                0.5 * np.abs(si[:, None] - si)
disc2 = prod_arr.sum()
Runtime test
The stripped-down version of the loopy portion from the original approach and the modified version are listed below:
def org_app(sample):
    disc2 = 0
    n_sample = len(sample)
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * \
                 np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
    return disc2

def mod_app(sample):
    prod_arr = 1
    for i in range(sample.shape[1]):
        si = sample[:, i]
        prod_arr *= 1 + 0.5 * np.abs(si[:, None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                    0.5 * np.abs(si[:, None] - si)
    disc2 = prod_arr.sum()
    return disc2
Timings and verification -
In [10]: sample = np.random.random_sample((100, 2))
In [11]: org_app(sample)
Out[11]: 11934.878683659041
In [12]: mod_app(sample)
Out[12]: 11934.878683659068
In [14]: %timeit org_app(sample)
10 loops, best of 3: 84.4 ms per loop
In [15]: %timeit mod_app(sample)
10000 loops, best of 3: 94.6 µs per loop
About 900x speedup! Well, this should be motivating enough to look to vectorize things whenever possible.
As I mentioned in the comments, your solutions are not really optimal, and it doesn't really make sense to compare non-ideal approaches.
For one thing, iterating over or indexing single elements of a NumPy array is really slow. I recently answered a question including a lot of details (if you're interested you might have a look at it: "convert np array to a set takes too long"). So the Python approach could be faster simply by converting the array to a list:
def func():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    n_sample = len(sample)
    dim = sample.shape[1]
    sample = sample.tolist()  # converted to list

    for i in range(n_sample):
        prod = 1
        for item in sample[i]:
            sub = abs(item - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod

    disc2 = 0
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * abs(sample[i][k] - 0.5)
            b = 0.5 * abs(sample[j][k] - 0.5)
            c = 0.5 * abs(sample[i][k] - sample[j][k])
            prod *= 1 + a + b - c
        disc2 += prod

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
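The indexing overhead is easy to measure in isolation (exact timings vary by machine, so none are claimed here): every arr[i] on a NumPy array constructs a new NumPy scalar object, while lst[i] on a list just returns an existing Python float.

```python
import numpy as np
import timeit

arr = np.random.random_sample(100)
lst = arr.tolist()

# Same loop over 100 elements, once against the array and once against the list.
t_arr = timeit.timeit(lambda: [arr[i] for i in range(100)], number=2000)
t_lst = timeit.timeit(lambda: [lst[i] for i in range(100)], number=2000)
print(f"array indexing: {t_arr:.4f}s, list indexing: {t_lst:.4f}s")
```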
I also replaced the np.abs calls with the built-in abs, which has lower overhead, and changed some other parts. In the end this is more than 10-20 times faster than your original "normal" approach.
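The abs vs np.abs point can be checked the same way: on a single Python float, np.abs pays ufunc dispatch overhead that the built-in avoids (a sketch; the exact ratio depends on your NumPy version).

```python
import numpy as np
import timeit

x = 0.3
t_builtin = timeit.timeit(lambda: abs(x - 0.5), number=100_000)
t_numpy = timeit.timeit(lambda: np.abs(x - 0.5), number=100_000)
print(f"abs: {t_builtin:.4f}s, np.abs: {t_numpy:.4f}s")
```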
I didn't have time to check the NumPy approach yet, and @Divakar already included a really good and optimized approach. Comparing the two approaches:
def func_numpy():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    disc1 = np.sum(np.prod(1 +
                           0.5 * np.abs(sample - 0.5) -
                           0.5 * np.abs(sample - 0.5) ** 2,
                           axis=1))

    prod_arr = 1
    for i in range(sample.shape[1]):
        s0 = sample[:, i]
        prod_arr *= (1 +
                     0.5 * np.abs(s0[:, None] - 0.5) +
                     0.5 * np.abs(s0 - 0.5) -
                     0.5 * np.abs(s0[:, None] - s0))
    disc2 = prod_arr.sum()

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2

print('Normal function time: ',
      timeit.repeat('func()', number=20, repeat=3, setup="from __main__ import func"))
# Normal function time: [1.4846746248249474, 1.5018398493266432, 1.5476674017127152]
print('numpy function time: ',
      timeit.repeat('func_numpy()', number=20, repeat=3, setup="from __main__ import func_numpy"))
# numpy function time: [0.020140038561976326, 0.016502230831292763, 0.016452520269695015]
So an optimized NumPy approach can definitely beat an "optimized" Python approach; it's almost 100 times faster. In case you want it even faster, you could use numba on a slightly modified version of the pure Python code:
import numba as nb

@nb.njit
def func_numba():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    for i in range(n_sample):
        prod = 1
        for item in sample[i]:
            sub = abs(item - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod

    disc2 = 0
    for i in range(n_sample):
        for j in range(n_sample):
            prod = 1
            for k in range(dim):
                a = 0.5 * abs(sample[i, k] - 0.5)
                b = 0.5 * abs(sample[j, k] - 0.5)
                c = 0.5 * abs(sample[i, k] - sample[j, k])
                prod *= 1 + a + b - c
            disc2 += prod

    return (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2

func_numba()  # call once so compilation isn't included in the timing

print('numba function time: ',
      timeit.repeat('func_numba()', number=20, repeat=3, setup="from __main__ import func_numba"))
# numba function time: [0.003022848984983284, 0.0030429566279508435, 0.004060626777572907]
That's almost a factor of 8-10 faster than the NumPy approach.