Performance of different vectorization methods in NumPy
I wanted to test the performance of vectorizing code in Python:
import timeit
import numpy as np

def func1():
    x = np.arange(1000)
    sum = np.sum(x*2)
    return sum

def func2():
    sum = 0
    for i in xrange(1000):
        sum += i*2
    return sum

def func3():
    sum = 0
    for i in xrange(0, 1000, 4):
        x = np.arange(i, i+4, 1)
        sum += np.sum(x*2)
    return sum

print timeit.timeit(func1, number=1000)
print timeit.timeit(func2, number=1000)
print timeit.timeit(func3, number=1000)
The code gives the following output:
0.0105729103088
0.069864988327
0.983253955841
The performance difference between the first and second functions is not surprising. But I was surprised that the 3rd function is significantly slower than the others.

I am much more familiar with vectorizing code in C than in Python, and the 3rd function is more C-like: running a for loop and processing 4 numbers in one instruction per iteration. To my understanding, NumPy calls a C function and then vectorizes the code in C. So if that is the case, my code is also passing 4 numbers to NumPy at a time, and the code shouldn't perform better just because I pass more numbers at once. So why is it so much slower? Is it because of the overhead of calling a NumPy function?
Besides, the reason I came up with the 3rd function in the first place is that I'm worried about the performance cost of the large memory allocation to x in func1.
Is my worry valid? Why, and how can I improve it (or why not)?

Thanks in advance.
Edit:

For curiosity's sake, although it defeats my original purpose for creating the 3rd version, I looked into roganjosh's suggestion and tried the following edit.
def func3():
    sum = 0
    x = np.arange(0, 1000)
    for i in xrange(0, 1000, 4):
        sum += np.sum(x[i:i+4]*2)
    return sum
The output:
0.0104308128357
0.0630609989166
0.748773813248
There is an improvement, but still a large gap compared with the other functions. Is it because x[i:i+4] still creates a new array?
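(For reference: the slice itself is a view, but the multiplication does allocate a new array. This can be checked with np.shares_memory; the snippet below is an illustrative Python 3 sketch, not part of the original question.)

```python
import numpy as np

x = np.arange(1000)
view = x[0:4]                     # basic slicing returns a view...
print(np.shares_memory(x, view))  # True: no data was copied

prod = view * 2                   # ...but arithmetic allocates a new array
print(np.shares_memory(x, prod))  # False: fresh memory for the result
```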
Edit 2:

I've modified the code again according to Daniel's suggestion.
def func1():
    x = np.arange(1000)
    x *= 2
    return x.sum()

def func3():
    sum = 0
    x = np.arange(0, 1000)
    for i in xrange(0, 1000, 4):
        x[i:i+4] *= 2
        sum += x[i:i+4].sum()
    return sum
The output:
0.00824999809265
0.0660569667816
0.598328828812
There is another speedup, so the creation of NumPy arrays is definitely part of the problem. Now there should be only one array creation in func3, yet it is still much slower. Is it because of the overhead of the repeated NumPy calls?
It seems you're mostly interested in the difference between your function 3 and the pure-NumPy (function 1) and pure-Python (function 2) approaches. The answer is quite simple (especially if you look at function 4):

You typically need several thousand elements to get into the regime where the runtime of np.sum actually depends on the number of elements in the array. Using IPython and matplotlib (the plot is at the end of the answer) you can easily check the runtime dependency:
import numpy as np

n = []
timing_sum1 = []
timing_sum2 = []
for i in range(1, 25):
    num = 2**i
    arr = np.arange(num)
    print(num)
    time1 = %timeit -o arr.sum()   # calling the method
    time2 = %timeit -o np.sum(arr) # calling the function
    n.append(num)
    timing_sum1.append(time1)
    timing_sum2.append(time2)
The results for np.sum (shortened) are quite interesting:
4
22.6 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16
25.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
64
25.3 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256
24.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1024
24.6 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4096
27.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16384
40.6 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65536
91.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
262144
394 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1048576
1.24 ms ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4194304
4.71 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16777216
18.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It seems the constant factor is roughly 20µs (on my computer), and it takes an array with about 16384 elements to double that time. So the timings for functions 3 and 4 are mostly multiples of that constant factor.
In function 3 you pay the constant factor twice, once with np.sum and once with np.arange. In this case arange is quite cheap because each array is the same size, so NumPy & Python & your OS probably reuse the memory of the array from the last iteration. However, even that takes time (roughly 2µs for very small arrays on my computer).
More generally: to identify bottlenecks you should always profile the functions!

I'll show the results for the functions with line-profiler. Therefore I altered the functions a bit so they only do one operation per line:
import numpy as np

def func1():
    x = np.arange(1000)
    x = x*2
    return np.sum(x)

def func2():
    sum_ = 0
    for i in range(1000):
        tmp = i*2
        sum_ += tmp
    return sum_

def func3():
    sum_ = 0
    for i in range(0, 1000, 4):  # I'm using Python 3, so "range" is like "xrange"!
        x = np.arange(i, i + 4, 1)
        x = x * 2
        tmp = np.sum(x)
        sum_ += tmp
    return sum_

def func4():
    sum_ = 0
    x = np.arange(1000)
    for i in range(0, 1000, 4):
        y = x[i:i + 4]
        y = y * 2
        tmp = np.sum(y)
        sum_ += tmp
    return sum_
Results:
%load_ext line_profiler
%lprun -f func1 func1()
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 def func1():
5 1 62 62.0 23.8 x = np.arange(1000)
6 1 65 65.0 24.9 x = x*2
7 1 134 134.0 51.3 return np.sum(x)
%lprun -f func2 func2()
Line # Hits Time Per Hit % Time Line Contents
==============================================================
9 def func2():
10 1 7 7.0 0.1 sum_ = 0
11 1001 2523 2.5 30.9 for i in range(1000):
12 1000 2819 2.8 34.5 tmp = i*2
13 1000 2819 2.8 34.5 sum_ += tmp
14 1 3 3.0 0.0 return sum_
%lprun -f func3 func3()
Line # Hits Time Per Hit % Time Line Contents
==============================================================
16 def func3():
17 1 7 7.0 0.0 sum_ = 0
18 251 909 3.6 2.9 for i in range(0, 1000, 4):
19 250 6527 26.1 21.2 x = np.arange(i, i + 4, 1)
20 250 5615 22.5 18.2 x = x * 2
21 250 16053 64.2 52.1 tmp = np.sum(x)
22 250 1720 6.9 5.6 sum_ += tmp
23 1 3 3.0 0.0 return sum_
%lprun -f func4 func4()
Line # Hits Time Per Hit % Time Line Contents
==============================================================
25 def func4():
26 1 7 7.0 0.0 sum_ = 0
27 1 49 49.0 0.2 x = np.arange(1000)
28 251 892 3.6 3.4 for i in range(0, 1000, 4):
29 250 2177 8.7 8.3 y = x[i:i + 4]
30 250 5431 21.7 20.7 y = y * 2
31 250 15990 64.0 60.9 tmp = np.sum(y)
32 250 1686 6.7 6.4 sum_ += tmp
33 1 3 3.0 0.0 return sum_
I won't go into the details of the results, but as you can see np.sum is definitely the bottleneck in func3 and func4. I had already guessed that np.sum was the bottleneck before writing this answer, but these line profilings actually verify it.
Which leads to a very important fact when using NumPy:
If you really believe some part is too slow, there are tools you can use. But generally you probably can't beat NumPy for moderately sized arrays (several thousand entries and more).
%matplotlib notebook
import matplotlib.pyplot as plt
# Average time per sum-call
fig = plt.figure(1)
ax = plt.subplot(111)
ax.plot(n, [time.average for time in timing_sum1], label='arr.sum()', c='red')
ax.plot(n, [time.average for time in timing_sum2], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time it takes to sum them [seconds]')
ax.grid(which='both')
ax.legend()
# Average time per element
fig = plt.figure(1)
ax = plt.subplot(111)
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum1)], label='arr.sum()', c='red')
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum2)], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time per element [seconds / element]')
ax.grid(which='both')
ax.legend()
The plots are log-log; I think that was the best way to visualize the data given that it extends over several orders of magnitude (I just hope it's still understandable).
The first plot shows how much time it takes to do the sum:
The second plot shows the average time it takes to do the sum divided by the number of elements in the array. This is just another way to interpret the data:
Based on the tests (shown next), it seems you are being beaten by the function-call overhead. To exploit the vectorized capability of NumPy functions/tools, we need to give them enough data to crunch. With func3, we are giving np.sum just 4 elements per call.
Let's investigate the per-call overhead for np.sum. Here's np.sum starting with a sum of zero elements, then one element, and onwards:
In [90]: a = np.array([])
In [91]: %timeit np.sum(a)
1000000 loops, best of 3: 1.6 µs per loop
In [61]: a = np.array([0])
In [62]: %timeit np.sum(a)
1000000 loops, best of 3: 1.66 µs per loop
In [63]: a = np.random.randint(0,9,(100))
In [64]: %timeit np.sum(a)
100000 loops, best of 3: 1.79 µs per loop
In [65]: a = np.random.randint(0,9,(1000))
In [66]: %timeit np.sum(a)
100000 loops, best of 3: 2.25 µs per loop
In [67]: a = np.random.randint(0,9,(10000))
In [68]: %timeit np.sum(a)
100000 loops, best of 3: 7.27 µs per loop
and so on.
Thus, we would incur a minimum of around 1.6 µs per call to np.sum on the system setup used for these tests.
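To see how much that fixed cost matters here, one can compare many tiny np.sum calls against a single call over the whole array; both compute the same value, but the chunked version pays the per-call overhead 250 times (an illustrative Python 3 sketch, not from the original answer; absolute timings will vary by machine):

```python
import timeit
import numpy as np

x = np.arange(1000)

def many_small_sums():
    # 250 separate np.sum calls, each paying the fixed per-call overhead
    return sum(int(np.sum(x[i:i+4] * 2)) for i in range(0, 1000, 4))

def one_big_sum():
    # a single np.sum call amortizes the overhead over all 1000 elements
    return int(np.sum(x * 2))

assert many_small_sums() == one_big_sum() == 999000
print(timeit.timeit(many_small_sums, number=1000))
print(timeit.timeit(one_big_sum, number=1000))
```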
Let's see how scalar addition with the addition operator performs:
In [98]: def add_nums(a,b):
    ...:     return a+b
    ...:
In [99]: %timeit add_nums(2,3)
10000000 loops, best of 3: 71.5 ns per loop
This is about 25x faster than the per-call overhead of np.sum.
The obvious next idea is to test how func3 performs when np.sum is given more data to crunch per call.
Here is func3 (the version that uses slicing) modified to use a variable data size for the per-iteration sum:
def func3(scale_factor = 4):
    sum1 = 0
    x = np.arange(0, 1000)
    for i in xrange(0, 1000, scale_factor):
        sum1 += np.sum(x[i:i+scale_factor]*2)
    return sum1
Starting off with scale_factor = 4, as used originally:
In [83]: %timeit func1()
100000 loops, best of 3: 5.39 µs per loop
In [84]: %timeit func2()
10000 loops, best of 3: 39.8 µs per loop
In [85]: %timeit func3(scale_factor = 4)
1000 loops, best of 3: 741 µs per loop
Yes, func3 is slow.
Now, let's give more data per call to np.sum, i.e. increase scale_factor:
In [86]: %timeit func3(scale_factor = 8)
1000 loops, best of 3: 376 µs per loop
In [87]: %timeit func3(scale_factor = 20)
10000 loops, best of 3: 152 µs per loop
In [88]: %timeit func3(scale_factor = 100)
10000 loops, best of 3: 33.5 µs per loop
and so on, until we feed the entire data to np.sum in a single call, for the maximum performance from np.sum at minimum call overhead.
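Note that scale_factor only changes how many np.sum calls the work is split across, never the result; at scale_factor = 1000 the loop collapses into one call, which is essentially func1 again (a Python 3 consistency-check sketch, not part of the original answer):

```python
import numpy as np

def func3(scale_factor=4):
    sum1 = 0
    x = np.arange(0, 1000)
    for i in range(0, 1000, scale_factor):
        sum1 += int(np.sum(x[i:i+scale_factor] * 2))
    return sum1

# every chunk size yields the same value; only the number of
# np.sum calls (and hence the total call overhead) differs
assert func3(4) == func3(100) == func3(1000) == 999000
```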
First of all, nobody would write the third variant in C, because the compiler should do the necessary optimizations.
So take the first one: you have two creations of NumPy arrays (arange and *2) and one summation. Creating complex objects like NumPy arrays takes some time, but each vector operation is written in C code and is very fast.
The second one only uses primitive Python operations (about 3000 of them: iteration, multiplication and summation), which are implemented in C and are very fast.
In the third variant you create about 2 * 250 NumPy arrays (a comparatively slow operation), which leads to a 100-times-slower execution compared to creating only 2 NumPy arrays.
If you have concerns about memory usage, you should use in-place operations, which create only one array:
def func1():
    x = np.arange(1000)
    x *= 2
    return x.sum()
If you still use too much memory, divide your operations into chunks that are as large as possible.
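As a sketch of that chunking idea (the chunk size and function name here are illustrative, not from the answer): process the data in fixed-size blocks with in-place operations, so peak memory stays bounded by the chunk size while the result is unchanged:

```python
import numpy as np

def chunked_double_sum(n, chunk=10000):
    # never materialize more than `chunk` elements at once;
    # the in-place *= avoids allocating a second temporary array
    total = 0
    for start in range(0, n, chunk):
        x = np.arange(start, min(start + chunk, n))
        x *= 2
        total += int(x.sum())
    return total

# same result regardless of chunk size
assert chunked_double_sum(1000, chunk=64) == chunked_double_sum(1000) == 999000
```

Larger chunks mean fewer NumPy calls (less overhead) but more peak memory, so in practice you pick the largest chunk your memory budget allows.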