
Performance of different vectorization methods in NumPy

I wanted to test the performance of vectorizing code in Python:

import timeit
import numpy as np

def func1():
  x = np.arange(1000)
  sum = np.sum(x*2)
  return sum

def func2():
  sum = 0
  for i in xrange(1000):
    sum += i*2
  return sum

def func3():
  sum = 0
  for i in xrange(0,1000,4):
    x = np.arange(i,i+4,1)
    sum += np.sum(x*2)
  return sum

print timeit.timeit(func1, number = 1000)
print timeit.timeit(func2, number = 1000)
print timeit.timeit(func3, number = 1000)

The code gives the following output:

0.0105729103088
0.069864988327
0.983253955841

The performance difference between the first and second functions is not surprising. But I was surprised that the third function is significantly slower than the others.

I am much more familiar with vectorizing code in C than in Python, and the third function is more C-like: running a for loop and processing 4 numbers in one instruction per iteration. To my understanding, NumPy calls a C function and then vectorizes the code in C. So if this is the case, my code is also passing 4 numbers to NumPy at a time. The code shouldn't perform better when I pass more numbers at once. So why is it so much slower? Is it because of the overhead of calling a NumPy function?

Besides, the reason I came up with the third function in the first place is that I'm worried about the performance cost of the large memory allocation for x in func1.

Is my worry valid? If so, why, and how can I improve it? If not, why not?

Thanks in advance.

Edit:

Out of curiosity, although it defeats my original purpose for creating the third version, I have looked into roganjosh's suggestion and tried the following edit:

def func3():
  sum = 0
  x = np.arange(0,1000)
  for i in xrange(0,1000,4):
    sum += np.sum(x[i:i+4]*2)
  return sum

The output:

0.0104308128357
0.0630609989166
0.748773813248

There is an improvement, but still a large gap compared with the other functions.

Is it because x[i:i+4] still creates a new array?

Edit 2:

I've modified the code again according to Daniel's suggestion.

def func1():
  x = np.arange(1000)
  x *= 2
  return x.sum()

def func3():
  sum = 0
  x = np.arange(0,1000)
  for i in xrange(0,1000,4):
    x[i:i+4] *= 2
    sum += x[i:i+4].sum()
  return sum

The output:

0.00824999809265
0.0660569667816
0.598328828812

There is another speedup. So the declaration of NumPy arrays is definitely a problem. Now there should be only one array declaration in func3, and yet the time is still way slower. Is it because of the overhead of calling NumPy functions?

Answer 1:

It seems you're mostly interested in how your function 3 compares to the pure NumPy (function 1) and pure Python (function 2) approaches. The answer is quite simple (especially if you look at function 4 below):

  • NumPy functions have a "huge" constant factor.

You typically need several thousand elements to get into the regime where the runtime of np.sum actually depends on the number of elements in the array. Using IPython and matplotlib (the plots are at the end of this answer) you can easily check the runtime dependency:

import numpy as np

n = []
timing_sum1 = []
timing_sum2 = []
for i in range(1, 25):
    num = 2**i
    arr = np.arange(num)
    print(num)
    time1 = %timeit -o arr.sum()    # calling the method
    time2 = %timeit -o np.sum(arr)  # calling the function
    n.append(num)
    timing_sum1.append(time1)
    timing_sum2.append(time2)

The results for np.sum (shortened) are quite interesting:

4
22.6 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16
25.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
64
25.3 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256
24.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1024
24.6 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4096
27.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16384
40.6 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65536
91.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
262144
394 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1048576
1.24 ms ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4194304
4.71 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16777216
18.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It seems the constant factor is roughly 20 µs on my computer, and it takes an array with 16384 elements to double that time. So the timings for functions 3 and 4 are mostly multiples of that constant factor.

In function 3 you include the constant factor twice, once with np.sum and once with np.arange. In this case arange is quite cheap because each array is the same size, so NumPy, Python and your OS probably reuse the memory of the array from the last iteration. However, even that takes time (roughly 2 µs for very small arrays on my computer).
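As a rough sanity check, a back-of-envelope sketch (assuming ~2 µs per small-array NumPy call; this number is machine-dependent and not from the original measurements):

# func3 does 250 iterations, each with one np.arange and one np.sum call.
calls_per_run = 250 * 2
overhead_per_call = 2e-6                  # assumed ~2 µs per small-array NumPy call
print(calls_per_run * overhead_per_call)  # ~1e-3 s, the same order of magnitude as
                                          # the ~0.98 ms per run from the question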

More generally: to identify bottlenecks you should always profile your functions!

I'll show the results for the functions using line-profiler. For that I altered the functions a bit so they only do one operation per line:

import numpy as np

def func1():
    x = np.arange(1000)
    x = x*2
    return np.sum(x)

def func2():
    sum_ = 0
    for i in range(1000):
        tmp = i*2
        sum_ += tmp
    return sum_

def func3():
    sum_ = 0
    for i in range(0, 1000, 4):  # I'm using python3, so "range" is like "xrange"!
        x = np.arange(i, i + 4, 1)
        x = x * 2
        tmp = np.sum(x)
        sum_ += tmp
    return sum_

def func4():
    sum_ = 0
    x = np.arange(1000)
    for i in range(0, 1000, 4):
        y = x[i:i + 4]
        y = y * 2
        tmp = np.sum(y)
        sum_ += tmp
    return sum_

Results:

%load_ext line_profiler

%lprun -f func1 func1()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def func1():
     5         1           62     62.0     23.8      x = np.arange(1000)
     6         1           65     65.0     24.9      x = x*2
     7         1          134    134.0     51.3      return np.sum(x)

%lprun -f func2 func2()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           def func2():
    10         1            7      7.0      0.1      sum_ = 0
    11      1001         2523      2.5     30.9      for i in range(1000):
    12      1000         2819      2.8     34.5          tmp = i*2
    13      1000         2819      2.8     34.5          sum_ += tmp
    14         1            3      3.0      0.0      return sum_

%lprun -f func3 func3()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           def func3():
    17         1            7      7.0      0.0      sum_ = 0
    18       251          909      3.6      2.9      for i in range(0, 1000, 4):
    19       250         6527     26.1     21.2          x = np.arange(i, i + 4, 1)
    20       250         5615     22.5     18.2          x = x * 2
    21       250        16053     64.2     52.1          tmp = np.sum(x)
    22       250         1720      6.9      5.6          sum_ += tmp
    23         1            3      3.0      0.0      return sum_

%lprun -f func4 func4()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           def func4():
    26         1            7      7.0      0.0      sum_ = 0
    27         1           49     49.0      0.2      x = np.arange(1000)
    28       251          892      3.6      3.4      for i in range(0, 1000, 4):
    29       250         2177      8.7      8.3          y = x[i:i + 4]
    30       250         5431     21.7     20.7          y = y * 2
    31       250        15990     64.0     60.9          tmp = np.sum(y)
    32       250         1686      6.7      6.4          sum_ += tmp
    33         1            3      3.0      0.0      return sum_

I won't go into the details of the results, but as you can see, np.sum is definitely the bottleneck in func3 and func4. I had already guessed that np.sum was the bottleneck before I wrote this answer, but these line profilings actually verify it.

Which leads to two very important points when using NumPy:

  • Know when to use it! Small arrays aren't worth it (mostly); see the sketch after this list.
  • Know the NumPy functions and just use them. They already use (if available) compiler optimization flags to unroll loops.
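For example, a minimal sketch of the small-array regime (exact numbers vary by machine):

import timeit
import numpy as np

small = list(range(10))
arr = np.array(small)
# With only a handful of elements, the plain Python sum typically wins,
# because the NumPy call overhead dominates:
print(timeit.timeit(lambda: sum(small), number=100000))
print(timeit.timeit(lambda: np.sum(arr), number=100000))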

If you really believe some part is too slow then you can use:

  • NumPy's C API, processing the array with C (this can be really easy with Cython, but you can also do it manually)
  • Numba (based on LLVM).

But generally you probably can't beat NumPy for moderately sized arrays (several thousand entries and more).
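For illustration, a minimal Numba sketch of the loop from func2 (this assumes numba is installed; sum_doubled is a made-up name, not from the original answers):

import numba

@numba.njit
def sum_doubled(n):
    # compiled to machine code on first call; the loop then runs at C-like speed
    total = 0
    for i in range(n):
        total += i * 2
    return total

sum_doubled(1000)  # note: the first call includes the compilation time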


Visualization of the timings:

%matplotlib notebook

import matplotlib.pyplot as plt

# Average time per sum-call
fig = plt.figure(1)
ax = plt.subplot(111)
ax.plot(n, [time.average for time in timing_sum1], label='arr.sum()', c='red')
ax.plot(n, [time.average for time in timing_sum2], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time it takes to sum them [seconds]')
ax.grid(which='both')
ax.legend()

# Average time per element
fig = plt.figure(2)  # a separate figure for the second plot
ax = plt.subplot(111)
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum1)], label='arr.sum()', c='red')
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum2)], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time per element [seconds / element]')
ax.grid(which='both')
ax.legend()

The plots are log-log; I think that was the best way to visualize the data, given that it extends over several orders of magnitude (I just hope it's still understandable).

The first plot shows how much time it takes to do the sum:

[Plot: total time of the sum vs. number of elements, log-log]

The second plot shows the average time it takes to do the sum, divided by the number of elements in the array. This is just another way to interpret the data:

[Plot: time per element vs. number of elements, log-log]

Answer 2:

Based on the tests (shown next), it seems you are beaten by the function call overhead. Along with the vectorized capability of NumPy functions/tools, we need to give them enough data to crunch. With func3, we are giving np.sum just 4 elements per call.

Let's investigate the per-call overhead for np.sum. Here is np.sum, starting with summing zero elements, then one element, and onwards:

In [90]: a = np.array([])

In [91]: %timeit np.sum(a)
1000000 loops, best of 3: 1.6 µs per loop

In [61]: a = np.array([0])

In [62]: %timeit np.sum(a)
1000000 loops, best of 3: 1.66 µs per loop

In [63]: a = np.random.randint(0,9,(100))

In [64]: %timeit np.sum(a)
100000 loops, best of 3: 1.79 µs per loop

In [65]: a = np.random.randint(0,9,(1000))

In [66]: %timeit np.sum(a)
100000 loops, best of 3: 2.25 µs per loop

In [67]: a = np.random.randint(0,9,(10000))

In [68]: %timeit np.sum(a)
100000 loops, best of 3: 7.27 µs per loop

and so on.

Thus, we would incur a minimum of around 1.6 µs per call to np.sum on the system setup used for these tests.

Let's see how scalar addition with the addition operator performs:

In [98]: def add_nums(a,b):
    ...:     return a+b
    ...: 

In [99]: %timeit add_nums(2,3)
10000000 loops, best of 3: 71.5 ns per loop

This is about 25x faster than the per-call overhead of np.sum.
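Putting the two measurements together, a rough sketch (both numbers are from the tests above and are machine-specific):

per_numpy_call = 1.6e-6    # minimum np.sum overhead measured above
per_scalar_add = 71.5e-9   # scalar a+b through a Python function, measured above
chunk = 4
# Even 4 scalar additions are several times cheaper than one tiny np.sum call:
print(per_numpy_call / (chunk * per_scalar_add))  # ~5.6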

The obvious next idea is to test how func3 performs when np.sum is given more data to crunch per call.

Here is func3 (the version that uses slicing), modified to have a variable data size for the per-iteration summing:

def func3(scale_factor = 4):
    sum1 = 0
    x = np.arange(0,1000)
    for i in xrange(0,1000,scale_factor):
        sum1 += np.sum(x[i:i+scale_factor]*2)
    return sum1

Starting off with scale_factor = 4, as used originally:

In [83]: %timeit func1()
100000 loops, best of 3: 5.39 µs per loop

In [84]: %timeit func2()
10000 loops, best of 3: 39.8 µs per loop

In [85]: %timeit func3(scale_factor = 4)
1000 loops, best of 3: 741 µs per loop

Yes, func3 is slow.

Now, let's give more data per call to np.sum, i.e. increase scale_factor:

In [86]: %timeit func3(scale_factor = 8)
1000 loops, best of 3: 376 µs per loop

In [87]: %timeit func3(scale_factor = 20)
10000 loops, best of 3: 152 µs per loop

In [88]: %timeit func3(scale_factor = 100)
10000 loops, best of 3: 33.5 µs per loop

and so on, until we feed the entire data to np.sum in a single call, for the maximum performance with np.sum and the minimum call overhead.
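In the limit, the loop body runs once and func3 essentially degenerates into func1 (a sketch, not a measurement from the original tests):

%timeit func3(scale_factor = 1000)  # one slice, one np.sum call over all 1000 elements;
                                    # expected to approach func1's timing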

Answer 3:

First of all, nobody would write the third variant in C, because the compiler should do the necessary optimizations.

So take the first one: you have two creations of NumPy arrays (arange and *2) and one summation. Creating complex objects like NumPy arrays takes some time, but each vector operation is written in C code and is very fast.

The second one only uses primitive Python operations (about 3000: iteration, multiplication and summation), which are written in C and very fast.

In the third variant you create about 2 * 250 NumPy arrays (a comparably slow operation), which leads to roughly 100 times slower execution compared to creating only 2 NumPy arrays.
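To see where those allocations come from, a small sketch (note that basic slicing itself returns a view; it is the arithmetic that allocates):

import numpy as np

x = np.arange(1000)
view = x[0:4]            # basic slicing returns a view, no new data buffer
print(view.base is x)    # True
doubled = view * 2       # the multiplication allocates a fresh array
print(doubled.base)      # None: an independent allocation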

If you have concerns about memory usage, you should use in-place operations, which create only one array:

x = np.arange(1000)
x *= 2
return x.sum()

If you still use too much memory, divide your operations into chunks that are as large as possible.
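A minimal sketch of that chunking idea (chunked_doubled_sum and the chunk size are illustrative choices, not from the original answer):

import numpy as np

def chunked_doubled_sum(n, chunk=65536):
    total = 0
    for start in range(0, n, chunk):
        x = np.arange(start, min(start + chunk, n))
        x *= 2            # in-place: no second array per chunk
        total += x.sum()
    return total

print(chunked_doubled_sum(1000))  # 999000, same result as func1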
