
performance loss after vectorization in numpy

I am writing a time-consuming program. To reduce the time, I have tried my best to use numpy.dot instead of for loops.

However, I found the vectorized version to have much worse performance than the for loop version:

import numpy as np
import datetime
kpt_list = np.zeros((10000,20),dtype='float')
rpt_list = np.zeros((1000,20),dtype='float')
h_r = np.zeros((20,20,1000),dtype='complex')
r_ndegen = np.zeros(1000,dtype='float')
r_ndegen.fill(1)
# setup completed
# this is the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))/r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 19.302483
# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
    kpt = kpt_list[i, :]
    phase = np.exp(1j * np.dot(kpt, rpt_list.T))/r_ndegen
    kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 7.74583

What is happening here?

The first thing I suggest you do is break your script down into separate functions to make profiling and debugging easier:

def setup(n1=10000, n2=1000, n3=20, seed=None):

    gen = np.random.RandomState(seed)
    kpt_list = gen.randn(n1, n3).astype(np.float64)
    rpt_list = gen.randn(n2, n3).astype(np.float64)
    h_r = (gen.randn(n3, n3, n2) + 1j * gen.randn(n3, n3, n2)).astype(np.complex128)
    r_ndegen = gen.randn(n2).astype(np.float64)

    return kpt_list, rpt_list, h_r, r_ndegen


def original_vec(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data = h_r.dot(phase)

    return kpt_data


def original_loop(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    kpt_data = np.zeros((20, 20, 10000), dtype='complex')
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data[:, :, i] = h_r.dot(phase)

    return kpt_data

I would also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to check the correctness of your code - for example, if your last step is to multiply by a matrix of zeros then your output will always be all-zeros, regardless of whether or not there is a mistake earlier on in your code.
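
To make that concrete, here's a minimal sketch (my own illustration, not from the question) of how all-zero inputs can mask a bug:

import numpy as np

h = np.zeros((3, 3))
v = np.ones(3)
correct = h.dot(v)        # the intended computation
buggy = h.T.dot(v)        # accidental transpose - a real bug
print(np.allclose(correct, buggy))  # True: the zeros hide the mistake

With random data, the two results would (almost surely) differ and the bug would be caught immediately.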


Next, I would run these functions through line_profiler to see where they are spending most of their time. In particular, for original_vec:

In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s

Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    12                                           def original_vec(*args, **kwargs):
    13                                           
    14         1        86498  86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    15                                           
    16         1        69700  69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    17         1      1331947 1331947.0      5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    18         1     22271637 22271637.0     93.7      kpt_data = h_r.dot(phase)
    19                                           
    20         1            4      4.0      0.0      return kpt_data

You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We're computing a sum product over the last dimension of h_r and the first dimension of phase (you could write this in einsum notation as ijk,kl->ijl).
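
As a quick sanity check (a sketch with small stand-in shapes, not the original sizes), the 3D dot really is that einsum:

a = np.random.randn(2, 2, 5)      # stands in for h_r
b = np.random.randn(5, 3)         # stands in for phase
assert np.allclose(a.dot(b), np.einsum('ijk,kl->ijl', a, b))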


The first two dimensions of h_r don't really matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:

In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop

In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop
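
The two forms compute the same thing; a quick way to convince yourself is to reshape the result back and compare:

out1 = h_r.dot(phase)
out2 = h_r.reshape(-1, 1000).dot(phase).reshape(h_r.shape[:2] + (-1,))
assert np.allclose(out1, out2)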

I'm not entirely sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, suggesting that it might not be calling multithreaded BLAS routines.
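
If you want to check which BLAS your numpy build is linked against (the exact output varies by install), one option is:

import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries numpy was built with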


Here's a vectorized version that incorporates the reshaping operation:

def new_vec(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
    kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)

    return kpt_data.reshape(h_r.shape[:2] + (-1,))

The -1 in the reshape calls tells numpy to infer the size of that dimension from the other dimensions and the number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
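
A minimal sketch of that broadcasting equivalence (toy shapes, just for illustration):

x = np.ones((4, 3))
d = np.arange(1., 5.)                     # shape (4,)
tiled = x / np.tile(d.reshape(4, 1), 3)   # explicit tiling, as in the original
broadcast = x / d[:, None]                # broadcasting, no temporary tile
assert np.allclose(tiled, broadcast)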

By using the same random input data, we can check that the new version gives the same result as the original:

In [4]: ans1 = original_loop(seed=0)

In [5]: ans2 = new_vec(seed=0)

In [6]: np.allclose(ans1, ans2)
Out[6]: True

Some performance benchmarks:

In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop

In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop

In [5]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop

Update:

I was curious about why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be shockingly simple:

#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif

In other words, numpy just checks whether either of the input arrays has more than 2 dimensions, and immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact there's an open feature request for this dating back to 2012, if any numpy devs are reading...

In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
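
If you need this often, a small helper (my own sketch, assuming you only ever contract the last axis of a against the first axis of a 2D b) wraps the trick up:

def tensor_matmul(a, b):
    """Contract the last axis of N-D array a with the first axis of 2-D array b,
    routing the work through a single 2-D BLAS call."""
    return a.reshape(-1, a.shape[-1]).dot(b).reshape(a.shape[:-1] + (b.shape[1],))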


Update 2:

I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations:

In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop
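
And a quick check (a sketch) that axes=1 contracts the same pair of axes as the dot product above:

assert np.allclose(np.tensordot(h_r, phase, axes=1), h_r.dot(phase))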

I suspect the first operation is hitting a resource limit. Maybe you can benefit from these two questions: Efficient dot products of large memory-mapped arrays, and Dot product of huge arrays in numpy.
