
Numpy array and for loop: how to improve it

I am new to Python and have a question about speeding up a for loop.

Let "u" be a NumPy array of shape (N, K) and "kernel_vect" a NumPy array of shape (K,), both of float64 values. I would like to speed up the following code (for example, by eliminating the for loop):

Kernel_appo = np.zeros((N**2,))
for k in range(K):
    uk = u[:,k]
    Mat_appo = np.outer(uk,uk)
    Kernel_appo = Kernel_appo + kernel_vect[k] * routines.vec(Mat_appo)
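
For reference, here is a self-contained reproduction of the question's computation, assuming `routines.vec` simply flattens a matrix row-major (like `np.ravel`) — that assumption and the demo sizes are mine, not from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 10
u = rng.standard_normal((N, K))
kernel_vect = rng.standard_normal(K)

# Original loop, with routines.vec assumed to be a row-major flatten (np.ravel)
Kernel_appo = np.zeros((N**2,))
for k in range(K):
    uk = u[:, k]                 # k-th column of u, shape (N,)
    Mat_appo = np.outer(uk, uk)  # rank-1 matrix, shape (N, N)
    Kernel_appo = Kernel_appo + kernel_vect[k] * np.ravel(Mat_appo)
```

So the result is the flattened weighted sum of K rank-1 matrices, one per column of u.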

Any idea? Thanks!

Here is a faster implementation without a loop over k:

# Version 2
Kernel_appo = np.zeros((N**2,))
for n1 in range(N):
    for n2 in range(N):
        Kernel_appo[n1*N+n2] = (u[n1,:] * u[n2,:] * kernel_vect).sum()
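
Version 2 works because entry (n1, n2) of the flattened result is just a kernel-weighted dot product of rows n1 and n2 of u. A quick sanity check against the rank-1 definition, on small random inputs (sizes chosen only for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 5, 8
u = rng.standard_normal((N, K))
kernel_vect = rng.standard_normal(K)

# Version 2: each entry is a kernel-weighted dot product of two rows of u
v2 = np.zeros((N**2,))
for n1 in range(N):
    for n2 in range(N):
        v2[n1*N + n2] = (u[n1, :] * u[n2, :] * kernel_vect).sum()

# Reference: sum_k kernel_vect[k] * outer(u[:, k], u[:, k]), flattened
ref = sum(kernel_vect[k] * np.outer(u[:, k], u[:, k]) for k in range(K)).ravel()
assert np.allclose(v2, ref)
```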

We can make it faster by using the symmetry of the u product:

# Version 3
Kernel_appo = np.zeros((N,N))
for n1 in range(N):
    for n2 in range(n1,N):
        Kernel_appo[n1,n2] = (u[n1,:] * u[n2,:] * kernel_vect).sum()
Kernel_appo = np.triu(Kernel_appo, 1) + np.tril(Kernel_appo.transpose(), 0) # make the matrix symmetric
Kernel_appo = np.ravel(Kernel_appo, order='C')
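
The triu/tril step mirrors the computed upper triangle (including the diagonal) into the lower triangle. A small standalone demonstration of that reconstruction (the matrix here is illustrative only):

```python
import numpy as np

# Only the upper triangle (with diagonal) was filled by the loop
A = np.array([[1., 2., 3.],
              [0., 4., 5.],
              [0., 0., 6.]])

# Strict upper triangle + transposed part with the diagonal = symmetric matrix
S = np.triu(A, 1) + np.tril(A.T, 0)
```

Each diagonal entry is kept exactly once, so nothing is double-counted.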

Here is a version that removes one of the loops:

# Version 4
Kernel_appo = np.zeros((N,N))
for n1 in range(N):
    Kernel_appo[n1,n1:N] = ((u[n1,:] * kernel_vect) * u[n1:N,:]).sum(axis=1)
Kernel_appo = np.triu(Kernel_appo, 1) + np.tril(Kernel_appo.transpose(), 0)
Kernel_appo = np.ravel(Kernel_appo, order='C')
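
The inner loop disappears thanks to broadcasting: the (K,) vector `u[n1,:] * kernel_vect` multiplies every row of the (N - n1, K) block `u[n1:N,:]`, and the sum over axis 1 produces the whole row at once. A check of that one-row computation (demo sizes and the choice of n1 are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 5, 9
u = rng.standard_normal((N, K))
kernel_vect = rng.standard_normal(K)

n1 = 2
# The (K,) vector broadcasts against the (N - n1, K) block; sum collapses K
row = ((u[n1, :] * kernel_vect) * u[n1:N, :]).sum(axis=1)

# Element-by-element reference for the same row
expected = np.array([(u[n1] * u[n2] * kernel_vect).sum() for n2 in range(n1, N)])
assert np.allclose(row, expected)
```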

We still have a loop over N. However, keeping it seems reasonable since N is small. Removing it would force NumPy to create huge temporary matrices in memory, which would hurt performance (and can even crash if N and K are very big).

Note that Version 4 will probably not be as fast if K is much bigger (since the temporary NumPy arrays may not fit in the CPU cache).

UPDATE: I just discovered that it is possible to use the awesome np.einsum in this case:

# Version 5
Kernel_appo = np.ravel(np.einsum('ji,ki,i->jk', u, u, kernel_vect, optimize=True), order='C')
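
In the subscripts 'ji,ki,i->jk', the shared index i runs over the K dimension and is summed away, so entry (j, k) is exactly the kernel-weighted dot product of rows j and k of u. A quick equivalence check against the direct definition (small demo sizes, my own variable names):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 4, 7
u = rng.standard_normal((N, K))
kernel_vect = rng.standard_normal(K)

result = np.einsum('ji,ki,i->jk', u, u, kernel_vect, optimize=True)

# Direct definition: result[j, k2] = sum_i u[j, i] * u[k2, i] * kernel_vect[i]
direct = np.array([[np.sum(u[j] * u[k2] * kernel_vect) for k2 in range(N)]
                   for j in range(N)])
assert np.allclose(result, direct)
```

Note that the same contraction can also be written as the matrix product `(u * kernel_vect) @ u.T`, which makes the structure (a weighted Gram matrix of the rows of u) explicit.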

Be prepared, because this simpler implementation is also much faster (NumPy is able to vectorize the code and run it in parallel).

Here are performance results with N=50 and K=5000 on my machine:

Initial code: 58.15 ms
Version 2:    19.94 ms
Version 3:    10.11 ms
Version 4:     5.08 ms
Version 5:     0.57 ms
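
The numbers above are machine-dependent; a minimal timeit harness to reproduce the Version 5 measurement could look like this (the setup with random data is an assumption, not from the original benchmark):

```python
import timeit
import numpy as np

rng = np.random.default_rng(3)
N, K = 50, 5000
u = rng.standard_normal((N, K))
kernel_vect = rng.standard_normal(K)

def version5():
    return np.ravel(np.einsum('ji,ki,i->jk', u, u, kernel_vect,
                              optimize=True), order='C')

t = timeit.timeit(version5, number=20) / 20
print(f"Version 5: {t * 1e3:.2f} ms per run")
```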

The final implementation is now about 100 times faster than the initial one!
