
Broadcasting outside main loop speeds up vectorized numpy ops?

I'm doing some vectorized algebra using numpy, and the wall-clock performance of my algorithm seems weird. The program roughly does the following:

  1. Create three matrices: Y (KxD), X (NxD), T (KxN)
  2. For each row of Y:
  3. subtract Y[i] from each row of X (by broadcasting),
  4. square the differences along one axis, sum them, take a square root, then store the result in T.
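In other words, T ends up holding the pairwise Euclidean distances between the rows of Y and the rows of X. A minimal sketch of the steps above (sizes shrunk here purely for illustration; everything else matches the description):

```python
import numpy as np

# Small sizes for illustration; the question uses D=128, N=3000, K=500.
D, N, K = 4, 6, 3
X = np.random.rand(N, D)
Y = np.random.rand(K, D)

T = np.zeros((K, N))
for i in range(K):
    # Y[i] has shape (D,) and broadcasts against X with shape (N, D).
    T[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))

# Spot-check: T[i, j] is the Euclidean distance between Y[i] and X[j].
assert np.isclose(T[1, 2], np.linalg.norm(X[2] - Y[1]))
```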

However, depending on how I perform the broadcasting, computation speed is vastly different. Consider the code:

import numpy as np
from time import perf_counter

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

if True: # negate to enable the second loop
    time = 0.0
    for rep in range(100):
        start = perf_counter()
        for i in range(K):
            T[i] = np.sqrt(np.sum(
                np.square(
                  X - Y[i] # this has dimensions NxD
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast in line: {:.3f} s".format(time / 100))
    exit()

if True: # only reached if the first block above is disabled
    time = 0.0
    for rep in range(100):
        start = perf_counter()
        for i in range(K):
            diff = X - Y[i]
            T[i] = np.sqrt(np.sum(
                np.square(
                  diff
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast out:     {:.3f} s".format(time / 100))
    exit()

Times for each loop are measured individually and averaged over 100 executions. The results:

Broadcast in line: 1.504 s
Broadcast out:     0.438 s

The only difference is that the broadcasting and subtraction in the first loop are done in-line, while in the second approach I do them before any other vectorized operations. Why does this make such a difference?

My system configuration:

  • Lenovo ThinkStation P920, 2x Xeon Silver 4110, 64 GB RAM
  • Xubuntu 18.04.2 LTS (bionic)
  • Python 3.7.3 (GCC 7.3.0)
  • Numpy 1.16.3 linked against OpenBLAS (that's as much as np.__config__.show() tells me)

PS: Yes, I am aware this could be further optimized, but right now I would like to understand what happens under the hood here.
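(For reference, the optimization alluded to here usually goes one of two ways; both are sketches, not claims about what is fastest on this machine. The first vectorizes the loop away at the cost of a large (K, N, D) temporary; the second uses the expansion ||y - x||² = ||y||² + ||x||² - 2·y·x so the heavy lifting becomes a single matrix product:)

```python
import numpy as np

# Shrunk sizes so the (K, N, D) temporary stays small; with the question's
# D=128, N=3000, K=500 it would be roughly 1.5 GB of float64.
D, N, K = 128, 300, 50
X = np.random.rand(N, D)
Y = np.random.rand(K, D)

# Variant 1: one big broadcast, then reduce over the last axis.
T_a = np.sqrt(np.square(Y[:, None, :] - X[None, :, :]).sum(axis=2))

# Variant 2: ||y||^2 + ||x||^2 - 2 y.x, computed with one GEMM call.
sq = (Y**2).sum(axis=1)[:, None] + (X**2).sum(axis=1)[None, :] - 2.0 * (Y @ X.T)
T_b = np.sqrt(np.maximum(sq, 0.0))  # clip tiny negative rounding errors

assert np.allclose(T_a, T_b)
```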

It's not a broadcasting problem

I also added an optimized solution to see how long the actual calculation takes without the large overhead of memory allocation and deallocation.

Functions

import numpy as np
import numba as nb

def func_1(X, Y, T):
    for i in range(Y.shape[0]):
        T[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))
    return T

def func_2(X, Y, T):
    for i in range(Y.shape[0]):
        diff = X - Y[i]
        T[i] = np.sqrt(np.sum(np.square(diff), axis=1))
    return T

@nb.njit(fastmath=True, parallel=True)
def func_3(X, Y, T):
    for i in nb.prange(Y.shape[0]):
        for j in range(X.shape[0]):
            diff_sq_sum = 0.
            for k in range(X.shape[1]):
                diff_sq_sum += (X[j, k] - Y[i, k])**2
            T[i, j] = np.sqrt(diff_sq_sum)
    return T

Timings

I did all the timings in a Jupyter notebook and observed some really weird behavior. The following code is in one cell. I also tried calling timeit multiple times, but on the first execution of the cell this doesn't change anything.

First execution of the cell

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

# Running this more often would not change anything
%timeit func_1(X,Y,T)
%timeit func_1(X,Y,T)

# Running this more often would not change anything
%timeit func_2(X,Y,T)
%timeit func_2(X,Y,T)

###Avoid measuring compilation overhead###
%timeit func_3(X,Y,T)
##########################################
%timeit func_3(X,Y,T)

774 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
768 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.7 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.74 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Second execution

345 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
337 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
322 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.93 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
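The allocation-overhead point can also be probed in pure NumPy by reusing a single preallocated scratch buffer through the `out=` arguments of the ufuncs (a sketch; the name `func_2_prealloc` is mine, and this variant was not part of the measurements above):

```python
import numpy as np

def func_2_prealloc(X, Y, T, buf):
    # buf is a preallocated (N, D) scratch array reused on every iteration,
    # so the loop body allocates no fresh temporaries.
    for i in range(Y.shape[0]):
        np.subtract(X, Y[i], out=buf)       # buf = X - Y[i]
        np.square(buf, out=buf)             # buf = (X - Y[i])**2, in place
        np.sum(buf, axis=1, out=T[i])       # row sums written straight into T
        np.sqrt(T[i], out=T[i])             # square root, in place
    return T

D, N, K = 128, 300, 50
X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = func_2_prealloc(X, Y, np.zeros((K, N)), np.empty((N, D)))
```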
