
Why is vectorized numpy code slower than for loops?

I have two numpy arrays, X and Y , with shapes (n,d) and (m,d) , respectively. Assume that we want to compute the Euclidean distances between each row of X and each row of Y and store the result in an array Z with shape (n,m) . I have two implementations for this. The first implementation uses two for loops as follows:

for i in range(n):
    for j in range(m):
        Z[i,j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))

The second implementation uses only one loop by vectorization:

for i in range(n):
    Z[i] = np.sqrt(np.sum(np.square(X[i] - Y), axis=1))

When I run these codes on particular X and Y data, the first implementation takes nearly 30 seconds while the second takes nearly 60 seconds. I expected the second implementation to be faster since it uses vectorization. What is the reason for its slowness? I know that we can obtain faster implementations by fully vectorizing the code, but I don't understand why the second version (which is partially vectorized) is slower than the non-vectorized one.

Here is the complete code:

import numpy as np
import time

n,m,d = 5000,500,3000
X = np.random.rand(n,d)
Y = np.random.rand(m,d)
Z = np.zeros((n,m))

tic = time.time()
for i in range(n):
    for j in range(m):
        Z[i,j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))
print('Elapsed time 1: ', time.time()-tic)

tic = time.time()
for i in range(n):
    Z[i] = np.sqrt(np.sum(np.square(X[i]-Y), axis=1))
print('Elapsed time 2: ', time.time()-tic)


tic = time.time()
train_squared = np.square(X).sum(axis=1).reshape((1,n))
test_squared = np.square(Y).sum(axis=1).reshape((m,1))
test_train = -2*np.matmul(Y, X.T)
dists = np.sqrt(test_train + train_squared + test_squared)
print('Elapsed time 3: ', time.time()-tic)

And this is the output:

Elapsed time 1:  35.659096002578735
Elapsed time 2:  65.57051086425781
Elapsed time 3:  0.3912069797515869

I pulled apart your equations and reduced it down to this MVCE:

for i in range(n):
    for j in range(m):
        Y[j].copy()

for i in range(n):
    Y.copy()

The copy() here is just to simulate the subtraction from X . The subtraction itself should be quite cheap.
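For anyone who wants to reproduce this, a self-contained version of the benchmark is sketched below. The sizes are scaled down from the question's n=5000 to n=500 (my choice, so it finishes in a few seconds); both loops still copy exactly the same total amount of data, and absolute timings will of course vary by machine:

```python
import time
import numpy as np

n, m, d = 500, 500, 3000  # n scaled down from the question's 5000
Y = np.random.rand(m, d)

# Row-by-row: each copy is d = 3000 doubles (~24 KB), small enough for L1 cache.
tic = time.time()
for i in range(n):
    for j in range(m):
        Y[j].copy()
t_rows = time.time() - tic

# Whole-array: each copy is m*d doubles (~12 MB), far too big for any cache.
tic = time.time()
for i in range(n):
    Y.copy()
t_whole = time.time() - tic

print(f"row-by-row copies: {t_rows:.2f}s, whole-array copies: {t_whole:.2f}s")
```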

Here are the results on my computer:

  • The first one took 10ms.
  • The second one took 13s!

I'm copying the exact same amount of data in both cases. Using your choices n=5000, m=500, d=3000 , this code is copying 60 gigabytes of data.
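The 60 GB figure follows directly from the array sizes: each of the n iterations touches the full m×d array of 8-byte floats. A quick sanity check:

```python
n, m, d = 5000, 500, 3000
bytes_per_float = 8  # float64

# Total data copied by the one-loop version: n copies of the (m, d) array.
total_bytes = n * m * d * bytes_per_float
print(total_bytes / 1e9, "GB")  # 60.0 GB
```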

To be honest, I'm not at all surprised that it took 13 seconds. That's already over 4GB/s, essentially the maximum bandwidth between my CPU and RAM (e.g. via memcpy ).

The really surprising thing is that the first test managed to copy 60GB in only 0.01 seconds, which translates to 6TB/s!

I'm pretty sure this is because the data isn't actually leaving the CPU at all. It's just bouncing back and forth between the CPU and the L1 cache: an array of 3000 double-precision numbers easily fits in a 32KiB L1 cache.

Therefore, I deduce that the main reason your second algorithm isn't as fast as one would naively expect is that processing a whole chunk of 500×3000 elements per iteration is very unfriendly to the CPU cache: you basically evict the whole cache into RAM! In contrast, your first algorithm does take advantage of the cache to some extent, because the 3000 elements are still in cache by the time the sum gets computed, so there isn't nearly as much data moving between your CPU and RAM. (Once you have the sum, the 3000-element array is "thrown away", which means it will probably just get overwritten in cache and never make it back to the actual RAM.)


Naturally, doing matrix multiplication is insanely faster, because your problem is essentially of the form:

C[i, j] = ∑[k] f(A[i, k], B[j, k])

If you replace f(x, y) with x * y , you can see it's just a variant of matrix multiplication. The operation f is not extremely important here − what is important is how the indices behave in this equation, which determines how your arrays are laid out and accessed in memory. The essence of matrix multiplication algorithms lies in the ability to cope with this kind of array access through blocking , so in principle the overall algorithm does not change dramatically even for a user-defined f . Unfortunately, in practice there are very few libraries that support user-defined operations, so you have to use the trick (X - Y)**2 = X**2 - 2 X Y + Y**2 as you have done. But it gets the job done :D
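On a small example (sizes chosen arbitrarily here, just for verification), the expansion trick can be checked against the naive double loop from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 40, 30, 20  # small sizes, just for verification
X = rng.random((n, d))
Y = rng.random((m, d))

# Naive double loop, as in the question's first implementation.
Z_loop = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        Z_loop[i, j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))

# Fully vectorized version via the expansion (x - y)^2 = x^2 - 2xy + y^2.
sq_X = np.square(X).sum(axis=1).reshape(n, 1)
sq_Y = np.square(Y).sum(axis=1).reshape(1, m)
cross = -2 * X @ Y.T
Z_fast = np.sqrt(np.maximum(sq_X + sq_Y + cross, 0))  # clamp tiny negative rounding errors

print(np.allclose(Z_loop, Z_fast))  # True
```

Note the np.maximum clamp: rounding in the expanded form can produce slightly negative values where the true distance is near zero, which would otherwise turn into NaN under the square root.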
