
Why is B = numpy.dot(A,x) so much slower than looping through doing B[i,:,:] = numpy.dot(A[i,:,:],x)?

I'm getting some efficiency test results that I can't explain.

I want to assemble a matrix B whose i-th entry is B[i,:,:] = A[i,:,:].dot(x), where each A[i,:,:] is a 2D matrix, and so is x.

I can do this three ways. To test performance, I make random (numpy.random.randn) matrices A = (10,1000,1000), x = (1000,1200). I get the following timing results:

(1) single multidimensional dot product

B = A.dot(x)

total time: 102.361 s

(2) looping through i and performing 2D dot products

   # initialize B = np.zeros([dim1, dim2, dim3])
   for i in range(A.shape[0]):
       B[i,:,:] = A[i,:,:].dot(x)

total time: 0.826 s

(3) numpy.einsum

B3 = np.einsum("ijk, kl -> ijl", A, x)

total time: 8.289 s

So, option (2) is the fastest by far. But, considering just (1) and (2), I don't see the big difference between them. How can looping through and doing 2D dot products be ~124 times faster? They both use numpy.dot. Any insights?

I include the code used for the above results just below:

import numpy as np
import numpy.random as npr
import time

dim1, dim2, dim3 = 10, 1000, 1200
A = npr.randn(dim1, dim2, dim2)
x = npr.randn(dim2, dim3)

# consider three ways of assembling the same matrix B: B1, B2, B3

t = time.time()
B1 = np.dot(A,x)
td1 = time.time() - t
print "a single dot product of A [shape = (%d, %d, %d)] with x [shape = (%d, %d)] completes in %.3f s" \
  % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td1)


B2 = np.zeros([A.shape[0], x.shape[0], x.shape[1]])
t = time.time()
for i in range(A.shape[0]):
    B2[i,:,:] = np.dot(A[i,:,:], x)
td2 = time.time() - t
print "taking %d dot products of 2D dot products A[i,:,:] [shape = (%d, %d)] with x [shape = (%d, %d)] completes in %.3f s" \
  % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td2)

t = time.time()
B3 = np.einsum("ijk, kl -> ijl", A, x)
td3 = time.time() - t
print "using np.einsum, it completes in %.3f s" % td3

With smaller dims 10,100,200, I get a similar ranking:

In [355]: %%timeit
   .....: B=np.zeros((N,M,L))
   .....: for i in range(N):
   .....:     B[i,:,:]=np.dot(A[i,:,:],x)
   .....:
10 loops, best of 3: 22.5 ms per loop
In [356]: timeit np.dot(A,x)
10 loops, best of 3: 44.2 ms per loop
In [357]: timeit np.einsum('ijk,km->ijm',A,x)
10 loops, best of 3: 29 ms per loop

In [367]: timeit np.dot(A.reshape(-1,M),x).reshape(N,M,L)
10 loops, best of 3: 22.1 ms per loop

In [375]: timeit np.tensordot(A,x,(2,0))
10 loops, best of 3: 22.2 ms per loop

The iterative version is faster, though not by as much as in your case.

This is probably true as long as the iterating dimension is small compared to the other ones. In that case the overhead of iteration (function calls etc.) is small compared to the calculation time. And doing all the values at once uses more memory.
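A minimal sketch of that comparison, with illustrative sizes N, M, L (a small iterating dimension next to larger inner dimensions), timing the looped 2-D dot against the single 3-D dot:

```python
import timeit

import numpy as np

# Illustrative sizes: small iterating dimension N, larger inner dims M, L.
N, M, L = 10, 200, 300
A = np.random.randn(N, M, M)
x = np.random.randn(M, L)

def looped(A, x):
    # One BLAS matrix-matrix multiply per 2-D slice of A.
    B = np.empty((A.shape[0], A.shape[1], x.shape[1]))
    for i in range(A.shape[0]):
        B[i, :, :] = np.dot(A[i, :, :], x)
    return B

t_loop = timeit.timeit(lambda: looped(A, x), number=20)
t_full = timeit.timeit(lambda: np.dot(A, x), number=20)
print("loop: %.4f s, full 3-D dot: %.4f s" % (t_loop, t_full))
```

Both paths produce the same B; only the route through BLAS differs.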

I tried a dot variation where I reshaped A into 2d, thinking that dot does that kind of reshaping internally. I'm surprised that it is actually fastest. tensordot is probably doing the same reshaping (its code is readable Python).


einsum sets up a 'sum of products' iteration involving 4 variables, i,j,k,m; that is dim1*dim2*dim2*dim3 steps with the C-level nditer. So the more indices you have, the larger the iteration space.
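For a concrete picture of that iteration space, here is a small sketch (toy shapes) spelling out the 4-index sum of products that the spec "ijk, kl -> ijl" describes:

```python
import numpy as np

# Toy shapes so the explicit 4-index loop stays cheap.
A = np.random.randn(3, 4, 5)
x = np.random.randn(5, 6)

B = np.einsum("ijk, kl -> ijl", A, x)

# The same result written as an explicit iteration over i, j, k, l:
# one multiply-add per point of the 3*4*5*6 index space.
B_ref = np.zeros((3, 4, 6))
for i in range(3):
    for j in range(4):
        for l in range(6):
            for k in range(5):
                B_ref[i, j, l] += A[i, j, k] * x[k, l]

print(np.allclose(B, B_ref))  # True
```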

numpy.dot only delegates to a BLAS matrix multiply when the inputs each have dimension at most 2:

#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif

When you stick your whole 3-dimensional A array into dot, NumPy takes a slower path, going through an nditer object. It still tries to get some use out of BLAS in the slow path, but the way the slow path is designed, it can only use a vector-vector multiply rather than a matrix-matrix multiply, which doesn't give the BLAS anywhere near as much room to optimize.

I am not too familiar with numpy's C-API; numpy.dot is one of the builtin functions that used to live under _dotblas in earlier versions.

Nevertheless, here are my thoughts.

1) numpy.dot takes different paths for 2-dimensional arrays and n-dimensional arrays. From numpy.dot's online documentation:

For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b:

dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
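A quick sketch (toy shapes) checking that documented rule numerically:

```python
import numpy as np

# dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m]) for N-D inputs:
# the last axis of a contracts with the second-to-last axis of b.
a = np.random.randn(2, 3, 4)
b = np.random.randn(5, 4, 6)
d = np.dot(a, b)  # result has shape (2, 3, 5, 6)

i, j, k, m = 1, 2, 3, 4
print(np.allclose(d[i, j, k, m], np.sum(a[i, j, :] * b[k, :, m])))  # True
```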

So for 2-D arrays you are always guaranteed one call to BLAS's dgemm. However, for N-D arrays numpy might choose multiplication axes that do not correspond to the fastest-changing axis (as you can see from the excerpt I posted), and as a result the full power of dgemm can be missed.

2) Your A array is too large to be loaded into the CPU cache. In your example, you use A with dimensions (10,1000,1000), which gives

In [1]: A.nbytes
80000000
In [2]: 80000000/1024
78125

That is almost 80MB, much larger than your cache size. So again you lose most of dgemm's power right there.
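The arithmetic behind that number, as a quick check:

```python
import numpy as np

A = np.zeros((10, 1000, 1000))  # float64, 8 bytes per element
print(A.nbytes)                 # 80000000 bytes, ~76 MiB in total
print(A[0].nbytes)              # 8000000 bytes, ~7.6 MiB per 2-D slice
```

Each 2-D slice alone is ~8MB, which is why the per-slice dot products sit much closer to cache-friendly territory than the whole array does.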

3) You are also timing the functions somewhat imprecisely. time.time in Python has limited resolution and measures a single wall-clock run. Use timeit instead.
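A hedged sketch of what that looks like with the timeit module instead of time.time (the sizes here are illustrative):

```python
import timeit

import numpy as np

A = np.random.rand(20, 20, 20)
x = np.random.rand(20, 20)

# Best of 3 repeats of 1000 runs each, mirroring IPython's %timeit report.
best = min(timeit.repeat(lambda: np.dot(A, x), repeat=3, number=1000))
print("best of 3: %.1f us per loop" % (best / 1000 * 1e6))
```

Taking the minimum over repeats filters out interference from other processes, which a single time.time measurement cannot do.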

So, with all the above points in mind, let's try experimenting with arrays that can be loaded into the cache:

dim1, dim2, dim3 = 20, 20, 20
A = np.random.rand(dim1, dim2, dim2)
x = np.random.rand(dim2, dim3)

def for_dot1(A,x):
    for i in range(A.shape[0]):
        np.dot(A[i,:,:], x)

def for_dot2(A,x):
    for i in range(A.shape[0]):
        np.dot(A[:,i,:], x)    

def for_dot3(A,x):
    for i in range(A.shape[0]):
        np.dot(A[:,:,i], x)  

and here are the timings that I get (using numpy 1.9.2 built against OpenBLAS 0.2.14):

In [3]: %timeit np.dot(A,x)
10000 loops, best of 3: 174 µs per loop
In [4]: %timeit np.einsum("ijk, kl -> ijl", A, x)
10000 loops, best of 3: 108 µs per loop
In [5]: %timeit np.einsum("ijk, lk -> ijl", A, x)
10000 loops, best of 3: 97.1 µs per loop
In [6]: %timeit np.einsum("ikj, kl -> ijl", A, x)
1000 loops, best of 3: 238 µs per loop
In [7]: %timeit np.einsum("kij, kl -> ijl", A, x)
10000 loops, best of 3: 113 µs per loop
In [8]: %timeit for_dot1(A,x)
10000 loops, best of 3: 101 µs per loop
In [9]: %timeit for_dot2(A,x)
10000 loops, best of 3: 131 µs per loop
In [10]: %timeit for_dot3(A,x)
10000 loops, best of 3: 133 µs per loop

Notice that there is still a time difference, but not orders of magnitude. Also note the importance of choosing the axis of multiplication. Now perhaps a numpy developer can shed some light on what numpy.dot actually does under the hood for N-D arrays.
