
Why is numpy's einsum slower than numpy's built-in functions?

I've usually gotten good performance out of numpy's einsum function (and I like its syntax). @Ophion's answer to this question shows that, for the cases tested, einsum consistently outperforms the "built-in" functions (sometimes by a little, sometimes by a lot). But I just encountered a case where einsum is much slower. Consider the following equivalent functions:

import numpy as np

(M, K) = (1000000, 20)
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    # BLAS matrix product, then a row-wise sum of the elementwise product.
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)

def func_einsum(C, X):
    # One fused einsum contraction over both summation labels, k and m.
    return np.einsum('ik,km,im->i', X, C, X)

def func_einsum2(C, X):
    # Like func_einsum but break it into two steps.
    A = np.einsum('ik,km', X, C)
    return np.einsum('ik,ik->i', A, X)
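
All three compute out[i] = sum_k sum_m X[i,k] * C[k,m] * X[i,m], i.e. the diagonal of X.dot(C).dot(X.T). A quick sanity check on small illustrative inputs (Xs and Cs here are stand-ins, not the arrays timed below):

# Check the three variants agree on toy-sized inputs.
Xs = np.random.rand(10, 5)
Cs = np.random.rand(5, 5)
ref = np.diag(Xs.dot(Cs).dot(Xs.T))
assert np.allclose(func_dot(Cs, Xs), ref)
assert np.allclose(func_einsum(Cs, Xs), ref)
assert np.allclose(func_einsum2(Cs, Xs), ref)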

I expected func_einsum to run fastest, but that is not what I encounter. Running on a quad-core CPU with hyperthreading, with numpy version 1.9.0.dev-7ae0206 and a multithreaded OpenBLAS, I get the following results:

In [2]: %time y1 = func_dot(C, X)
CPU times: user 320 ms, sys: 312 ms, total: 632 ms
Wall time: 209 ms
In [3]: %time y2 = func_einsum(C, X)
CPU times: user 844 ms, sys: 0 ns, total: 844 ms
Wall time: 842 ms
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 292 ms, sys: 44 ms, total: 336 ms
Wall time: 334 ms

When I increase K to 200, the differences are more extreme:

In [2]: %time y1= func_dot(C, X)
CPU times: user 4.5 s, sys: 1.02 s, total: 5.52 s
Wall time: 2.3 s
In [3]: %time y2= func_einsum(C, X)
CPU times: user 1min 16s, sys: 44 ms, total: 1min 16s
Wall time: 1min 16s
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 15.3 s, sys: 312 ms, total: 15.6 s
Wall time: 15.6 s

Can someone explain why einsum is so much slower here?

If it matters, here is my numpy config:

In [6]: np.show_config()
lapack_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
atlas_threads_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
    language = c
    include_dirs = ['/usr/local/include']
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/local/include']
atlas_blas_threads_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/local/include']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
    language = f77
    include_dirs = ['/usr/local/include']
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

You can have the best of both worlds:

def func_dot_einsum(C, X):
    # BLAS for the matrix product, einsum for the row-wise reduction.
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)
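
A quick check that the hybrid really computes the same quantity:

# Sanity check: the hybrid agrees with the question's variants.
assert np.allclose(func_dot_einsum(C, X), func_dot(C, X))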

On my system:

In [7]: %timeit func_dot(C, X)
10 loops, best of 3: 31.1 ms per loop

In [8]: %timeit func_einsum(C, X)
10 loops, best of 3: 105 ms per loop

In [9]: %timeit func_einsum2(C, X)
10 loops, best of 3: 43.5 ms per loop

In [10]: %timeit func_dot_einsum(C, X)
10 loops, best of 3: 21 ms per loop

When available, np.dot uses BLAS, MKL, or whatever optimized library you have. So the call to np.dot is almost certainly multithreaded. np.einsum has its own loops, so it doesn't use any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation.
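
One way to probe the threading hypothesis is to pin the BLAS thread pool to a single thread and re-time func_dot. This is only a sketch: it assumes your OpenBLAS build honors the OPENBLAS_NUM_THREADS / OMP_NUM_THREADS environment variables, which must be set before numpy first loads the library:

import os
# Set these before the first `import numpy`; OpenBLAS reads them at load time.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

import numpy as np
# ...recreate C, X and the functions above, then re-run the %time calls.
# If func_dot's advantage shrinks, multithreading explains part of the gap.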


Then there's the multi-input einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that the following is speculative at best, but here's what I think is going on...

When you run something like np.einsum('ij,ij->i', a, b), the benefit over doing np.sum(a*b, axis=1) comes from avoiding having to instantiate the intermediate array with all the products, and looping over it twice. So at the low level what goes on is something like:

for i in range(I):
    out[i] = 0
    for j in range(J):
        out[i] += a[i, j] * b[i, j]
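
A toy check (my own sketch, not code from the einsum source) that this loop matches the vectorized forms:

import numpy as np

a = np.random.rand(4, 3)
b = np.random.rand(4, 3)

out = np.zeros(a.shape[0])
for i in range(a.shape[0]):          # loop over rows ('i')
    for j in range(a.shape[1]):      # accumulate products along 'j'
        out[i] += a[i, j] * b[i, j]

assert np.allclose(out, np.einsum('ij,ij->i', a, b))
assert np.allclose(out, np.sum(a * b, axis=1))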

Say now that you are after something like:

np.einsum('ij,jk,ik->i', a, b, c)

You could do the same operation as

np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))

And what I think einsum does is run this last piece of code without having to instantiate the huge intermediate array (for the question's sizes that array would have shape (1000000, 20, 20), about 3.2 GB of float64), which certainly makes things faster:

In [29]: a, b, c = np.random.rand(3, 100, 100)

In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c)
100 loops, best of 3: 2.41 ms per loop

In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
100 loops, best of 3: 12.3 ms per loop

But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level:

for i in range(I):
    out[i] = 0
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]

But you are repeating a ton of operations! If you instead did:

for i in range(I):
    out[i] = 0
    for j in range(J):
        temp = 0
        for k in range(K):
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp

you would be doing I * J * (K-1) fewer multiplications (at the cost of I * J extra additions), and save yourself a ton of time. For the question's 'ik,km,im->i' case, with I = 1000000 and J = K = 20, that is about 3.8e8 multiplications saved. My guess is that einsum is not smart enough to optimize things at this level. In the source code there is a hint that it only optimizes operations with 1 or 2 operands, not 3. In any case, automating this for general inputs seems anything but simple...
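
Here is a toy-sized check (my sketch) that the factored loop matches the full contraction, and that it is really the two-step func_einsum2 strategy in disguise:

import numpy as np

I, J, K = 5, 4, 3
a = np.random.rand(I, J)
b = np.random.rand(J, K)
c = np.random.rand(I, K)

out = np.zeros(I)
for i in range(I):
    for j in range(J):
        temp = 0.0
        for k in range(K):               # inner dot product over 'k'
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp         # one multiply per (i, j)

assert np.allclose(out, np.einsum('ij,jk,ik->i', a, b, c))
# temp[i, j] is just c.dot(b.T)[i, j], so the factoring is equivalent to:
assert np.allclose(out, np.sum(a * c.dot(b.T), axis=1))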

einsum has a specialized case for '2 operands, ndim=2'. In this case there are 3 operands, and a total of 3 iteration dimensions, so it has to use a general nditer.

While trying to understand how the string input is parsed, I wrote a pure Python einsum simulator, https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py

The (stripped down) einsum and sum-of-products functions are:

def myeinsum(subscripts, *ops, **kwargs):
    # drop-in replacement for np.einsum (more or less)
    <parse subscript strings>
    <prepare op_axes>
    x = sum_of_prod(ops, op_axes, **kwargs)
    return x

def sum_of_prod(ops, op_axes,...):
    ...
    it = np.nditer(ops, flags, op_flags, op_axes)
    it.operands[nop][...] = 0
    it.reset()
    for (x,y,z,w) in it:
        w[...] += x*y*z
    return it.operands[nop]

Debugging output for myeinsum('ik,km,im->i', X, C, X, debug=True) with (M, K) = (10, 5):

{'max_label': 109, 
 'min_label': 105, 
 'nop': 3, 
 'shapes': [(10, 5), (5, 5), (10, 5)], 
 ....}}
 ...
iter labels: [105, 107, 109],'ikm'

op_axes [[0, 1, -1], [-1, 0, 1], [0, -1, 1], [0, -1, -1]]
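
For illustration, feeding those op_axes straight into np.nditer reproduces the general sum-of-products path. This is a hedged sketch of the fallback strategy (the standard reduction pattern from the nditer docs), not the actual C implementation:

import numpy as np

M, K = 10, 5
X = np.random.rand(M, K)
C = np.random.rand(K, K)

# Iterator dimensions are (i, k, m); -1 marks an axis an operand lacks.
op_axes = [[0, 1, -1],   # X viewed as (i, k)
           [-1, 0, 1],   # C viewed as (k, m)
           [0, -1, 1],   # X viewed as (i, m)
           [0, -1, -1]]  # output viewed as (i,)

out = np.zeros(M)
it = np.nditer([X, C, X, out], flags=['reduce_ok'],
               op_flags=[['readonly']] * 3 + [['readwrite']],
               op_axes=op_axes)
for x, y, z, w in it:        # one scalar element per (i, k, m) triple
    w[...] += x * y * z

assert np.allclose(out, np.einsum('ik,km,im->i', X, C, X))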

If you write a sum-of-prod function like this in Cython you should get something close to the generalized einsum.

With the full (M, K), this simulated einsum is 6-7x slower.


Some timings building on the other answers:

In [84]: timeit np.dot(X,C)
1 loops, best of 3: 781 ms per loop

In [85]: timeit np.einsum('ik,km->im',X,C)
1 loops, best of 3: 1.28 s per loop

In [86]: timeit np.einsum('im,im->i',A,X)
10 loops, best of 3: 163 ms per loop

This 'im,im->i' step is substantially faster than the other. The sum dimension, m, is only 20. I suspect einsum is treating this as a special case.

In [87]: timeit np.einsum('im,im->i',np.dot(X,C),X)
1 loops, best of 3: 950 ms per loop

In [88]: timeit np.einsum('im,im->i',np.einsum('ik,km->im',X,C),X)
1 loops, best of 3: 1.45 s per loop

The times for these composite calculations are simply sums of the corresponding pieces: 950 ms ≈ 781 ms + 163 ms, and 1.45 s ≈ 1.28 s + 163 ms.
