
Any chance of making this faster? (numpy.einsum)

I'm trying to multiply three arrays (A x B x A) with dimensions (19000, 3) x (19000, 3, 3) x (19000, 3), so that at the end I get a 1d-array of size (19000); in other words, I want to contract only over the last one/two dimensions.

I've got it working with np.einsum(), but I'm wondering if there is any way of making it faster, as this is the bottleneck of my whole code.

np.einsum('...i,...ij,...j', A, B, A)

I've already tried it with two separate np.einsum() calls, but that gave me the same performance:

np.einsum('...i, ...i', np.einsum('...i,...ij', A, B), A)

I've also tried the @ operator with some additional axes added, but that didn't make it faster either:

(A[:, None]@B@A[...,None]).squeeze()

I've tried to get it working with np.inner(), np.dot(), np.tensordot() and np.vdot(), but these never gave me the same results, so I couldn't compare them.
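For reference, one non-einsum formulation that does reproduce the same result combines matmul with an elementwise multiply-and-sum (a sketch for checking equivalence, not necessarily a faster option):

```python
import numpy as np

A = np.random.rand(19000, 3)
B = np.random.rand(19000, 3, 3)

# B @ A[..., None] has shape (19000, 3, 1); drop the trailing axis,
# multiply elementwise by A and sum over the last axis -> shape (19000,)
res = np.sum(A * (B @ A[..., None])[..., 0], axis=-1)

# matches the einsum version
assert np.allclose(res, np.einsum('...i,...ij,...j', A, B, A))
```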

Any other ideas? Is there any way I could get better performance?

I've already had a quick look at Numba, but as Numba doesn't support np.einsum() and many other NumPy functions, I would have to rewrite a lot of code.

You could use Numba

In the beginning it is always a good idea to look at what np.einsum does. With optimize="optimal" it is usually really good at finding a contraction order with fewer FLOPs. In this case only a minor optimization is possible, and the intermediate array is relatively large (so I will stick to the naive version). It should also be mentioned that contractions over very small (fixed?) dimensions are a quite special case. This is also a reason why it is quite easy to outperform np.einsum here (unrolling etc., which a compiler does if it knows that a loop consists of only 3 elements).

import numpy as np

A = np.random.rand(19000, 3)
B = np.random.rand(19000, 3, 3)

print(np.einsum_path('...i,...ij,...j', A, B, A, optimize="optimal")[1])

"""
  Complete contraction:  si,sij,sj->s
         Naive scaling:  3
     Optimized scaling:  3
      Naive FLOP count:  5.130e+05
  Optimized FLOP count:  4.560e+05
   Theoretical speedup:  1.125
  Largest intermediate:  5.700e+04 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   3                  sij,si->js                                 sj,js->s
   2                    js,sj->s                                     s->s

"""

Numba implementation

import numba as nb

# si,sij,sj->s
@nb.njit(fastmath=True, parallel=True, cache=True)
def nb_einsum(A, B):
    # check the inputs at the beginning
    # I assume that the asserted shapes are always constant;
    # this makes it easier for the compiler to optimize
    assert A.shape[1] == 3
    assert B.shape[1] == 3
    assert B.shape[2] == 3

    # allocate the output
    res = np.empty(A.shape[0], dtype=A.dtype)

    for s in nb.prange(A.shape[0]):
        # accumulating into a scalar like this is also important for performance
        acc = 0
        for i in range(3):
            for j in range(3):
                acc += A[s, i] * B[s, i, j] * A[s, j]
        res[s] = acc
    return res

Timings

#warmup: the first call is always slower
#(due to compilation or loading the cached function)
res=nb_einsum(A,B)
%timeit nb_einsum(A,B)
#43.2 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.einsum('...i,...ij,...j', A, B, A,optimize=True)
#450 µs ± 8.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.einsum('...i,...ij,...j', A, B, A)
#977 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.allclose(np.einsum('...i,...ij,...j', A, B, A,optimize=True),nb_einsum(A,B))
#True
