繁体   English   中英

使用 numpy.einsum 删除循环

[英]removing loops with numpy.einsum

我有一些嵌套循环(总共三个),我试图使用 numpy.einsum 来加速计算,但我正在努力使符号正确。 我设法摆脱了一个循环,但我无法弄清楚另外两个。 这是我到目前为止所得到的:

import numpy as np
import time

def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for qi in range(nq):
            y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))
    return y

r = np.random.random(size=(1000,1000))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))

start = time.time()
y = myfunc(r, q, f)
end = time.time()

print(end-start)

虽然这比原来快得多,但这仍然太慢,大约需要 30 秒。 请注意,没有 einsum 调用的原始内容如下(看起来需要大约 2.5 小时,迫不及待地想确定):

def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for rj in range(nr):
            for qi in range(nq):
                y[qi] += f[ri,qi]*f[rj,qi]*np.sinc(q[qi]*r[ri,rj]/np.pi))
    return y

有谁知道如何使用 einsum 或任何其他工具摆脱这些循环?

您的功能似乎等同于以下内容:

# this is so called broadcasting
s = np.sinc(q * r[...,None]/np.pi)

np.einsum('iq,jq,ijq->q',f,f,s)

这在我的系统上花费了大约 20 秒,大部分时间用于分配s

让我们用一个小样本来测试它:

np.random.seed(1)
r = np.random.random(size=(10,10))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))
(np.abs(np.einsum('iq,jq,ijq->q',f,f,s) - myfunc(r,q,f)) < 1e-6).all()
# True

由于np.sinc不是线性运算符,我不太确定我们如何进一步减少运行时间。

正如@Quang Hoang 的帖子中提到的那样, sinc是实际的瓶颈。 我们将利用那里的einsum表达式来结束这样的一种方式 -

现在,从docsnumpy.sinc(x)是: \\sin(\\pi x)/(\\pi x) 我们将利用它——

v = q*r[...,None]
p = np.sin(v)/v
mask = (q==0) | (r==0)[...,None]
p[mask] = 1
out = np.einsum('iq,jq,ijq->q',f,f,p)

此外,对于大数据,我们可以使用numexpr来利用多核,就像这样 -

import numexpr as ne

p = ne.evaluate('sin(q*r3D)/(q*r3D)', {'r3D':r[...,None]})
mask = (q==0) | (r==0)[...,None]
p[mask] = 1
out = np.einsum('iq,jq,ijq->q',f,f,p)

具有 500 个长度数组的计时 -

In [12]: r = np.random.random(size=(500,500))
    ...: q = np.linspace(0,1,501)
    ...: f = np.random.random(size=(r.shape[0],q.shape[0]))

# Original soln with einsum
In [15]: %%timeit
    ...: nr = r.shape[0]
    ...: nq = q.shape[0]
    ...: y = np.zeros(nq)
    ...: for ri in range(nr):
    ...:     for qi in range(nq):
    ...:         y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))
9.75 s ± 977 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @Quang Hoang's soln
In [16]: %%timeit
    ...: s = np.sinc(q * r[...,None]/np.pi)
    ...: np.einsum('iq,jq,ijq->q',f,f,s)
2.75 s ± 7.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]: %%timeit
    ...: p = ne.evaluate('sin(q3D*r)/(q3D*r)', {'q3D':q[:,None,None]})
    ...: mask = (q==0)[:,None,None] | (r==0)
    ...: p[mask] = 1
    ...: out = np.einsum('iq,jq,qij->q',f,f,p)
1.39 s ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %%timeit
    ...: v = q*r[...,None]
    ...: p = np.sin(v)/v
    ...: mask = (q==0) | (r==0)[...,None]
    ...: p[mask] = 1
    ...: out = np.einsum('iq,jq,ijq->q',f,f,p)
2.11 s ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

对于更大的数据,我们希望numexpr one 的性能更好,只要我们不遇到内存不足的情况。

最简单的方法(可能也是最高效的)是使用编译器,例如 Numba。 由于此函数依赖于sinc函数,因此还要确保您已安装英特尔 SVML

例子

import numpy as np
import numba as nb

@nb.njit(fastmath=True,parallel=False,error_model="numpy",cache=True)
def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for rj in range(nr):
            for qi in range(nq):
                y[qi] += f[ri,qi]*f[rj,qi]*np.sinc(q[qi]*r[ri,rj]/np.pi)
    return y

@nb.njit(fastmath=True,parallel=True,error_model="numpy",cache=True)
def myfunc_opt(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.empty(nq)

    #for contiguous memory access in the loop
    f_T=np.ascontiguousarray(f.T)
    for qi in nb.prange(nq):
        acc=0
        for ri in range(nr):
            for rj in range(nr):
                acc += f_T[qi,ri]*f_T[qi,rj]*np.sinc(q[qi]*r[ri,rj]/np.pi)
        y[qi]=acc
    return y

@nb.njit(fastmath=True,parallel=True,error_model="numpy",cache=True)
def myfunc_opt_2(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.empty(nq)


    f_T=np.ascontiguousarray(f.T)
    for qi in nb.prange(nq):
        acc=0
        for ri in range(nr):
            for rj in range(nr):
                #Test carefully!
                if q[qi]*r[ri,rj]!=0.:
                    acc += f_T[qi,ri]*f_T[qi,rj]*np.sin(q[qi]*r[ri,rj])/(q[qi]*r[ri,rj])
                else:
                    acc += f_T[qi,ri]*f_T[qi,rj]
        y[qi]=acc
    return y

def numpy_func(r, q, f):
    s = np.sinc(q * r[...,None]/np.pi)
    return np.einsum('iq,jq,ijq->q',f,f,s)

小数组的时序

r = np.random.random(size=(500,500))
q = np.linspace(0,1,501)
f = np.random.random(size=(r.shape[0],q.shape[0]))
%timeit y = myfunc(r, q, f)
#765 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt(r, q, f)
#158 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt_2(r, q, f)
#51.5 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit y = numpy_func(r, q, f)
#3.81 s ± 61.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(np.allclose(numpy_func(r, q, f),myfunc(r, q, f)))
#True
print(np.allclose(numpy_func(r, q, f),myfunc_opt(r, q, f)))
#True
print(np.allclose(numpy_func(r, q, f),myfunc_opt_2(r, q, f)))

更大阵列的时序

r = np.random.random(size=(1000,1000))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))
%timeit y = myfunc(r, q, f)
#6.1 s ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt(r, q, f)
#1.26 s ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt_2(r, q, f)
#397 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

我写这个答案是因为我真的从@Quang Hoang 的帖子中学到了很多关于einsum知识,如果我分享我在解决这个问题的过程中的想法,它将加强我的理解。

问题是设计一个适当的einsum操作

y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))

  1. 查看相关数组的形状。 输入: r --> (a,a)q --> (c,)f --> (a,c) 输出: y --> (c,) 从这些形状中,对于q[qi]*r[ri,:] ,必须通过r[...,None]*q类的东西创建一个新的形状数组(a,a,c)

  2. 由于np.sinc不会改变数组的形状,而且对于固定的(ri,qi)f[ri,qi]只是一个数字,我们首先应该考虑什么einsum操作会重现f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi)即,如何获得形状的阵列(a,c)从形状的两个阵列(a,c)(a,a,c) 直观地,它是kl,ikl->il

  3. 对于一对固定的(ri,qi) ,因为f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi)给出了一个形状为(a,c)f的形状是(a,c) ,最终的操作只是il,il->l

根据上面的分析,我们有解决方案:

s = q*r[...,None]/np.pi
res = np.einsum('kl,ikl->il',f,s)
res = np.einsum('il,il->l',f,res)

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM