Speeding up the loops of an FFT in Python (with `np.einsum`)

Question

Problem: I want to speed up my Python loop, which contains a lot of products and summations, with np.einsum, but I'm also open to any other solution.

My function takes a vector configuration S of shape (n,n,3) (in my case n=72) and computes the Fourier transform of the correlation function at N*N points. The correlation function is defined as the dot product of every vector with every other vector. This is multiplied by a cosine of the vector positions times the kx and ky values. Finally, the contributions of all position pairs i,j are summed to give one point p,m in k-space:

import numpy as np

def spin_spin(S,N):
    n= len(S)
    conf = np.reshape(S,(n**2,3))
    chi = np.zeros((N,N))
    kx = np.linspace(-5*np.pi/3,5*np.pi/3,N)
    ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N)

    x=np.reshape(triangular(n)[0],(n**2))
    y=np.reshape(triangular(n)[1],(n**2))
    for p in range(N):
        for m in range(N):
            for i in range(n**2):
                for j in range(n**2):        
                    chi[p,m] += 2/(n**2)*np.dot(conf[i],conf[j])*np.cos(kx[p]*(x[i]-x[j])+ ky[m]*(y[i]-y[j]))
    return(chi,kx,ky)
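Written out as a formula, the loop above computes the following for each pair of grid indices (p, m). The notation is introduced only for this sketch: S_i stands for row i of conf, and (x_i, y_i) for its position on the lattice.

```latex
\chi\!\left(k_x^{(p)}, k_y^{(m)}\right)
  = \frac{2}{n^2} \sum_{i=1}^{n^2} \sum_{j=1}^{n^2}
    \mathbf{S}_i \cdot \mathbf{S}_j \,
    \cos\!\left( k_x^{(p)} (x_i - x_j) + k_y^{(m)} (y_i - y_j) \right)
```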

My problem is that I need roughly 100*100 points in k-space (the kx, ky grid), and the loop needs many hours to finish for a lattice with 72*72 vectors. Number of calculations: 72*72*72*72*100*100. I cannot use NumPy's built-in FFT because of my triangular grid, so I need some other way to reduce the computational cost.

My idea: First, I noticed that reshaping the configuration into a list of vectors instead of a matrix reduces the computational cost. I also used the numba package, which reduced the cost further, but it is still too slow. I found out that a good way of calculating this kind of object is the np.einsum function. The product of every vector with every other vector is computed as follows:

np.einsum('ij,kj -> ik',np.reshape(S,(72**2,3)),np.reshape(S,(72**2,3)))
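As a small sanity check of that expression, the sketch below compares the einsum call with an ordinary matrix product on a tiny random configuration. The array `conf` here is a made-up stand-in for `np.reshape(S,(n**2,3))` with n = 4, not data from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
conf = rng.normal(size=(16, 3))  # stand-in for np.reshape(S, (n**2, 3)) with n = 4

# 'ij,kj->ik' sums over the shared component index j,
# giving the dot product of row i with row k.
dots_einsum = np.einsum('ij,kj->ik', conf, conf)

# The same pairwise dot products written as a matrix product.
dots_matmul = conf @ conf.T

same = np.allclose(dots_einsum, dots_matmul)
```

Both forms give an (n**2, n**2) matrix of pairwise dot products, so either can be used to precompute them once.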

The tricky part is the term inside np.cos. Here I want to calculate the product between an array of shape (100,1) and the positions of the vectors (e.g. np.shape(x)=(72**2,1)). In particular, I don't know how to implement the distances in the x- and y-directions with np.einsum.
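One way the cosine term could be expressed with broadcasting and np.einsum is sketched below. The function name `chi_einsum` and the tiny random inputs are assumptions made for illustration. The intermediate array has shape (N, N, n**2, n**2), which for n = 72 and N = 100 would be about 2.7e11 values, so in this form it is only usable on small lattices or when applied to small chunks of kx and ky.

```python
import numpy as np

def chi_einsum(conf, x, y, kx, ky):
    """Fully vectorized chi (hypothetical helper); memory grows as N*N*n**4."""
    n_sites = conf.shape[0]
    dots = np.einsum('ij,kj->ik', conf, conf)   # pairwise dot products, (n_sites, n_sites)
    dx = x[:, None] - x[None, :]                # pairwise x differences, (n_sites, n_sites)
    dy = y[:, None] - y[None, :]                # pairwise y differences
    # Outer products of the k values with the differences, broadcast to (P, M, n_sites, n_sites).
    phase = (np.einsum('p,ij->pij', kx, dx)[:, None, :, :]
             + np.einsum('m,ij->mij', ky, dy)[None, :, :, :])
    # Contract the pair indices i, j for every (p, m).
    return 2.0 / n_sites * np.einsum('ij,pmij->pm', dots, np.cos(phase))

# Tiny made-up example: 9 sites, a 4 x 5 grid in k-space.
rng = np.random.default_rng(1)
n_sites = 9
conf = rng.normal(size=(n_sites, 3))
x = rng.normal(size=n_sites)
y = rng.normal(size=n_sites)
kx = np.linspace(-1.0, 1.0, 4)
ky = np.linspace(-2.0, 2.0, 5)
chi = chi_einsum(conf, x, y, kx, ky)
```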

To reproduce the code (you probably won't need this): First you need a vector configuration. You can simply use np.ones((72,72,3)), or generate random vectors, for example with:

import random

def spherical_to_cartesian(r, theta, phi):
    '''Convert spherical coordinates (physics convention) to cartesian coordinates'''
    sin_theta = np.sin(theta)
    x = r * sin_theta * np.cos(phi)
    y = r * sin_theta * np.sin(phi)
    z = r * np.cos(theta)

    return x, y, z # return a tuple

def random_directions(n, r):
    '''Return ``n`` 3-vectors in random directions with radius ``r``'''
    out = np.empty(shape=(n,3), dtype=np.float64)

    for i in range(n):
        # Pick directions randomly in solid angle
        phi = random.uniform(0, 2*np.pi)
        theta = np.arccos(random.uniform(-1, 1))
        # unpack a tuple
        x, y, z = spherical_to_cartesian(r, theta, phi)
        out[i] = x, y, z

    return out
S = np.reshape(random_directions(72**2,1),(72,72,3))

(The reshape in this example is needed so that the function spin_spin can reshape it back to the (72**2,3) shape.)

For the positions of the vectors I use a triangular grid defined by

def triangular(nsize):
    '''Positional arguments of the spin configuration'''

    X=np.zeros((nsize,nsize))
    Y=np.zeros((nsize,nsize))
    for i in range(nsize):
        for j in range(nsize):
            X[i,j]+=1/2*j+i
            Y[i,j]+=np.sqrt(3)/2*j
    return(X,Y)
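The same grid can also be built without Python loops. The following is a minimal sketch using np.indices; the name `triangular_vectorized` is made up here and is not part of the original code.

```python
import numpy as np

def triangular_vectorized(nsize):
    """Loop-free variant of triangular(): positions on the triangular grid."""
    # i[a, b] == a and j[a, b] == b, matching the loop indices of the original.
    i, j = np.indices((nsize, nsize))
    X = 0.5 * j + i
    Y = np.sqrt(3) / 2 * j
    return X, Y

X, Y = triangular_vectorized(4)
```

For n = 72 the grid construction is a negligible part of the total run time either way, so this is mainly a readability choice.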

Answer 1

Optimized Numba implementation

The main problem in your code is that it repeatedly calls the external BLAS function np.dot on extremely small data. Here it makes more sense to calculate these dot products only once, but if you do have to perform such calculations in a loop, write a Numba implementation.

Example

Optimized function (brute-force)

import numpy as np
import numba as nb

@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def spin_spin(S,N):
    n= len(S)
    conf = np.reshape(S,(n**2,3))
    chi = np.zeros((N,N))
    kx = np.linspace(-5*np.pi/3,5*np.pi/3,N).astype(np.float32)
    ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N).astype(np.float32)

    # triangular must itself be compiled with @nb.njit to be callable from nopython code
    x=np.reshape(triangular(n)[0],(n**2)).astype(np.float32)
    y=np.reshape(triangular(n)[1],(n**2)).astype(np.float32)

    #precalc some values
    fact=nb.float32(2/(n**2))
    conf_dot=np.dot(conf,conf.T).astype(np.float32)

    for p in nb.prange(N):
        for m in range(N):
            #accumulating on a scalar is often beneficial
            acc=nb.float32(0)
            for i in range(n**2):
                for j in range(n**2):        
                    acc+= conf_dot[i,j]*np.cos(kx[p]*(x[i]-x[j])+ ky[m]*(y[i]-y[j]))
            chi[p,m]=fact*acc

    return(chi,kx,ky)

Optimized function (removing redundant calculations)

Many redundant calculations are performed. This is an example of how to remove them. This version also does the calculations in double precision.

@nb.njit()
def precalc(S):
    #There may not be all redundancies removed
    n= len(S)
    conf = np.reshape(S,(n**2,3))
    conf_dot=np.dot(conf,conf.T)
    x=np.reshape(triangular(n)[0],(n**2))
    y=np.reshape(triangular(n)[1],(n**2))

    x_s=set()
    y_s=set()
    for i in range(n**2):
        for j in range(n**2):
            x_s.add((x[i]-x[j]))
            y_s.add((y[i]-y[j]))

    x_arr=np.sort(np.array(list(x_s)))
    y_arr=np.sort(np.array(list(y_s)))


    conf_dot_sel=np.zeros((x_arr.shape[0],y_arr.shape[0]))
    for i in range(n**2):
        for j in range(n**2):
            ii=np.searchsorted(x_arr,x[i]-x[j])
            jj=np.searchsorted(y_arr,y[i]-y[j])
            conf_dot_sel[ii,jj]+=conf_dot[i,j]

    return x_arr,y_arr,conf_dot_sel

@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def spin_spin_opt_2(S,N):
    chi = np.empty((N,N))
    n= len(S)

    kx = np.linspace(-5*np.pi/3,5*np.pi/3,N)
    ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N)

    x_arr,y_arr,conf_dot_sel=precalc(S)
    fact=2/(n**2)
    for p in nb.prange(N):
        for m in range(N):
            acc=0.0  # float64 accumulator (this version works in double precision)
            for i in range(x_arr.shape[0]):
                for j in range(y_arr.shape[0]):        
                    acc+= fact*conf_dot_sel[i,j]*np.cos(kx[p]*x_arr[i]+ ky[m]*y_arr[j])
            chi[p,m]=acc

    return(chi,kx,ky)
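For readers without Numba, the grouping idea in precalc can be sketched in plain NumPy: pairs (i, j) with the same displacement share the same cosine, so their dot products can be summed first. The helper names `group_by_displacement` and `chi_from_groups` and the small test lattice are assumptions for illustration. As in the Numba version, displacements are matched by exact floating-point equality, so nearly equal values stay in separate groups, which costs some efficiency but not correctness.

```python
import numpy as np

def group_by_displacement(conf, x, y):
    """Sum the dot products of all pairs sharing the same (dx, dy) displacement."""
    dots = (conf @ conf.T).ravel()
    dx = (x[:, None] - x[None, :]).ravel()
    dy = (y[:, None] - y[None, :]).ravel()
    x_arr, ix = np.unique(dx, return_inverse=True)
    y_arr, iy = np.unique(dy, return_inverse=True)
    sel = np.zeros((x_arr.size, y_arr.size))
    # Unbuffered accumulation: repeated (ix, iy) index pairs are all added.
    np.add.at(sel, (ix.reshape(-1), iy.reshape(-1)), dots)
    return x_arr, y_arr, sel

def chi_from_groups(x_arr, y_arr, sel, kx, ky, n_sites):
    """Evaluate chi from the grouped sums; fine for small grids only."""
    phase = (kx[:, None, None, None] * x_arr[None, None, :, None]
             + ky[None, :, None, None] * y_arr[None, None, None, :])
    return 2.0 / n_sites * np.einsum('ab,pmab->pm', sel, np.cos(phase))

# Small triangular lattice (n = 4) with made-up random vectors.
n = 4
i, j = np.indices((n, n))
x = (0.5 * j + i).ravel()
y = (np.sqrt(3) / 2 * j).ravel()
conf = np.random.default_rng(2).normal(size=(n * n, 3))
kx = np.linspace(-5 * np.pi / 3, 5 * np.pi / 3, 5)
ky = np.linspace(-3 * np.pi / np.sqrt(3), 3 * np.pi / np.sqrt(3), 5)

x_arr, y_arr, sel = group_by_displacement(conf, x, y)
chi = chi_from_groups(x_arr, y_arr, sel, kx, ky, n * n)
```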

Timings

#brute-force
%timeit res=spin_spin(S,100)
#48 s ± 671 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#new version
%timeit res_2=spin_spin_opt_2(S,100)
#5.33 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res_2=spin_spin_opt_2(S,1000)
#1min 23s ± 2.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Edit (SVML-check)

import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def foo(n):
    x   = np.empty(n*8, dtype=np.float64)
    ret = np.empty_like(x)
    for i in range(ret.size):
        ret[i] += np.cos(x[i])
    return ret

foo(1000)

if 'intel_svmlcc' in foo.inspect_llvm(foo.signatures[0]):
    print("found")
else:
    print("not found")

#found

If this prints not found, read this link. It should work on Linux and Windows, but I haven't tested it on macOS.

Answer 2

Here is one approach to speed things up. I didn't end up using np.einsum because a little tweaking of your loops was sufficient.

The main thing slowing down your code was redundant recalculation of the same quantities. The nested loop here is the culprit:

for p in range(N):
    for m in range(N):
        for i in range(n**2):
            for j in range(n**2):
                chi[p,m] += 2/(n**2)*np.dot(conf[i],conf[j])*np.cos(kx[p]*(x[i]-x[j])+ ky[m]*(y[i]-y[j]))

It contains a lot of redundancy, recalculating vector operations many times.

Consider the np.dot(...): this calculation is completely independent of the points kx and ky. Only kx and ky are indexed by p and m. So you can run the dot products over all i and j just once and save the result, as opposed to recalculating them for each pair p,m (which would be 10,000 times!).

Similarly, there is no need to recalculate the differences between the vector positions at each point of the k-space grid. At the moment you calculate every difference at every point, when all that is needed is to calculate the differences once and then multiply the result by the kx and ky values of each point.

So, having fixed the loops and used dictionaries with the index pairs (i,j) as keys to store all the values, you can just look up the relevant value during the loop over i, j. Here is my code:

import itertools
import numpy as np

def spin_spin(S, N):
    n = len(S)
    conf = np.reshape(S,(n**2, 3))

    chi = np.zeros((N, N))
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)

    # Minor point; no need to use triangular twice
    x, y = triangular(n)
    x, y = np.reshape(x,(n**2)), np.reshape(y,(n**2))

    # Build a look-up for all the dot products to save calculating them many times
    dot_prods = dict()
    x_diffs, y_diffs = dict(), dict()
    for i, j in itertools.product(range(n**2), range(n**2)):
        dot_prods[(i, j)] = np.dot(conf[i], conf[j])
        x_diffs[(i, j)], y_diffs[(i, j)] = x[i] - x[j], y[i] - y[j]    

    # Minor point; improve syntax by converting nested for loops to one line
    for p, m in itertools.product(range(N), range(N)):
        for i, j in itertools.product(range(n**2), range(n**2)):
            # All vector operations are replaced by look ups to the dictionaries defined above
            chi[p, m] += 2/(n**2)*dot_prods[(i, j)]*np.cos(kx[p]*(x_diffs[(i, j)]) + ky[m]*(y_diffs[(i, j)]))
    return(chi, kx, ky)

I am currently running this with the dimensions you provide, on a decent machine, and the precomputation loop over i,j finishes in two minutes. That only needs to happen once; after that it is just a loop over p, m. Each one of these takes about 90 seconds, so it is still a 2-3 hour run time. I welcome any suggestions on how to optimise that cos calculation to speed it up!

I picked the low-hanging fruit of optimization, but to give a sense of the speed-up: the loop over i, j takes 2 minutes, and this way it runs 9,999 fewer times!
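One further idea for the cosine term, not taken from either answer and offered only as a sketch: the identity cos(a - b) = cos(a)cos(b) + sin(a)sin(b) lets the double sum over i, j factor into squares of single sums over i. For each k-point, the sum of S_i·S_j cos(k·(r_i - r_j)) equals, summed over the three vector components, (sum_i S_i cos(k·r_i))**2 + (sum_i S_i sin(k·r_i))**2. The function name `spin_spin_factorized` and its flattened inputs `conf`, `x`, `y` are assumptions for illustration.

```python
import numpy as np

def spin_spin_factorized(conf, x, y, kx, ky):
    """chi via the cos(a - b) identity; cost grows as N*N*n**2 rather than N*N*n**4."""
    n_sites = conf.shape[0]
    chi = np.empty((kx.size, ky.size))
    ky_y = np.outer(ky, y)                    # ky[m] * y[i], shape (M, n_sites)
    for p, kxp in enumerate(kx):
        phase = kxp * x[None, :] + ky_y       # k . r_i for one kx and every ky
        c = np.cos(phase) @ conf              # sum_i S_i cos(k . r_i), shape (M, 3)
        s = np.sin(phase) @ conf              # sum_i S_i sin(k . r_i), shape (M, 3)
        chi[p] = 2.0 / n_sites * (c**2 + s**2).sum(axis=1)
    return chi

# Tiny made-up example on a 3 x 3 triangular lattice.
n = 3
i, j = np.indices((n, n))
x = (0.5 * j + i).ravel()
y = (np.sqrt(3) / 2 * j).ravel()
conf = np.random.default_rng(3).normal(size=(n * n, 3))
kx = np.linspace(-5 * np.pi / 3, 5 * np.pi / 3, 6)
ky = np.linspace(-3 * np.pi / np.sqrt(3), 3 * np.pi / np.sqrt(3), 4)
chi = spin_spin_factorized(conf, x, y, kx, ky)
```

Looping over kx keeps the temporary arrays at shape (N, n**2), so no pairwise (n**2, n**2) array is ever stored. The result should agree with the brute-force loop up to floating-point rounding.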
