
Efficient tensor contraction with Python

I have a piece of code with a bottleneck calculation involving tensor contractions. Let's say I want to calculate a tensor A_{i,j,k,l}(X) whose non-zero entries for a single x ∈ X number N ~ 10^5, where X represents a grid with M total points, with M ~ 1000 approximately. For a single element of the tensor A, the right-hand side of the equation looks something like:

A_{i,j,k,l}(M) = Sum_{m,n,p,q} S_{i,j,m,n}(M) B_{m,n,p,q}(M) T_{p,q,k,l}(M)

In addition, the middle tensor B_{m,n,p,q}(M) is obtained by a numerical convolution of arrays, so that:

B_{m,n,p,q}(M) = ( L_{m,n} * F_{p,q} )(M)

where "*" is the convolution operator, and all tensors have approximately the same number of elements as A. My problem has to do with the efficiency of the sums: given the complexity of the problem, computing a single right-hand side of A takes a very long time. I have a "keys" system, where each tensor element is accessed by its unique key combination (e.g. (p,q,k,l) for T) taken from a dictionary. The dictionary for that specific key then gives the NumPy array associated with that key, and all operations (convolutions, multiplications, ...) are done with NumPy. I have found that the most time-consuming part is actually the nested loop: I loop over all keys (i,j,k,l) of the A tensor, and for each key a right-hand side like the one above needs to be computed. Is there any efficient way to do this? Consider that:

1) Using plain 4+1-dimensional NumPy arrays results in high memory usage, since all tensors are of complex type.

2) I have tried several approaches: Numba is quite limited when working with dictionaries, and some important NumPy features that I need are not currently supported. For instance, Numba's numpy.convolve() only takes the first 2 arguments, but does not take the "mode" argument, which in this case reduces considerably the needed convolution interval; I don't need the "full" output of the convolution.

3) My most recent approach is to implement everything in Cython for this part... But, given the logic of the code, this is quite time-consuming as well as more error-prone.
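Regarding 2): in plain NumPy, numpy.convolve does accept a mode argument that trims the output; the limitation described applies to Numba's nopython-mode implementation, which currently supports only the first two arguments. A quick illustration of how mode shrinks the result:

```python
import numpy as np

a = np.random.rand(100)
v = np.random.rand(100)

full = np.convolve(a, v)               # mode="full": length len(a)+len(v)-1 = 199
same = np.convolve(a, v, mode="same")  # trimmed to max(len(a), len(v)) = 100

print(full.shape, same.shape)  # (199,) (100,)
```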

Any ideas on how to deal with such complexity using Python?
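For concreteness, the bottleneck described above looks roughly like this minimal sketch; all names, index ranges, and grid sizes here are hypothetical stand-ins for the real code:

```python
import numpy as np

idx = range(2)  # hypothetical tiny index range; the real tensors have N ~ 10^5 entries
M = 16          # hypothetical number of grid points (the real grid has M ~ 1000)

# Each tensor element is a complex NumPy array over the grid, stored under a key tuple.
S = {(i, j, m, n): np.random.rand(M) + 0j for i in idx for j in idx for m in idx for n in idx}
T = {(p, q, k, l): np.random.rand(M) + 0j for p in idx for q in idx for k in idx for l in idx}
L = {(m, n): np.random.rand(M) + 0j for m in idx for n in idx}
F = {(p, q): np.random.rand(M) + 0j for p in idx for q in idx}

A = {}
for i in idx:
    for j in idx:
        for k in idx:
            for l in idx:
                rhs = np.zeros(M, dtype=complex)
                # the inner sum over (m, n, p, q) is the expensive part
                for m in idx:
                    for n in idx:
                        for p in idx:
                            for q in idx:
                                # B_{mnpq}(M) = (L_{mn} * F_{pq})(M), truncated via mode="same"
                                B = np.convolve(L[(m, n)], F[(p, q)], mode="same")
                                rhs += S[(i, j, m, n)] * B * T[(p, q, k, l)]
                A[(i, j, k, l)] = rhs
```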

Thanks!

You have to make your question a bit more precise, which also includes a working code example of what you have already tried. It is, for example, unclear why you use dictionaries in these tensor contractions. Dictionary lookups look like a weird thing to do for this calculation, but maybe I didn't get what you really want to do.

Tensor contraction is actually very easy to implement in Python (NumPy): there are methods to find the best way to contract the tensors, and they are really easy to use (np.einsum).

Creating some data (this should be part of the question)

import numpy as np
import time

i=20
j=20
k=20
l=20

m=20
n=20
p=20
q=20

#I don't know what complex 2 means, I assume it is complex128 (real and imaginary part are in float64)

#size of all arrays is 1.6e5
Sum_=np.random.rand(m,n,p,q).astype(np.complex128)
S_=np.random.rand(i,j,m,n).astype(np.complex128)
B_=np.random.rand(m,n,p,q).astype(np.complex128)
T_=np.random.rand(p,q,k,l).astype(np.complex128)

The naive way

This code is basically the same as writing it in loops using Cython or Numba without calling BLAS routines (ZGEMM) or optimizing the contraction order -> 8 nested loops to do the job.

t1=time.time()
A=np.einsum("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_)
print(time.time()-t1)

This results in a very slow runtime of about 330 seconds.

How to increase the speed by a factor of 7700

%timeit A=np.einsum("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal")
#42.9 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
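If the same contraction has to be evaluated repeatedly (e.g. once per grid point), the contraction path can be computed once with np.einsum_path and passed back to einsum, so the path search is not repeated on every call. A sketch with small, hypothetical dimensions and three operands:

```python
import numpy as np

d = 8  # small hypothetical dimensions for a quick demonstration
S_ = np.random.rand(d, d, d, d).astype(np.complex128)
B_ = np.random.rand(d, d, d, d).astype(np.complex128)
T_ = np.random.rand(d, d, d, d).astype(np.complex128)

# compute the optimal contraction path once...
path, info = np.einsum_path("ijmn,mnpq,pqkl->ijkl", S_, B_, T_, optimize="optimal")

# ...and reuse it for every subsequent evaluation (e.g. for each grid point)
A = np.einsum("ijmn,mnpq,pqkl->ijkl", S_, B_, T_, optimize=path)
```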

Why is this so much faster?

Let's have a look at the contraction path and the internals.

path=np.einsum_path("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal")
print(path[1])

#  Complete contraction:  mnpq,ijmn,mnpq,pqkl->ijkl
#         Naive scaling:  8
#     Optimized scaling:  6
#      Naive FLOP count:  1.024e+11
#  Optimized FLOP count:  2.562e+08
#   Theoretical speedup:  399.750
#  Largest intermediate:  1.600e+05 elements
#--------------------------------------------------------------------------
#scaling                  current                                remaining
#--------------------------------------------------------------------------
#   4             mnpq,mnpq->mnpq                     ijmn,pqkl,mnpq->ijkl
#   6             mnpq,ijmn->ijpq                          pqkl,ijpq->ijkl
#   6             ijpq,pqkl->ijkl                               ijkl->ijkl

and

path=np.einsum_path("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal",einsum_call=True)
print(path[1])

#[((2, 0), set(), 'mnpq,mnpq->mnpq', ['ijmn', 'pqkl', 'mnpq'], False), ((2, 0), {'n', 'm'}, 'mnpq,ijmn->ijpq', ['pqkl', 'ijpq'], True), ((1, 0), {'p', 'q'}, 'ijpq,pqkl->ijkl', ['ijkl'], True)]

Doing the contraction in multiple well-chosen steps reduces the required FLOPs by a factor of 400. But that's not the only thing einsum does here. Just have a look at 'mnpq,ijmn->ijpq', ['pqkl', 'ijpq'], True), ((1, 0): the True stands for a BLAS contraction -> tensordot call -> (matrix-matrix multiplication).
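The three steps of the optimized path can also be reproduced by hand, as an elementwise product followed by two tensordot calls; a sketch with small, hypothetical dimensions:

```python
import numpy as np

d = 8  # small hypothetical dimensions
Sum_ = np.random.rand(d, d, d, d).astype(np.complex128)
S_ = np.random.rand(d, d, d, d).astype(np.complex128)
B_ = np.random.rand(d, d, d, d).astype(np.complex128)
T_ = np.random.rand(d, d, d, d).astype(np.complex128)

# step 1: mnpq,mnpq->mnpq  (elementwise product, scaling 4)
C = Sum_ * B_
# step 2: mnpq,ijmn->ijpq  (contract over m,n via BLAS, scaling 6)
D = np.tensordot(S_, C, axes=([2, 3], [0, 1]))
# step 3: ijpq,pqkl->ijkl  (contract over p,q via BLAS, scaling 6)
A_manual = np.tensordot(D, T_, axes=([2, 3], [0, 1]))

# matches the one-shot optimized einsum
A_einsum = np.einsum("mnpq,ijmn,mnpq,pqkl", Sum_, S_, B_, T_, optimize="optimal")
assert np.allclose(A_manual, A_einsum)
```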

Internally, this looks basically as follows:

#consider X as a 4th order tensor {mnpq}
#consider Y as a 4th order tensor {ijmn}

X_=X.reshape(m*n,p*q)       #-> just another view on the data (2D), costs almost nothing (no copy, just a view)
Y_=Y.reshape(i*j,m*n)       #-> just another view on the data (2D), costs almost nothing (no copy, just a view)
res=np.dot(Y_,X_)           #-> dot is just a wrapper for highly optimized BLAS functions, in case of complex128 ZGEMM
output=res.reshape(i,j,p,q) #-> just another view on the data (4D), costs almost nothing (no copy, just a view)
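A runnable version of that reshape + dot trick, with small hypothetical shapes, checked against the direct contraction:

```python
import numpy as np

# hypothetical small shapes
i, j, m, n, p, q = 3, 2, 4, 5, 6, 7
X = np.random.rand(m, n, p, q) + 1j * np.random.rand(m, n, p, q)  # 4th order tensor {mnpq}
Y = np.random.rand(i, j, m, n) + 1j * np.random.rand(i, j, m, n)  # 4th order tensor {ijmn}

# 2D views + one ZGEMM call + reshape back to 4D
res = np.dot(Y.reshape(i * j, m * n), X.reshape(m * n, p * q)).reshape(i, j, p, q)

# identical to contracting over (m, n) directly
assert np.allclose(res, np.einsum("ijmn,mnpq->ijpq", Y, X))
```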
