
Numpy element-wise dot product without loop and memory error

I am dealing with a simple question in numpy. I have two lists of matrices, say A and B, encoded as 3D arrays with shapes (n,p,q) and (n,q,r) respectively.

I want to compute their element-wise dot product, that is, a 3D array C such that C[i,j,l] = sum_k A[i,j,k] * B[i,k,l]. This is very simple mathematically speaking, but here are the rules I must follow:

1) I must only use numpy functions (dot, tensordot, einsum, etc.): no loops and the like. This is because I want this to work on my GPU (with cupy), and loops are awful there. I want all operations to be performed on the current device.

2) Since my data can be quite large (typically A and B already take a few dozen MB in memory), I don't want to build any items with bigger shapes than (n,p,q), (n,q,r), (n,p,r) (no intermediate 4D array may be stored).
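For concreteness, here is a small reference sketch of the target computation (the shapes are made up for illustration), spelled out with explicit loops purely to define what C should contain:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 4, 3, 5, 2
A = rng.standard_normal((n, p, q))
B = rng.standard_normal((n, q, r))

# Definition: C[i, j, l] = sum_k A[i, j, k] * B[i, k, l]
C = np.empty((n, p, r))
for i in range(n):
    for j in range(p):
        for l in range(r):
            C[i, j, l] = A[i, j, :] @ B[i, :, l]

# Sanity check against the batched operator
assert np.allclose(C, A @ B)
```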

For example, the solution I have found there, which uses:

C = np.sum(np.transpose(A,(0,2,1)).reshape(n,p,q,1)*B.reshape(n,q,1,r),-3)

is mathematically correct, but implies the intermediate creation of an (n,p,q,r) array, which is too big for my purposes.
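A quick back-of-the-envelope estimate (with hypothetical shapes chosen only to illustrate the scale) shows why that 4D temporary is the problem: it is q times larger than any of the allowed arrays.

```python
# Size of the temporary (n, p, q, r) float64 array for illustrative shapes
n, p, q, r = 10_000, 32, 32, 32
temp_bytes = n * p * q * r * 8       # float64 = 8 bytes per element
out_bytes = n * p * r * 8            # the (n, p, r) result C itself
print(temp_bytes / 1e9, "GB vs.", out_bytes / 1e6, "MB")
```

Here the broadcasted temporary would need about 2.6 GB, while C itself is only about 82 MB.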

I had similar trouble with something like:

C = np.einsum('ipq,iqr->ipr',A,B)

I don't know what the underlying operations and constructions are, but it always leads to a memory error.

On the other hand, something a bit naive like:

C = np.array([A[i].dot(B[i]) for i in range(n)])

seems OK in terms of memory, but is not efficient on my GPU: the list seems to be built on the CPU, and re-allocating it to the GPU is slow (if there is a cupy-friendly way to write that, it would be a nice solution!).

Thank you for your help!

You want numpy.matmul (cupy version here). matmul is a "broadcasting" matrix multiply.
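A minimal sketch of what that looks like for the shapes in the question (the concrete numbers are made up): matmul treats the leading axis as a batch dimension and multiplies the trailing (p,q) and (q,r) matrices pairwise, producing (n,p,r) with no 4D intermediate.

```python
import numpy as np

n, p, q, r = 100, 4, 5, 6
A = np.ones((n, p, q))
B = np.ones((n, q, r))

# matmul broadcasts over the leading axis: one (p,q) @ (q,r) product per i
C = np.matmul(A, B)        # equivalently: C = A @ B
print(C.shape)             # (100, 4, 6)
```

The same `A @ B` spelling works on cupy arrays, so the whole computation stays on the device.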

I think folks have known that the numpy.dot semantics are wonky and that a broadcasting matrix multiply was needed, but there wasn't much momentum to introduce the change until Python got the @ operator. I don't see dot going anywhere, but I suspect the better semantics and the ease of writing A @ B will mean that dot falls out of favor as folks discover the new function and operator.

The iterative method that you seek to avoid might not be so bad. Consider, for example, these timings:

In [51]: A = np.ones((100,10,10))
In [52]: timeit np.array([A[i].dot(A[i]) for i in range(A.shape[0])])
439 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [53]: timeit np.einsum('ipq,iqr->ipr',A,A)
428 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [54]: timeit A@A
426 µs ± 54.6 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For this case all three take about the same time.

But when I double the later dimensions, the iterative approach is actually faster:

In [55]: A = np.ones((100,20,20))
In [56]: timeit np.array([A[i].dot(A[i]) for i in range(A.shape[0])])
702 µs ± 1.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [57]: timeit np.einsum('ipq,iqr->ipr',A,A)
1.89 ms ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [58]: timeit A@A
1.89 ms ± 490 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The same pattern holds when I change the 20 to 30 and 40. I'm mildly surprised that the matmul times match einsum so closely.

I suppose I could try pushing these to memory limits. I don't have a fancy backend to test that aspect.

A modest number of iterations over a large problem isn't so horrible once you take memory-management issues into account. The thing you want to avoid in numpy is many iterations over a simple task.
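One middle ground, sketched here as an assumption rather than something from the answer above, is to iterate over modest chunks of the batch axis while writing into a preallocated result via matmul's `out=` parameter, so peak memory stays bounded by the three allowed arrays:

```python
import numpy as np

def batched_matmul_chunked(A, B, chunk=1024):
    """Compute C[i] = A[i] @ B[i] in chunks along the batch axis."""
    n, p, _ = A.shape
    r = B.shape[2]
    C = np.empty((n, p, r), dtype=np.result_type(A, B))
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Each matmul call writes directly into a view of C: no big temporary
        np.matmul(A[start:stop], B[start:stop], out=C[start:stop])
    return C

A = np.ones((5000, 8, 8))
B = np.ones((5000, 8, 8))
assert np.allclose(batched_matmul_chunked(A, B), A @ B)
```

With a few thousand items per chunk this is only a handful of iterations, which is exactly the "modest number of iterations over a large problem" case described above.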
