Numpy element-wise dot product without loop and memory error

I am dealing with a simple problem in numpy. I have two lists of matrices - say A and B - encoded as 3D arrays with shapes (n,p,q) and (n,q,r) respectively.

I want to compute their element-wise dot product, that is, a 3D array C such that C[i,j,l] = sum_k A[i,j,k] * B[i,k,l]. This is very simple mathematically speaking, but here are the rules I must follow:

1) I must only use numpy functions (dot, tensordot, einsum, etc.): no loops and the like. This is because I want this to work on my GPU (with cupy), and loops are awful there. I want all operations to be performed on the current device.

2) Since my data can be quite large - typically A and B already take a few dozen MB in memory - I don't want to build any array with a shape bigger than (n,p,q), (n,q,r), or (n,p,r) (no intermediate 4D array must be stored).

For example, a solution I found elsewhere, namely:

C = np.sum(np.transpose(A,(0,2,1)).reshape(n,q,p,1)*B.reshape(n,q,1,r),-3)

is mathematically correct, but implies the intermediate creation of a 4D array with n*p*q*r elements, which is too big for my purposes.
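To make the scale concrete, here is a back-of-the-envelope calculation (the sizes are hypothetical, chosen only for illustration): with n=100 and p=q=r=128 in float64, each 3D array takes about 13 MB, while the 4D intermediate takes about 1.7 GB.

n, p, q, r = 100, 128, 128, 128        # hypothetical sizes for illustration
itemsize = 8                           # bytes per float64 element
print(n * p * q * itemsize / 1e6)      # one 3D array: ~13.1 MB
print(n * p * q * r * itemsize / 1e9)  # 4D intermediate: ~1.68 GB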

I had similar trouble with something like

C = np.einsum('ipq,iqr->ipr',A,B)

I don't know what the underlying operations and constructions are, but it always leads to a memory error.
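One way to inspect what einsum plans to allocate (a diagnostic sketch, not part of the original post) is np.einsum_path, which reports the contraction order and the size of the largest intermediate:

import numpy as np

A = np.ones((100, 20, 20))
B = np.ones((100, 20, 20))

# The returned report includes a "Largest intermediate" line.
path, info = np.einsum_path('ipq,iqr->ipr', A, B, optimize='greedy')
print(info)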

On the other hand, something a bit naive like:

C = np.array([A[i].dot(B[i]) for i in range(n)])

seems OK in terms of memory but is not efficient on my GPU: the list is apparently built on the CPU, and transferring it back to the GPU is slow (if there is a cupy-friendly way to write that, it would be a nice solution!).

Thank you for your help!

You want numpy.matmul (cupy provides an equivalent, cupy.matmul). matmul is a "broadcasting" matrix multiply.
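Here is a minimal sketch of the batched multiply, with shapes matching the question (the cupy lines are commented out and assume a CUDA device is available):

import numpy as np

n, p, q, r = 100, 10, 12, 8
A = np.random.rand(n, p, q)
B = np.random.rand(n, q, r)

# matmul treats the leading axis as a batch dimension: C[i] = A[i] @ B[i]
C = np.matmul(A, B)   # shape (n, p, r)
C = A @ B             # the same call via the @ operator

# With cupy, the same call runs entirely on the device:
# import cupy as cp
# C_gpu = cp.matmul(cp.asarray(A), cp.asarray(B))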

I think folks have known that the numpy.dot semantics are wonky and that a broadcasting matrix multiply was needed, but there wasn't much momentum to introduce the change until Python got the @ operator. I don't see dot going anywhere, but I suspect the better semantics and the ease of writing A @ B will mean that dot falls out of favor as folks discover the new function and operator.

The iterative method that you seek to avoid might not be so bad. Consider, for example, these timings:

In [51]: A = np.ones((100,10,10))
In [52]: timeit np.array([A[i].dot(A[i]) for i in range(A.shape[0])])
439 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [53]: timeit np.einsum('ipq,iqr->ipr',A,A)
428 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [54]: timeit A@A
426 µs ± 54.6 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For this case all three take about the same time.

But when I double the last two dimensions, the iterative approach is actually faster:

In [55]: A = np.ones((100,20,20))
In [56]: timeit np.array([A[i].dot(A[i]) for i in range(A.shape[0])])
702 µs ± 1.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [57]: timeit np.einsum('ipq,iqr->ipr',A,A)
1.89 ms ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [58]: timeit A@A
1.89 ms ± 490 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The same pattern holds when I change the 20 to 30 and 40. I'm mildly surprised that the matmul times match einsum so closely.

I suppose I could try pushing these to the memory limits. I don't have a fancy backend to test that aspect.

A modest number of iterations over a large problem isn't so horrible once you take memory management issues into account. The thing you want to avoid, in numpy, is many iterations over a simple task; a chunked variant is sketched below.
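If memory pressure is the real constraint, one compromise (a sketch, assuming the output C fits in memory; chunked_matmul and the chunk size are illustrative names, not from the original answer) is to iterate over blocks of the batch axis, so each Python-level iteration does one large matmul:

import numpy as np

def chunked_matmul(A, B, chunk=16):
    # A few large matmul calls instead of n tiny dot calls:
    # each call only touches chunk-sized slices of A, B, and C.
    n, p, q = A.shape
    r = B.shape[2]
    C = np.empty((n, p, r), dtype=np.result_type(A, B))
    for start in range(0, n, chunk):
        s = slice(start, start + chunk)
        np.matmul(A[s], B[s], out=C[s])   # writes the block in place
    return C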
