Numpy element-wise dot product without loop and memory error

I am dealing with a simple problem in numpy. I have two lists of matrices - say A and B - encoded as 3D arrays with shapes (n,p,q) and (n,q,r) respectively.

I want to compute their element-wise dot product, that is, a 3D array C such that C[i,j,l] = sum_k A[i,j,k] * B[i,k,l]. This is very simple mathematically speaking, but here are the rules I must follow:

1) I must only use numpy functions (dot, tensordot, einsum, etc.): no loops and the like. This is because I want this to work on my GPU (with cupy), and loops are awful there. I want all operations to be performed on the current device.

2) Since my data can be quite large - typically A and B already take a few dozen MB in memory - I don't want to build any array with a shape bigger than (n,p,q), (n,q,r), or (n,p,r) (no intermediate 4D array must be stored).

For example, a solution I found elsewhere, namely:

C = np.sum(np.transpose(A,(0,2,1)).reshape(n,q,p,1)*B.reshape(n,q,1,r),-3)

is mathematically correct, but implies the intermediate creation of a 4D array with n*p*q*r elements, which is too big for my purposes.
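To make the scale concrete, here is a back-of-the-envelope calculation (the sizes are hypothetical, chosen only for illustration): with n=100 and p=q=r=128 in float64, each 3D array takes about 13 MB, while the 4D intermediate takes about 1.7 GB.

n, p, q, r = 100, 128, 128, 128        # hypothetical sizes for illustration
itemsize = 8                           # bytes per float64 element
print(n * p * q * itemsize / 1e6)      # one 3D array: ~13.1 MB
print(n * p * q * r * itemsize / 1e9)  # 4D intermediate: ~1.68 GB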

I had similar trouble with something like

C = np.einsum('ipq,iqr->ipr',A,B)

I don't know what the underlying operations and constructions are, but it always leads to a memory error.
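One way to inspect what einsum plans to allocate (a diagnostic sketch, not part of the original post) is np.einsum_path, which reports the contraction order and the size of the largest intermediate:

import numpy as np

A = np.ones((100, 20, 20))
B = np.ones((100, 20, 20))

# The returned report includes a "Largest intermediate" line.
path, info = np.einsum_path('ipq,iqr->ipr', A, B, optimize='greedy')
print(info)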

On the other hand, something a bit naive like:

C = np.array([A[i].dot(B[i]) for i in range(n)])

seems OK in terms of memory but is not efficient on my GPU: the list is apparently built on the CPU, and transferring it back to the GPU is slow (if there is a cupy-friendly way to write that, it would be a nice solution!).

Thank you for your help!

You want numpy.matmul (cupy provides an equivalent, cupy.matmul). matmul is a "broadcasting" matrix multiply.
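Here is a minimal sketch of the batched multiply, with shapes matching the question (the cupy lines are commented out and assume a CUDA device is available):

import numpy as np

n, p, q, r = 100, 10, 12, 8
A = np.random.rand(n, p, q)
B = np.random.rand(n, q, r)

# matmul treats the leading axis as a batch dimension: C[i] = A[i] @ B[i]
C = np.matmul(A, B)   # shape (n, p, r)
C = A @ B             # the same call via the @ operator

# With cupy, the same call runs entirely on the device:
# import cupy as cp
# C_gpu = cp.matmul(cp.asarray(A), cp.asarray(B))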

I think folks have known that the numpy.dot semantics are wonky and that a broadcasting matrix multiply was needed, but there wasn't much momentum to introduce the change until Python got the @ operator. I don't see dot going anywhere, but I suspect the better semantics and the ease of writing A @ B will mean that dot falls out of favor as folks discover the new function and operator.

The iterative method that you seek to avoid might not be so bad. Consider, for example, these timings:

In [51]: A = np.ones((100,10,10))
In [52]: timeit np.array([A[i].dot(A[i]) for i in range(A.shape[0])])
439 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [53]: timeit np.einsum('ipq,iqr->ipr',A,A)
428 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [54]: timeit A@A
426 µs ± 54.6 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For this case all three take about the same time.

But when I double the last two dimensions, the iterative approach is actually faster:

In [55]: A = np.ones((100,20,20))
In [56]: timeit np.array([A[i].dot(A[i]) for i in range(A.shape[0])])
702 µs ± 1.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [57]: timeit np.einsum('ipq,iqr->ipr',A,A)
1.89 ms ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [58]: timeit A@A
1.89 ms ± 490 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The same pattern holds when I change the 20 to 30 and 40. I'm mildly surprised that the matmul times match einsum so closely.

I suppose I could try pushing these to the memory limits. I don't have a fancy backend to test that aspect.

A modest number of iterations over a large problem isn't so horrible once you take memory management issues into account. The thing you want to avoid, in numpy, is many iterations over a simple task; a chunked variant is sketched below.
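If memory pressure is the real constraint, one compromise (a sketch, assuming the output C fits in memory; chunked_matmul and the chunk size are illustrative names, not from the original answer) is to iterate over blocks of the batch axis, so each Python-level iteration does one large matmul:

import numpy as np

def chunked_matmul(A, B, chunk=16):
    # A few large matmul calls instead of n tiny dot calls:
    # each call only touches chunk-sized slices of A, B, and C.
    n, p, q = A.shape
    r = B.shape[2]
    C = np.empty((n, p, r), dtype=np.result_type(A, B))
    for start in range(0, n, chunk):
        s = slice(start, start + chunk)
        np.matmul(A[s], B[s], out=C[s])   # writes the block in place
    return C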
