We are looking for the most time-efficient way to implement a matrix multiplication G * m, where G is a large matrix of size 3N x 3510 and m is a vector of length 3510. G is in general very large, so, due to memory problems, we decomposed G into two smaller matrices G_d and G_s such that G is a special outer product of them. We then use the numpy.einsum function to calculate G * m.
import numpy as np
import time

def func1(m, G_d, G_s):
    start = time.time()
    C = np.reshape(m, (np.shape(G_s)[1], np.shape(G_d)[1]), order='F')
    res = np.einsum('ik, ij, jk -> i', G_d, G_s, C, optimize=True)
    end = time.time()
    print(end - start)
    return res
However, we also tested saving (here a smaller) G and using the numpy.matmul function directly.
def get_matrix(G_d, G_s):
    G = G_d[:, :, None] * G_s[:, None, :]
    G = np.reshape(G, (np.shape(G_d)[0], -1))
    return G
def func2(m):
    start = time.time()
    global G
    res = np.matmul(G, m)
    end = time.time()
    print(end - start)
    return res
Using
G_d = np.random.rand(60000, 195)
G_s = np.random.rand(60000, 18)
m = np.random.rand(3510)
G = get_matrix(G_d, G_s)
print(func1(m, G_d, G_s))
print(func2(m))
one gets 0.03707623481750488 seconds for func1 and 0.12322783470153809 seconds for func2. This means that einsum is much faster than matmul here (the results of the functions are indeed equal; I omit them here).
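The equality can be confirmed with a quick check, added here as a sketch (it reuses the arrays defined above; each call also prints its timing):

print(np.allclose(func1(m, G_d, G_s), func2(m)))   # expected: True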
Concerning this point, I came across this discussion. I also wanted to check the time needed for the matrix multiplication with einsum, using
def func3(m):
    start = time.time()
    global G
    res = np.einsum('ij, j -> i', G, m, optimize=True)
    end = time.time()
    print(end - start)
    return res
This gives 0.1230616569519043 seconds, similar to func2. My question is: Why is func1 so much faster than func2 and func3?
My colleague and I are wondering about this point, since we estimated that func2 and func3 use far fewer floating-point operations than func1. So our prediction was that func1 should be slower than the other two, but this is not the case.
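To make that estimate concrete, here is a back-of-envelope count (a sketch of mine, following the multiply-add counting convention used by np.einsum_path; note 3510 = 195 * 18):

N, K, J = 60000, 195, 18            # rows, columns of G_d, columns of G_s

# func2/func3: one (N x K*J) @ (K*J,) matvec
flops_matvec = 2 * N * K * J        # ~4.2e8

# func1 without optimization: two multiplies and one add per (i, j, k)
flops_naive = 3 * N * K * J         # ~6.3e8

# func1 with optimize=True: contract G_d with C first,
# then a row-wise reduction against G_s
flops_optimized = 2 * N * K * J + 2 * N * J   # ~4.2e8

print(flops_matvec, flops_naive, flops_optimized)

(By this convention, func1 with optimize=True ends up doing about as much arithmetic as the plain matvec.)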
It has been shown in other SO posts that memory management issues can chew into the performance of matmul for large arrays. Usually the alternative is to split that 60000 dimension into chunks.
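A chunked matvec might look like this (a sketch of that idea, not timed here; the chunk size 4096 is an arbitrary choice):

def matvec_chunked(G, m, chunk=4096):
    # Walk the large leading dimension in blocks, so each
    # G[start:start+chunk] @ m works on a memory-friendly slice
    out = np.empty(G.shape[0])
    for start in range(0, G.shape[0], chunk):
        out[start:start + chunk] = G[start:start + chunk] @ m
    return out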
But to focus more on the number of calculations, let's experiment with a smaller 'batch' size, 100.
In [59]: G.shape
Out[59]: (100, 3510)
In [60]: C = np.reshape(m, (np.shape(G_s)[1], np.shape(G_d)[1]), order='F')
In [61]: C.shape
Out[61]: (18, 195)
calculations:
In [62]: x1=G@m
In [63]: x1.shape
Out[63]: (100,)
In [64]: x2=np.einsum('ij,j',G,m)
In [65]: np.allclose(x1,x2)
Out[65]: True
In [66]: x3=np.einsum('ik,ij,jk->i',G_d,G_s,C)
In [67]: np.allclose(x1,x3)
Out[67]: True
timings:
In [68]: timeit x1=G@m
106 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [69]: timeit x2=np.einsum('ij,j',G,m)
257 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [70]: timeit x3=np.einsum('ik,ij,jk->i',G_d,G_s,C)
1.22 ms ± 326 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [71]: timeit x3=np.einsum('ik,ij,jk->i',G_d,G_s,C, optimize=True)
309 µs ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
At this size, the straightforward matmul does best. The arrays are passed with little change to a fast BLAS-like library.
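(As an aside, which BLAS/LAPACK build a given numpy links against can be inspected, in case timings differ across installations:)

np.show_config()   # prints the BLAS/LAPACK libraries numpy was built with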
optimize=True allows einsum to perform the calculations in an optimal order:
In [72]: np.einsum_path('ik,ij,jk->i',G_d,G_s,C, optimize=True)
...:
Out[72]:
(['einsum_path', (0, 2), (0, 1)],
...
Complete contraction: ik,ij,jk->i
Naive scaling: 3
Optimized scaling: 3
Naive FLOP count: 1.053e+06
Optimized FLOP count: 7.056e+05
Theoretical speedup: 1.492
Largest intermediate: 1.800e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
3 jk,ik->ji ij,ji->i
2 ji,ij->i
So it's doing, in effect, this einsum sequence:
In [94]: timeit x3=np.einsum('ij,ij->i',G_s,np.einsum('ik,jk->ij',G_d,C))
301 µs ± 83.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The matmul equivalent is:
In [81]: x4=np.squeeze(G_s[:,None,:]@(G_d@C.T)[:,:,None])
In [82]: x4.shape
Out[82]: (100,)
In [83]: np.allclose(x1,x4)
Out[83]: True
In [84]: timeit x4=np.squeeze(G_s[:,None,:]@(G_d@C.T)[:,:,None])
114 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using G_d first reduces the dimensions fastest: (100,195) => (100,18) => (100,).
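That order can be wrapped up as a full-size routine (a sketch of mine combining the matmul first stage with the row-wise einsum reduction; the function name is hypothetical):

def fast_Gm(m, G_d, G_s):
    # First shrink the wide axis with a plain matmul,
    # (N, 195) @ (195, 18) -> (N, 18), then reduce each
    # row against G_s to get the (N,) result
    C = np.reshape(m, (G_s.shape[1], G_d.shape[1]), order='F')
    return np.einsum('ij,ij->i', G_s, G_d @ C.T)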
I won't take time now to explore larger arrays, and the memory management effect.