We are looking for the most time-efficient way to implement a matrix multiplication G * m, where G is a large matrix of size 3N x 3510 and m is a vector of length 3510. G is in general very large, so, due to memory problems, we decomposed G into two smaller matrices G_d and G_s such that G is a special outer product of them. We then use the numpy.einsum function to calculate G * m.
import numpy as np
import time

def func1(m, G_d, G_s):
    start = time.time()
    C = np.reshape(m, (np.shape(G_s)[1], np.shape(G_d)[1]), order='F')
    res = np.einsum('ik, ij, jk -> i', G_d, G_s, C, optimize=True)
    end = time.time()
    print(end - start)
    return res
However, we also tested saving (here a smaller) G and using the numpy.matmul function directly.
def get_matrix(G_d, G_s):
    G = G_d[:, :, None] * G_s[:, None, :]
    G = np.reshape(G, (np.shape(G_d)[0], -1))
    return G
def func2(m):
    start = time.time()
    global G
    res = np.matmul(G, m)
    end = time.time()
    print(end - start)
    return res
Using
G_d = np.random.rand(60000, 195)
G_s = np.random.rand(60000, 18)
m = np.random.rand(3510)
G = get_matrix(G_d, G_s)
print(func1(m, G_d, G_s))
print(func2(m))
one gets 0.03707623481750488 seconds for func1 and 0.12322783470153809 seconds for func2. This means that einsum is much faster than matmul here (the results of the functions are indeed equal; I omit them here).
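The equality can be confirmed with a quick check, added here as a sketch (it reuses the arrays defined above; each call also prints its timing):

print(np.allclose(func1(m, G_d, G_s), func2(m)))   # expected: True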
Concerning this point, I came across this discussion. I also wanted to check the time needed for the matrix multiplication with einsum, using
def func3(m):
    start = time.time()
    global G
    res = np.einsum('ij, j -> i', G, m, optimize=True)
    end = time.time()
    print(end - start)
    return res
This gives 0.1230616569519043 seconds, similar to func2. My question is: Why is func1 so much faster than func2 and func3?
My colleague and I are wondering about this point, since we estimated that func2 and func3 use far fewer floating-point operations than func1. So our prediction was that func1 should be slower than the other two, but this is not the case.
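To make that estimate concrete, here is a back-of-envelope count (a sketch of mine, following the multiply-add counting convention used by np.einsum_path; note 3510 = 195 * 18):

N, K, J = 60000, 195, 18            # rows, columns of G_d, columns of G_s

# func2/func3: one (N x K*J) @ (K*J,) matvec
flops_matvec = 2 * N * K * J        # ~4.2e8

# func1 without optimization: two multiplies and one add per (i, j, k)
flops_naive = 3 * N * K * J         # ~6.3e8

# func1 with optimize=True: contract G_d with C first,
# then a row-wise reduction against G_s
flops_optimized = 2 * N * K * J + 2 * N * J   # ~4.2e8

print(flops_matvec, flops_naive, flops_optimized)

(By this convention, func1 with optimize=True ends up doing about as much arithmetic as the plain matvec.)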
It has been shown in other SO posts that memory management issues can chew into the performance of matmul for large arrays. Usually the alternative is to split that 60000 dimension into chunks.
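A chunked matvec might look like this (a sketch of that idea, not timed here; the chunk size 4096 is an arbitrary choice):

def matvec_chunked(G, m, chunk=4096):
    # Walk the large leading dimension in blocks, so each
    # G[start:start+chunk] @ m works on a memory-friendly slice
    out = np.empty(G.shape[0])
    for start in range(0, G.shape[0], chunk):
        out[start:start + chunk] = G[start:start + chunk] @ m
    return out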
But to focus more on the number of calculations, let's experiment with a smaller 'batch' size, 100.
In [59]: G.shape
Out[59]: (100, 3510)
In [60]: C = np.reshape(m, (np.shape(G_s)[1], np.shape(G_d)[1]), order='F')
In [61]: C.shape
Out[61]: (18, 195)
calculations:
In [62]: x1=G@m
In [63]: x1.shape
Out[63]: (100,)
In [64]: x2=np.einsum('ij,j',G,m)
In [65]: np.allclose(x1,x2)
Out[65]: True
In [66]: x3=np.einsum('ik,ij,jk->i',G_d,G_s,C)
In [67]: np.allclose(x1,x3)
Out[67]: True
timings:
In [68]: timeit x1=G@m
106 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [69]: timeit x2=np.einsum('ij,j',G,m)
257 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [70]: timeit x3=np.einsum('ik,ij,jk->i',G_d,G_s,C)
1.22 ms ± 326 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [71]: timeit x3=np.einsum('ik,ij,jk->i',G_d,G_s,C, optimize=True)
309 µs ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
At this size, the straightforward matmul does best. The arrays are passed with little change to a fast BLAS-like library.
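(As an aside, which BLAS/LAPACK build a given numpy links against can be inspected, in case timings differ across installations:)

np.show_config()   # prints the BLAS/LAPACK libraries numpy was built with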
optimize=True allows einsum to perform the calculations in an optimal order:
In [72]: np.einsum_path('ik,ij,jk->i',G_d,G_s,C, optimize=True)
...:
Out[72]:
(['einsum_path', (0, 2), (0, 1)],
...
Complete contraction: ik,ij,jk->i
Naive scaling: 3
Optimized scaling: 3
Naive FLOP count: 1.053e+06
Optimized FLOP count: 7.056e+05
Theoretical speedup: 1.492
Largest intermediate: 1.800e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
3 jk,ik->ji ij,ji->i
2 ji,ij->i
So it's doing, in effect, this einsum sequence:
In [94]: timeit x3=np.einsum('ij,ij->i',G_s,np.einsum('ik,jk->ij',G_d,C))
301 µs ± 83.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The matmul equivalent is:
In [81]: x4=np.squeeze(G_s[:,None,:]@(G_d@C.T)[:,:,None])
In [82]: x4.shape
Out[82]: (100,)
In [83]: np.allclose(x1,x4)
Out[83]: True
In [84]: timeit x4=np.squeeze(G_s[:,None,:]@(G_d@C.T)[:,:,None])
114 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using G_d first reduces the dimensions fastest: (100,195) => (100,18) => (100,).
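That order can be wrapped up as a full-size routine (a sketch of mine combining the matmul first stage with the row-wise einsum reduction; the function name is hypothetical):

def fast_Gm(m, G_d, G_s):
    # First shrink the wide axis with a plain matmul,
    # (N, 195) @ (195, 18) -> (N, 18), then reduce each
    # row against G_s to get the (N,) result
    C = np.reshape(m, (G_s.shape[1], G_d.shape[1]), order='F')
    return np.einsum('ij,ij->i', G_s, G_d @ C.T)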
I won't take time now to explore larger arrays, and the memory management effect.