Speed-up numpy matrix multiplication using cython

I am computing a matrix multiplication a few thousand times during my algorithm. Therefore, I compute:

import numpy as np
import time


def mat_mul(mat1, mat2, mat3, mat4):
    # Computes mat1.T @ (diag(mat2)[:, None] * mat3) + mat4
    return np.dot(np.transpose(mat1), np.multiply(np.diag(mat2)[:, None], mat3)) + mat4

n = 2000
mat1 = np.random.rand(n, n)
mat2 = np.diag(np.random.rand(n))  # diagonal matrix: off-diagonal entries are zero
mat3 = np.random.rand(n, n)
mat4 = np.random.rand(n, n)

t0 = time.time()
cov_11 = mat_mul(mat1, mat2, mat1, mat4)
t1 = time.time()
print('time ', t1 - t0, 's')

The matrices are of size (2000, 2000), and mat2 only has entries along its diagonal; the remaining entries are zero.

On my machine I get the following: time 0.3473696708679199 s

How can I speed this up?

Thanks.

The NumPy implementation can be optimized a bit by reducing the number of temporary arrays and reusing them as much as possible (i.e. multiple times). Indeed, while matrix multiplications are generally heavily optimized by BLAS implementations, filling/copying (newly allocated) arrays adds a non-negligible overhead.

Here is the implementation:

def mat_mul_opt(mat1, mat2, mat3, mat4):
    n = mat3.shape[0]
    tmp1 = np.empty((n, n))
    tmp2 = np.empty((n, n))
    vect = np.diag(mat2)[:, None]
    # Write intermediate results into pre-allocated buffers (out=) so that
    # each step avoids allocating and filling a new temporary array.
    np.dot(np.transpose(mat1), np.multiply(vect, mat3, out=tmp1), out=tmp2)
    np.add(mat4, tmp2, out=tmp1)
    return tmp1
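
As a quick sanity check, the optimized version can be compared against the original on the question's data (a small illustrative snippet, using the variable names from the question):

# Verify that the optimized version matches the original (illustrative check)
res_ref = mat_mul(mat1, mat2, mat1, mat4)
res_opt = mat_mul_opt(mat1, mat2, mat1, mat4)
print(np.allclose(res_ref, res_opt))  # should print True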

The code can be optimized further if it is fine to mutate input matrices or if you can pre-allocate tmp1 and tmp2 outside the function once (and then reuse them multiple times). Here is an example:

def mat_mul_opt2(mat1, mat2, mat3, mat4, tmp1, tmp2):
    # tmp1 and tmp2 are (n, n) scratch buffers allocated once by the caller
    vect = np.diag(mat2)[:, None]
    np.dot(np.transpose(mat1), np.multiply(vect, mat3, out=tmp1), out=tmp2)
    np.add(mat4, tmp2, out=tmp1)
    return tmp1
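
For completeness, here is a minimal usage sketch (the loop is illustrative, standing in for the few thousand iterations of the algorithm): the buffers are allocated once and then reused on every call. Note that the returned array aliases tmp1, so copy it if it must survive the next call.

# Allocate the scratch buffers once, reuse them on every iteration
tmp1 = np.empty((n, n))
tmp2 = np.empty((n, n))

for _ in range(10):  # stand-in for the algorithm's iterations
    cov_11 = mat_mul_opt2(mat1, mat2, mat1, mat4, tmp1, tmp2)
    # cov_11 is the same array as tmp1; use np.copy(cov_11) if it must be kept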

Here are the performance results on my i5-9600KF processor (6 cores):

mat_mul:                 103.6 ms
mat_mul_opt:              96.7 ms
mat_mul_opt2:             83.5 ms
np.dot time only:         74.4 ms   (a kind of practical lower bound)
Optimal lower bound:      55   ms   (quite optimistic)
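
The "np.dot time only" figure can be reproduced roughly as follows (a sketch using timeit, not necessarily the exact benchmark behind the numbers above):

import timeit

# Time only the matrix product, which dominates the cost of mat_mul
t = timeit.timeit(lambda: np.dot(np.transpose(mat1), mat3), number=10) / 10
print(f'np.dot only: {t * 1e3:.1f} ms')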

Cython is not going to speed this up, simply because NumPy already uses other tricks to speed things up, such as threading and SIMD (via its BLAS backend); anyone who tries to implement such a function with Cython alone is going to end up with much worse performance.

Only two things are really possible:

  1. use a GPU-based version of numpy (cupy); a sketch follows this list
  2. use a different, more optimized numpy backend if you aren't already using the best one (such as Intel MKL)
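
As an illustration of the first option, here is a minimal sketch assuming cupy is installed and a CUDA GPU is available; cupy mirrors the numpy API, so the function body barely changes:

import cupy as cp

def mat_mul_gpu(mat1, mat2, mat3, mat4):
    # Same computation as mat_mul, executed on the GPU via cupy
    return cp.dot(cp.transpose(mat1), cp.multiply(cp.diag(mat2)[:, None], mat3)) + mat4

# Transfer the inputs to the GPU once and reuse them across iterations
g1, g2, g3, g4 = (cp.asarray(m) for m in (mat1, mat2, mat3, mat4))
cov_11 = cp.asnumpy(mat_mul_gpu(g1, g2, g3, g4))  # copy the result back to the host if needed

Keep the data on the GPU between calls; transferring arrays back and forth on every iteration can easily cost more than the multiplication itself.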
