I am computing a matrix multiplication at few thousand times during my algorithm. Therefore, I compute:
import numpy as np
import time
def mat_mul(mat1, mat2, mat3, mat4):
return(np.dot(np.transpose(mat1),np.multiply(np.diag(mat2)[:,None], mat3))+mat4)
n = 2000
mat1 = np.random.rand(n,n)
mat2 = np.diag(np.random.rand(n))
mat3 = np.random.rand(n,n)
mat4 = np.random.rand(n,n)
t0=time.time()
cov_11=mat_mul(mat1, mat2, mat1, mat4)
t1=time.time()
print('time ',t1-t0, 's')
The matrices are of size: n = (2000,2000) and mat2 only has entries along its diagonal. The remaining entries are zero.
On my machine I get the following: time 0.3473696708679199 s
How can I speed this up?
Thanks.
The Numpy implementation can be optimized a bit by reducing the amount of temporary arrays and reuse them as much as possible (ie. multiple times). Indeed, while matrix multiplications are generally heavily-optimized by BLAS implementations, filling/copying (newly allocated) arrays add a non-negligible overhead.
Here is the implementation:
def mat_mul_opt(mat1, mat2, mat3, mat4):
tmp1 = np.empty((n,n))
tmp2 = np.empty((n,n))
vect = np.diag(mat2)[:,None]
np.dot(np.transpose(mat1),np.multiply(vect, mat3, out=tmp1), out=tmp2)
np.add(mat4, tmp2, out=tmp1)
return tmp1
The code can be optimized further if it is fine to mutate input matrices or if you can pre-allocate tmp1
and tmp2
outside the function once (and then reuse them multiple times). Here is an example:
def mat_mul_opt2(mat1, mat2, mat3, mat4, tmp1, tmp2):
vect = np.diag(mat2)[:,None]
np.dot(np.transpose(mat1),np.multiply(vect, mat3, out=tmp1), out=tmp2)
np.add(mat4, tmp2, out=tmp1)
return tmp1
Here are performance results on my i5-9600KF processor (6-cores):
mat_mul: 103.6 ms
mat_mul_opt1: 96.7 ms
mat_mul_opt2: 83.5 ms
np.dot time only: 74.4 ms (kind of practical lower-bound)
Optimal lower bound: 55 ms (quite optimistic)
cython is not going to speed it up, simply because numpy is using other tricks to speed things up like threading and SIMD, anyone that tries to implement such function with only cython is going to end up with much worse performance.
only 2 things are possible:
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.