
Speed-up numpy matrix multiplication using cython

I am computing a matrix multiplication a few thousand times during my algorithm. Therefore, I compute:

import numpy as np
import time


def mat_mul(mat1, mat2, mat3, mat4):
    # mat1.T @ (diag(mat2) * mat3) + mat4, exploiting that mat2 is diagonal
    return np.dot(np.transpose(mat1), np.multiply(np.diag(mat2)[:, None], mat3)) + mat4

n = 2000
mat1 = np.random.rand(n, n)
mat2 = np.diag(np.random.rand(n))  # diagonal matrix: off-diagonal entries are zero
mat3 = np.random.rand(n, n)
mat4 = np.random.rand(n, n)

t0 = time.time()
cov_11 = mat_mul(mat1, mat2, mat1, mat4)
t1 = time.time()
print('time ', t1 - t0, 's')

The matrices are of size n = (2000, 2000), and mat2 only has entries along its diagonal. The remaining entries are zero.

On my machine I get the following: time 0.3473696708679199 s

How can I speed this up?

Thanks.
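Since mat2 is diagonal, scaling the rows of mat3 by its diagonal entries is equivalent to the full product mat2 @ mat3, which is what the code above exploits instead of performing a second dense matrix multiplication. A quick check of that equivalence, using the arrays defined above:

d = np.diag(mat2)[:, None]  # (n, 1) column holding the diagonal of mat2
print(np.allclose(mat1.T @ (d * mat3) + mat4, mat1.T @ mat2 @ mat3 + mat4))  # expected: True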

The Numpy implementation can be optimized a bit by reducing the number of temporary arrays and reusing them as much as possible (i.e. multiple times). Indeed, while matrix multiplications are generally heavily optimized by BLAS implementations, filling/copying (newly allocated) arrays adds a non-negligible overhead.

Here is the implementation:

def mat_mul_opt(mat1, mat2, mat3, mat4):
    n = mat1.shape[0]
    tmp1 = np.empty((n, n))
    tmp2 = np.empty((n, n))
    vect = np.diag(mat2)[:, None]               # (n, 1) column of the diagonal
    np.multiply(vect, mat3, out=tmp1)           # row-scale mat3 into tmp1
    np.dot(np.transpose(mat1), tmp1, out=tmp2)  # BLAS matrix product into tmp2
    np.add(mat4, tmp2, out=tmp1)                # reuse tmp1 for the final sum
    return tmp1
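A quick sanity check that the optimized version matches the original (a minimal sketch using the arrays from the question):

res_ref = mat_mul(mat1, mat2, mat1, mat4)
res_opt = mat_mul_opt(mat1, mat2, mat1, mat4)
print(np.allclose(res_ref, res_opt))  # expected: True (up to rounding)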

The code can be optimized further if it is fine to mutate the input matrices, or if you can pre-allocate tmp1 and tmp2 outside the function once (and then reuse them multiple times). Here is an example:

def mat_mul_opt2(mat1, mat2, mat3, mat4, tmp1, tmp2):
    vect = np.diag(mat2)[:, None]
    np.multiply(vect, mat3, out=tmp1)           # scale rows of mat3 into the caller-provided tmp1
    np.dot(np.transpose(mat1), tmp1, out=tmp2)  # matrix product into tmp2
    np.add(mat4, tmp2, out=tmp1)                # final sum, reusing tmp1
    return tmp1
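A sketch of how the pre-allocated variant might be used when the product is computed many times (the loop and iteration count are illustrative, not from the original post):

# Allocate the scratch buffers once, outside the hot loop.
tmp1 = np.empty((n, n))
tmp2 = np.empty((n, n))

for _ in range(10):  # illustrative iteration count
    cov_11 = mat_mul_opt2(mat1, mat2, mat1, mat4, tmp1, tmp2)
    # cov_11 is tmp1, so use (or copy) it before the next call overwrites it.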

Here are performance results on my i5-9600KF processor (6 cores):

mat_mul:                 103.6 ms
mat_mul_opt:              96.7 ms
mat_mul_opt2:             83.5 ms
np.dot time only:         74.4 ms   (kind of practical lower-bound)
Optimal lower bound:      55   ms   (quite optimistic)
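The exact numbers above come from the answerer's machine; a sketch of one way to reproduce such a comparison (using the functions and arrays defined above; repeat and iteration counts are illustrative):

import timeit

tmp1 = np.empty((n, n))
tmp2 = np.empty((n, n))

for name, fn in [
    ("mat_mul", lambda: mat_mul(mat1, mat2, mat1, mat4)),
    ("mat_mul_opt", lambda: mat_mul_opt(mat1, mat2, mat1, mat4)),
    ("mat_mul_opt2", lambda: mat_mul_opt2(mat1, mat2, mat1, mat4, tmp1, tmp2)),
]:
    t = min(timeit.repeat(fn, number=5, repeat=3)) / 5  # best average time per call
    print(f"{name}: {t * 1e3:.1f} ms")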

Cython is not going to speed this up, simply because NumPy already uses other tricks, such as threading and SIMD, to speed things up; anyone who tries to implement such a function with Cython alone will end up with much worse performance.

Only two things are possible:

  1. Use a GPU-based version of numpy such as cupy (a sketch follows after this list).
  2. Use a different, more optimized backend for numpy if you aren't already using the best one (such as Intel MKL).
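For option 2, np.show_config() reports which BLAS/LAPACK library the installed NumPy is linked against, which tells you whether MKL or OpenBLAS is already in use. For option 1, here is a minimal CuPy sketch of the same computation (assuming a CUDA-capable GPU with cupy installed; mat_mul_gpu is an illustrative name, not from the original answer):

import cupy as cp

# Copy the operands to the GPU once; subsequent math runs on the device.
g1 = cp.asarray(mat1)
g2 = cp.asarray(mat2)
g4 = cp.asarray(mat4)

def mat_mul_gpu(m1, m2, m3, m4):
    # Same formula as mat_mul, evaluated with CuPy on the GPU.
    return cp.dot(m1.T, cp.diag(m2)[:, None] * m3) + m4

cov_11_gpu = mat_mul_gpu(g1, g2, g1, g4)
cov_11 = cp.asnumpy(cov_11_gpu)  # copy back to the host only if needed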
