
Optimizing matrix writes in python/numpy

I'm currently trying to optimize a piece of code; the gist of it is that we go through, compute a bunch of values, and write them into a matrix. The order of computation doesn't matter:

mat =  np.zeros((n, n))
mat.fill(MAX_VAL)
for i in xrange(0, smallerDim):
    for j in xrange(0,n):
        similarityVal = doACalculation(i,j, data, cache)
        mat[i][j] = abs(1.0 / (similarityVal + 1.0))

I've profiled this code and found that approximately 90% of the time is spent writing the value back into the matrix (the last line).

I'm wondering what the optimal way is to structure this type of computation so the writes are fast. Should I write to an intermediate buffer and copy in the whole row, etc.? I'm a bit clueless about performance tuning and numpy internals.

EDIT: doACalculation is not a side-effect-free function. It takes in some data (assume this is some Python object) and also a cache to which it writes and reads some intermediate steps. I'm not sure it can easily be vectorized. I tried using numpy.vectorize as recommended but did not see a significant speedup over the naive for loop. (I passed in the additional data via a state variable.)
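The row-buffer idea from the question can be sketched like this. This is a minimal sketch with a hypothetical stand-in for doACalculation (the real function's data and cache behavior is not shown in the question); the point is that each row is filled in a plain Python loop but written into the matrix with a single vectorized slice assignment:

```python
import numpy as np

MAX_VAL = 10.0
n, smallerDim = 5, 3

def doACalculation(i, j, data, cache):
    # Hypothetical stand-in: the real function reads `data` and
    # reads/writes intermediate results via `cache`.
    key = (i, j)
    if key not in cache:
        cache[key] = data[i] * j
    return cache[key]

data = np.arange(n, dtype=float)
cache = {}

mat = np.full((n, n), MAX_VAL)
row = np.empty(n)                        # reusable row buffer
for i in range(smallerDim):
    for j in range(n):
        row[j] = doACalculation(i, j, data, cache)
    mat[i] = np.abs(1.0 / (row + 1.0))   # one vectorized write per row
```

This keeps the per-element Python calls but replaces n scalar stores into the big array with one slice assignment per row.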

Wrapping it in a numba autojit should improve the performance quite a bit.

def doACalculationVector(n, smallerDim):
    return np.ones((smallerDim, n)) + 1


def testVector():
    n = 1000
    smallerDim = 800
    mat =  np.zeros((n, n))
    mat.fill(10) 
    mat[:smallerDim] = abs(1.0 / (doACalculationVector(n, smallerDim) + 1.0))
    return mat

@numba.autojit
def doACalculationNumba(i,j):
    return 2

@numba.autojit
def testNumba():
    n = 1000
    smallerDim = 800
    mat =  np.zeros((n, n))
    mat.fill(10)
    for i in xrange(0, smallerDim):
        for j in xrange(0, n):
            mat[i,j] = abs(1.0 / (doACalculationNumba(i, j) + 1.0))
    return mat

Original timing for reference (with mat[i][j] changed to mat[i,j]):

In [24]: %timeit test()
1 loops, best of 3: 226 ms per loop

Now, I simplified the function a bit, since this was all that was provided. But when timed, testNumba was about 40 times as fast as test, and about 3 times as fast as the vectorized version:

In [20]: %timeit testVector()
100 loops, best of 3: 17.9 ms per loop

In [21]: %timeit testNumba()
100 loops, best of 3: 5.91 ms per loop

If you can vectorize doACalculation, the task becomes easy:

similarityArray = doACalculation(np.indices((smallerDim, n)))
mat[:smallerDim] = np.abs(1.0 / (similarityArray + 1))

This should be at least an order of magnitude faster, assuming you vectorize doACalculation properly. Generally, when working with NumPy arrays, you want to avoid explicit loops and element accesses as much as possible.

For reference, an example vectorization of a possible doACalculation:

# Unvectorized
def doACalculation(i, j):
    return i**2 + i*j + j

# Vectorized
def doACalculation(input):
    i, j = input
    return i**2 + i*j + j

# Vectorized, but with the original call signature
def doACalculation(i, j):
    return i**2 + i*j + j

Yes, the last version really is supposed to be identical to the unvectorized function. It sometimes is that easy.
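To see why, here is a quick check (a minimal sketch): the same function body accepts both scalars and the index grids produced by np.indices, because NumPy arithmetic operators broadcast elementwise over arrays.

```python
import numpy as np

def doACalculation(i, j):
    # same body as the unvectorized version above
    return i**2 + i*j + j

# Scalar call: ordinary Python arithmetic.
assert doACalculation(2, 3) == 13

# Array call: np.indices yields two (smallerDim, n) index grids,
# and the identical function body operates on them elementwise.
i, j = np.indices((3, 4))
result = doACalculation(i, j)
assert result.shape == (3, 4)
assert result[2, 3] == 13   # matches the scalar call for i=2, j=3
```

No code change is needed as long as the function body only uses operations that NumPy overloads for arrays.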

Even if you can't vectorize doACalculation(), you can use numpy.vectorize() to speed up the calculation. Here is a test.

import numpy as np
n = 1000
smallerDim = 500

def doACalculation(i, j):
    return i+j

For-loop version:

%%timeit
mat =  np.zeros((n, n))

for i in xrange(0, smallerDim):
    for j in xrange(0,n):
        similarityVal = doACalculation(i,j)
        mat[i,j] = abs(1.0 / (similarityVal + 1.0))

output:

1 loops, best of 3: 183 ms per loop

vectorize() version:

%%timeit
mat2 =  np.zeros((n, n))
i, j = np.ix_(np.arange(smallerDim), np.arange(n))
f = np.vectorize(doACalculation, "d")
mat2[:smallerDim] = np.abs(1.0/(f(i, j) + 1))

output:

10 loops, best of 3: 97.3 ms per loop

Test result:

np.allclose(mat,mat2)

output:

True

This method doesn't make the doACalculation() calls themselves much faster, but it makes it possible for the subsequent calculation to be done in a vectorized way.
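An alternative with the same trade-off (not from the original answer, just a sketch) is np.frompyfunc, which likewise calls the Python function once per element but hands back a full array for the vectorized follow-up math:

```python
import numpy as np

def doACalculation(i, j):
    return i + j

n, smallerDim = 6, 4
i, j = np.ix_(np.arange(smallerDim), np.arange(n))

# np.frompyfunc also wraps a per-element Python call; it returns an
# object array, so convert to float before the vectorized math.
f = np.frompyfunc(doACalculation, 2, 1)
vals = f(i, j).astype(float)

mat = np.zeros((n, n))
mat[:smallerDim] = np.abs(1.0 / (vals + 1.0))
```

As with np.vectorize, the win comes from the follow-up arithmetic and the single slice assignment, not from the wrapped calls.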
