
Code optimization - number of function calls in Python

I'd like to know how I might be able to transform this problem to reduce the overhead of the np.sum() function calls in my code.

I have an input matrix, say of shape=(1000, 36). Each row represents a node in a graph. I have an operation that iterates over each row and does an element-wise addition with a variable number of other rows. Those "other" rows are defined in a dictionary nodes_nbrs that records, for each row, a list of rows that must be summed together. For example:

nodes_nbrs = {0: [0, 1], 
              1: [1, 0, 2],
              2: [2, 1],
              ...}

Here, node 0 would be transformed into the sum of nodes 0 and 1. Node 1 would be transformed into the sum of nodes 1, 0, and 2. And so on for the rest of the nodes.

The current (and naive) way I have implemented this is as follows. I first instantiate a zero array of the final shape that I want, and then iterate over each key-value pair in the nodes_nbrs dictionary:

output = np.zeros(shape=input.shape)
for k, v in nodes_nbrs.items():
    output[k] = np.sum(input[v], axis=0)

This code is all cool and fine in small tests (shape=(1000, 36)), but on larger tests (shape=(~1E(5-6), 36)), it takes ~2-3 seconds to complete. I end up having to do this operation thousands of times, so I'm trying to see if there's a more optimized way of doing this.

After doing line profiling, I noticed that the key killer here is calling the np.sum function over and over, which takes about 50% of the total time. Is there a way I can eliminate this overhead? Or is there another way I can optimize this?
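
For reference, here is a minimal sketch of how that line profiling can be done, assuming the line_profiler package and a hypothetical convolve() wrapper around the loop above (both the tool choice and the function name are assumptions for illustration):

import numpy as np

def convolve(input_mat, nodes_nbrs):
    # same naive loop as above, wrapped so the profiler has a function to target
    output = np.zeros(shape=input_mat.shape)
    for k, v in nodes_nbrs.items():
        output[k] = np.sum(input_mat[v], axis=0)
    return output

# In IPython/Jupyter (requires `pip install line_profiler`):
#   %load_ext line_profiler
#   %lprun -f convolve convolve(input_mat, nodes_nbrs)
# The per-line report shows what fraction of the time the np.sum(...) line accounts for.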


Apart from that, here is a list of things I have done, and (very briefly) their results:

  • A cython version: eliminates the for loop type-checking overhead, 30% reduction in time taken. With the cython version, np.sum takes about 80% of the total wall clock time, rather than 50%.
  • Pre-declare np.sum as a variable npsum, and then call npsum inside the for loop. No difference with the original.
  • Replace np.sum with np.add.reduce, assign that to the variable npsum, and then call npsum inside the for loop. ~10% reduction in wall clock time, but then incompatible with autograd (explanation below in the sparse matrices bullet point).
  • numba JIT-ing: did not attempt more than adding the decorator. No improvement, but didn't try harder.
  • Convert the nodes_nbrs dictionary into a dense numpy binary array (1s and 0s), and then do a single np.dot operation. Good in theory, bad in practice, because it would require a square matrix of shape=(10^n, 10^n), which is quadratic in memory usage (a short sketch of this idea follows the list).
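
To make that last bullet concrete, here is a minimal sketch of the dense np.dot idea on a toy problem; the neighbor lists below are made up for illustration, and at N ~ 1E5-1E6 the (N, N) matrix is exactly the quadratic-memory problem described above:

import numpy as np

N, D = 1000, 36
X = np.random.rand(N, D)
nodes_nbrs = {i: [i, (i + 1) % N] for i in range(N)}  # toy neighbor lists

# Dense 0/1 "summation" matrix: row k has 1s at the columns listed in nodes_nbrs[k]
A = np.zeros((N, N))
for k, v in nodes_nbrs.items():
    A[k, v] = 1

output = A.dot(X)  # one matrix product replaces all the per-row np.sum calls

# matches the naive loop
expected = np.zeros_like(X)
for k, v in nodes_nbrs.items():
    expected[k] = np.sum(X[v], axis=0)
assert np.allclose(output, expected)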

Things I have not tried, but am hesitant to try:

  • scipy sparse matrices: I am using autograd, which does not support automatic differentiation of the dot operation for scipy sparse matrices.

For those who are curious, this is essentially a convolution operation on graph-structured data. Kinda fun developing this for grad school, but also somewhat frustrating being at the cutting edge of knowledge.

If scipy.sparse is not an option, one way you might approach this would be to massage your data so that you can use vectorized functions to do everything in the compiled layer. If you change your neighbors dictionary into a two-dimensional array with appropriate flags for missing values, you can use np.take to extract the data you want and then do a single sum() call.

Here's an example of what I have in mind:

import numpy as np

def make_data(N=100):
    X = np.random.randint(1, 20, (N, 36))
    connections = np.random.randint(2, 5, N)
    nbrs = {i: list(np.random.choice(N, c))
            for i, c in enumerate(connections)}
    return X, nbrs

def original_solution(X, nbrs):
    output = np.zeros(shape=X.shape)
    for k, v in nbrs.items():
        output[k] = np.sum(X[v], axis=0)
    return output

def vectorized_solution(X, nbrs):
    # Make neighbors all the same length, filling with -1
    new_nbrs = np.full((X.shape[0], max(map(len, nbrs.values()))), -1, dtype=int)
    for i, v in nbrs.items():
        new_nbrs[i, :len(v)] = v

    # add a row of zeros to X
    new_X = np.vstack([X, 0 * X[0]])

    # compute the sums
    return new_X.take(new_nbrs, 0).sum(1)

Now we can confirm that the results match:

>>> X, nbrs = make_data(100)
>>> np.allclose(original_solution(X, nbrs),
                vectorized_solution(X, nbrs))
True

And we can time things to see the speedup:

X, nbrs = make_data(1000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
# 100 loops, best of 3: 13.7 ms per loop
# 100 loops, best of 3: 1.89 ms per loop

Going up to larger sizes:

X, nbrs = make_data(100000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
# 1 loop, best of 3: 1.42 s per loop
# 1 loop, best of 3: 249 ms per loop

It's about a factor of 5-10 faster, which may be good enough for your purposes (though this will heavily depend on the exact characteristics of your nbrs dictionary).
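
As a hypothetical illustration of that dependence: a single high-degree node forces every row of the padded new_nbrs array to that width, so the take/sum step spends most of its time on padding:

X, nbrs = make_data(100000)
nbrs[0] = list(range(1000))           # one artificially high-degree node
print(max(map(len, nbrs.values())))   # padded width jumps from ~4 to 1000
# new_nbrs becomes (100000, 1000), so new_X.take(new_nbrs, 0) materializes a
# (100000, 1000, 36) intermediate, most of which is the all-zeros padding row.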


Edit: Just for fun, I tried a couple other approaches, one using numpy.add.reduceat, one using pandas.groupby, and one using scipy.sparse. It seems that the vectorized approach I originally proposed above is probably the best bet. Here they are for reference:

from itertools import chain

def reduceat_solution(X, nbrs):
    ind, j = np.transpose([[i, len(v)] for i, v in nbrs.items()])
    i = list(chain(*(nbrs[i] for i in ind)))
    j = np.concatenate([[0], np.cumsum(j)[:-1]])
    return np.add.reduceat(X[i], j)[ind]

np.allclose(original_solution(X, nbrs),
            reduceat_solution(X, nbrs))
# True

---

import pandas as pd

def groupby_solution(X, nbrs):
    # flatten the neighbor lists into (target node, source row) index pairs
    i, j = np.transpose([[k, vi] for k, v in nbrs.items() for vi in v])
    # gather the source rows, group them by target node, and sum within each group
    return pd.DataFrame(X[j]).groupby(i).sum().values

np.allclose(original_solution(X, nbrs),
            groupby_solution(X, nbrs))
# True

---

from scipy.sparse import csr_matrix
from itertools import chain

def sparse_solution(X, nbrs):
    items = (([i]*len(col), col, [1]*len(col)) for i, col in nbrs.items())
    rows, cols, data = (np.array(list(chain(*a))) for a in zip(*items))
    M = csr_matrix((data, (rows, cols)))
    return M.dot(X)

np.allclose(original_solution(X, nbrs),
            sparse_solution(X, nbrs))
# True

And all the timings together:

X, nbrs = make_data(100000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
%timeit reduceat_solution(X, nbrs)
%timeit groupby_solution(X, nbrs)
%timeit sparse_solution(X, nbrs)
# 1 loop, best of 3: 1.46 s per loop
# 1 loop, best of 3: 268 ms per loop
# 1 loop, best of 3: 416 ms per loop
# 1 loop, best of 3: 657 ms per loop
# 1 loop, best of 3: 282 ms per loop

Based on work on recent sparse questions, e.g. Extremely slow sum row operation in Sparse LIL matrix in Python,

here's how your sort of problem could be solved with sparse matrices. The method might apply just as well to dense ones. The idea is that a sparse sum can be implemented as a matrix product with a row (or column) of 1s. Indexing of sparse matrices is slow, but the matrix product is good C code.

In this case I'm going to build a multiplication matrix that has 1s for the rows that I want to sum, with a different set of 1s for each entry in the dictionary.

A sample matrix:

In [300]: import numpy as np
In [301]: from scipy import sparse
In [302]: A = np.arange(8*3).reshape(8,3)
In [303]: M = sparse.csr_matrix(A)

The selection dictionary:

In [304]: dict={0:[0,1],1:[1,0,2],2:[2,1],3:[3,4,7]}

Build a sparse matrix from this dictionary. This might not be the most efficient way of constructing such a matrix, but it is enough to demonstrate the idea.

In [305]: r,c,d=[],[],[]
In [306]: for i,col in dict.items():
    c.extend(col)
    r.extend([i]*len(col))
    d.extend([1]*len(col))

In [307]: r,c,d
Out[307]: 
([0, 0, 1, 1, 1, 2, 2, 3, 3, 3],
 [0, 1, 1, 0, 2, 2, 1, 3, 4, 7],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [308]: idx=sparse.csr_matrix((d,(r,c)),shape=(len(dict),M.shape[0]))

Perform the sum and look at the result (as a dense array):

In [310]: (idx*M).A
Out[310]: 
array([[ 3,  5,  7],
       [ 9, 12, 15],
       [ 9, 11, 13],
       [42, 45, 48]], dtype=int32)

Here's the original for comparison.

In [312]: M.A
Out[312]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23]], dtype=int32)
