Adding a sparse vector to a sparse array in Python is too slow

Question

I have a sparse matrix and I'm trying to append a sparse vector to it as a new row. I've tried different sparse formats, including csr, csc, lil, and coo, and different ways of adding the sparse vector to the sparse matrix, including vstack and concatenate.

All of these approaches and formats turned out to be very slow. But when I convert the vector to dense format (with todense()) and append it to a dense matrix (a numpy.ndarray specifically), it is very fast. Why is that? Is there a trick or a suitable format for this that I'm missing?

Here is the code I used when I tried it with the 'coo' format:

from scipy.sparse import coo_matrix, rand
from time import time as timer
from numpy import array, concatenate, empty

### sparse appending in coo way ####
def sparse_append(A):
    dim = A.shape[1]
    mat = coo_matrix((0, dim))

    sparse_addtime = 0

    for vector in A:
        st = timer() 

        row = coo_matrix(vector)
        newdata = concatenate((mat.data, row.data))
        newrows = concatenate((mat.row, row.row + mat.shape[0]))
        newcols = concatenate((mat.col, row.col))

        mat = coo_matrix((newdata, (newrows, newcols)), shape = ((mat.shape)[0]+1, (mat.shape)[1]))

        et = timer() 
        sparse_addtime += et-st

    return sparse_addtime

#### dense append ####
def dense_append(A):
    dim = A.shape[1]
    mat = empty([0,dim])

    dense_addtime = 0

    for vector in A:
        st = timer()
        mat = concatenate((mat,vector))
        et = timer()
        dense_addtime += et-st

    return dense_addtime



### main ####
if __name__ == '__main__':
    dim = 400
    n = 200

    A = rand(n, dim, density = 0.1, format='lil')
    B = A.todense()  # numpy.matrix (a 2-D ndarray subclass)

    t1 = sparse_append(A)
    t2 = dense_append(B)

    print(t1, t2)

Any help is appreciated.

Answer 1

The slowest part of your sparse append code is the row conversion:

row = coo_matrix(vector)

This takes roughly 65% of the time when I run it, because the row has to be converted from one storage format to another. The other slow part is creating a new matrix:

mat = coo_matrix((newdata, (newrows, newcols)), shape = ((mat.shape)[0]+1, (mat.shape)[1]))

This takes a further 30% of the time. Every time you do this, you copy all the existing data and allocate a new block of memory. The most efficient way of adding your rows, especially if they are already in lil format, is to modify the matrix in place. If you know the matrix's dimensions at the start, you can create the matrix with the right shape up front. The sparse format is memory efficient, so empty rows are no issue. Otherwise, you can use set_shape to increase the dimensions every time.
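For the case where the final number of rows is not known in advance, here is a minimal sketch of growing a lil matrix as rows arrive. Note the assumptions: it uses `resize` rather than the `set_shape` call mentioned above (I am assuming a reasonably recent SciPy where `lil_matrix.resize` works in place), it doubles the capacity instead of growing one row at a time, and the names `append_rows_growing` and `initial_capacity` are made up for illustration.

```python
import numpy as np
from scipy.sparse import lil_matrix


def append_rows_growing(rows, dim, initial_capacity=4):
    """Copy 1 x dim sparse rows into a LIL matrix that grows on demand.

    initial_capacity is a hypothetical tuning knob and must be >= 1.
    """
    mat = lil_matrix((initial_capacity, dim))
    count = 0
    for row in rows:
        if count == mat.shape[0]:
            # Out of room: double the number of rows in place.
            mat.resize((2 * mat.shape[0], dim))
        mat[count] = row
        count += 1
    # Trim the unused, empty rows at the end.
    mat.resize((count, dim))
    return mat


# Small demo on a 10 x 6 matrix with roughly 30% non-zero entries.
rng = np.random.default_rng(0)
dense = rng.random((10, 6))
dense[dense < 0.7] = 0.0
A = lil_matrix(dense)

result = append_rows_growing((A[i] for i in range(A.shape[0])), A.shape[1])
```

Doubling keeps the number of resize calls small; whether it actually beats growing by one row each time is something you would need to measure on your own data.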

from scipy.sparse import lil_matrix, rand
from time import time as timer
from numpy import array, concatenate, empty

### sparse appending ####
def sparse_append(A):
    dim = A.shape[1]
    mat = lil_matrix(A.shape, dtype = A.dtype)

    sparse_addtime = 0
    i = 0
    for vector in A:
        st = timer()

        mat[i] = vector
        i += 1
        et = timer() 
        sparse_addtime += et-st

    return sparse_addtime



#### dense append ####
def dense_append(A):
    dim = A.shape[1]
    mat = empty([0,dim])

    dense_addtime = 0

    for vector in A:
        st = timer()
        mat = concatenate((mat,vector))
        et = timer()
        dense_addtime += et-st

    return dense_addtime



### main ####
if __name__ == '__main__':
    dim = 400
    n = 200

    A = rand(n, dim, density = 0.1, format='lil')
    B = A.todense()  # numpy.matrix (a 2-D ndarray subclass)

    t1 = sparse_append(A)
    t2 = dense_append(B)

    print(t1, t2)

Running the code like this, I get slightly better timings from the sparse append.
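Since the cost in the original loop comes from rebuilding the matrix on every iteration, another option (not from the answer above, just a common alternative) is to collect the rows in a Python list and stack them once at the end. A minimal sketch, assuming the rows are already 1 x dim sparse matrices; `stack_rows_once` is a hypothetical helper name:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def stack_rows_once(rows):
    """Stack a non-empty list of 1 x dim sparse rows with a single vstack call."""
    return vstack(rows, format="csr")


# Small demo: split a matrix into rows, then rebuild it in one step.
rng = np.random.default_rng(0)
dense = rng.random((10, 6))
dense[dense < 0.7] = 0.0
A = csr_matrix(dense)

rows = [A[i] for i in range(A.shape[0])]  # each item is a 1 x 6 csr_matrix
stacked = stack_rows_once(rows)
```

Appending to a Python list is cheap, and the data is only copied once in the final call, so this avoids the repeated full copy described earlier. I have not timed it against the lil version here.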

Answer 1: 0, accepted, 2015-08-21 01:48:23
