简体   繁体   中英

Adding a sparse vector to a sparse array in python is too slow

I have a sparse matrix and I'm trying to add a sparse vector to it. I've tried different sparse formats, including csr, csc, lil, coo, and different ways of adding the sparse vector to sparse matrix, including vstack and concatenate.

All ways and formats turned out to be very slow. But when I convert the vector to dense format (by todense() ) and append it to a dense matrix (numpy.ndarray specifically) it is done very quickly. Why is it? Is there a trick or a suitable format for this that I'm missing?

Here is my code for when I tried it with 'coo' format:

from scipy.sparse import coo_matrix, rand
from time import time as timer
from numpy import array, concatenate, empty

### sparse appending in coo way ####
def sparse_append(A):
    dim = A.shape[1]
    mat = coo_matrix((0, dim))

    sparse_addtime = 0

    for vector in A:
        st = timer() 

        row = coo_matrix(vector)
        newdata = concatenate((mat.data, row.data))
        newrows = concatenate((mat.row, row.row + mat.shape[0]))
        newcols = concatenate((mat.col, row.col))

        mat = coo_matrix((newdata, (newrows, newcols)), shape = ((mat.shape)[0]+1, (mat.shape)[1]))

        et = timer() 
        sparse_addtime += et-st

    return sparse_addtime

#### dense append ####
def dense_append(A):
    dim = A.shape[1]
    mat = empty([0,dim])

    dense_addtime = 0

    for vector in A:
        st = timer()
        mat = concatenate((mat,vector))
        et = timer()
        dense_addtime += et-st

    return dense_addtime



### main ####
if __name__ == '__main__':
    dim = 400
    n = 200

    A = rand(n, dim, density = 0.1, format='lil')
    B = A.todense() #numpy.ndarray

    t1 = sparse_append(A)
    t2 = dense_append(B)

    print t1, t2

Any help is appreciated.

The slowest part in your sparse addition code is the row conversion.

row = coo_matrix(vector)

This takes roughly 65% of the time when I run it. This is because it needs to change the storage format it is storing the data in. The other slow part is creating a matrix.

mat = coo_matrix((newdata, (newrows, newcols)), shape = ((mat.shape)[0]+1, (mat.shape)[1]))

This takes a further 30% of the time. Every time you do this, you are copying all the data and allocating a bunch of memory. The most efficient way of adding your rows, especially if they are already in lil format, is to modify the matrix. If you know the matrix's dimensions at the start, you can just create the matrix with the right shape from the start. The sparse format is memory efficient, and empty rows are no issue. Otherwise, you can use set_shape to increase the dimensions every time.

from scipy.sparse import lil_matrix, rand
from time import time as timer
from numpy import array, concatenate, empty

### sparse appending ####
def sparse_append(A):
    dim = A.shape[1]
    mat = lil_matrix(A.shape, dtype = A.dtype)

    sparse_addtime = 0
    i = 0
    for vector in A:
        st = timer()

        mat[i] = vector
        i += 1
        et = timer() 
        sparse_addtime += et-st

    return sparse_addtime



#### dense append ####
def dense_append(A):
    dim = A.shape[1]
    mat = empty([0,dim])

    dense_addtime = 0

    for vector in A:
        st = timer()
        mat = concatenate((mat,vector))
        et = timer()
        dense_addtime += et-st

    return dense_addtime



### main ####
if __name__ == '__main__':
    dim = 400
    n = 200

    A = rand(n, dim, density = 0.1, format='lil')
    B = A.todense() #numpy.ndarray

    t1 = sparse_append(A)
    t2 = dense_append(B)

    print t1, t2

Running the code like this, I get slitghly better time from the sparse addition.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM