
Scipy Sparse Matrix Loop Taking Forever - Need to make more efficient

What is the most efficient way, in terms of time and memory, to write this loop over a sparse matrix (currently a csc_matrix)?

for j in range(0, reducedsize):
    xs = sum(X[:, j])
    X[:, j] = X[:, j] / xs.data[0]

Example:

reducedsize (int) - 2500
X (csc_matrix) - 908x2500

The loop does iterate, but it takes a very long time compared to just using numpy.

In [388]: from scipy import sparse                                                      

Make a sample matrix:

In [390]: M = sparse.random(10,8,.2, 'csc')                                             

Matrix sum:

In [393]: M.sum(axis=0)                                                                 
Out[393]: 
matrix([[1.95018736, 0.90924629, 1.93427113, 2.38816133, 1.08713479,
         0.        , 2.45435481, 0.        ]])

Those 0s produce a warning when dividing, and nan in the results:

In [394]: M/_                                                                           
/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py:599: RuntimeWarning: invalid value encountered in true_divide
  return np.true_divide(self.todense(), other)
Out[394]: 
matrix([[0.        , 0.        , 0.        , 0.        , 0.27079623,
                nan, 0.13752665,        nan],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
                nan, 0.32825122,        nan],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
                nan, 0.        ,        nan],
 ...
                nan, 0.        ,        nan]])

The 0s also cause a problem with your approach:

In [395]: for i in range(8): 
     ...:     xs = sum(M[:,i]) 
     ...:     M[:,i] = M[:,i]/xs.data[0] 
     ...:                                                                               
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-395-0195298ead19> in <module>
      1 for i in range(8):
      2     xs = sum(M[:,i])
----> 3     M[:,i] = M[:,i]/xs.data[0]
      4 

IndexError: index 0 is out of bounds for axis 0 with size 0

But if we compare the columns whose sum is not 0, the values match:

In [401]: Out[394][:,:5]                                                                
Out[401]: 
matrix([[0.        , 0.        , 0.        , 0.        , 0.27079623],
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.49648886, 0.25626608, 0.        , 0.19162678, 0.72920377],
        [0.        , 0.        , 0.30200765, 0.        , 0.        ],
        [0.50351114, 0.        , 0.30445113, 0.41129367, 0.        ],
        [0.        , 0.74373392, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.39354122, 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.39707955, 0.        ]])
In [402]: M.A[:,:5]                                                                     
Out[402]: 
array([[0.        , 0.        , 0.        , 0.        , 0.27079623],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.49648886, 0.25626608, 0.        , 0.19162678, 0.72920377],
       [0.        , 0.        , 0.30200765, 0.        , 0.        ],
       [0.50351114, 0.        , 0.30445113, 0.41129367, 0.        ],
       [0.        , 0.74373392, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.39354122, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.39707955, 0.        ]])

Back in [394] I should have first converted the matrix sum to sparse, so that the result would also be sparse. Sparse doesn't have an elementwise divide, so I had to take the elementwise reciprocal of the dense sum matrix first. The 0s are still a nuisance.

In [409]: M.multiply(sparse.csr_matrix(1/Out[393]))                                     
...
Out[409]: 
<10x8 sparse matrix of type '<class 'numpy.float64'>'
    with 16 stored elements in Compressed Sparse Column format>
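
One way to sidestep the 0s while keeping the result sparse is to take the reciprocal only where the column sum is nonzero, then scale the columns with a diagonal matrix. This is only a sketch, not part of the session above: `normalize_columns_sparse` is a name made up here, and leaving zero-sum columns as all zeros is an assumption about what you want.

```python
import numpy as np
from scipy import sparse

def normalize_columns_sparse(M):
    """Return a copy of M in which every column with a nonzero sum sums to 1."""
    col_sums = np.asarray(M.sum(axis=0)).ravel()
    inv = np.zeros_like(col_sums, dtype=float)
    nonzero = col_sums != 0
    inv[nonzero] = 1.0 / col_sums[nonzero]   # zero-sum columns keep a factor of 0
    # Right-multiplying by a diagonal matrix scales column j by inv[j]
    return (M @ sparse.diags(inv)).tocsc()

M = sparse.random(10, 8, density=0.2, format='csc', random_state=0)
N = normalize_columns_sparse(M)
```

No dense 10x8 intermediate is created, and no division by zero happens, so there is no warning and no nan.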

If you want to do it without any memory overhead (in-place)

Always think about how the data is actually stored. A small example on a csc matrix:

shape=(5,5)
X=sparse.random(shape[0], shape[1], density=0.5, format='csc')
print(X.todense())

[[0.12146814 0.         0.         0.04075121 0.28749552]
 [0.         0.92208639 0.         0.44279661 0.        ]
 [0.63509196 0.42334964 0.         0.         0.99160443]
 [0.         0.         0.25941113 0.44669367 0.00389409]
 [0.         0.         0.         0.         0.83226886]]

i=0 #first column
print(X.data[X.indptr[i]:X.indptr[i+1]])
[0.12146814 0.63509196]
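
To make the layout easier to follow, here is a tiny hand-built matrix (the values are chosen only for illustration) showing the three arrays behind a csc matrix, including an empty middle column:

```python
import numpy as np
from scipy import sparse

dense = np.array([[1., 0., 4.],
                  [0., 0., 5.],
                  [2., 0., 6.]])
X = sparse.csc_matrix(dense)

print(X.data)     # [1. 2. 4. 5. 6.]  nonzero values, stored column by column
print(X.indices)  # [0 2 0 1 2]       row index of each stored value
print(X.indptr)   # [0 2 2 5]         column i is data[indptr[i]:indptr[i+1]]

# Slice out the stored values of every column
columns = [X.data[X.indptr[i]:X.indptr[i + 1]] for i in range(X.shape[1])]
```

The empty column shows up as two equal consecutive entries in indptr, so its slice has length 0.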

A Numpy solution

So the only thing we want to do here is to modify the nonzero entries in place, column by column. This can easily be done with a partly vectorized numpy solution. data is just the array that contains all nonzero values, and indptr stores where each column begins and ends within data.

import numpy as np

def Numpy_csc_norm(data,indptr):
    #indptr has one more entry than there are columns
    for i in range(indptr.shape[0]-1):
        xs = np.sum(data[indptr[i]:indptr[i+1]])
        #Modify the view in place
        data[indptr[i]:indptr[i+1]]/=xs

Regarding performance, this in-place solution is already not too bad. If you want to improve the performance further, you could use Cython, Numba, or some other compiled code that can be wrapped in Python more or less easily.
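
Before reaching for compiled code, the Python loop over columns can probably also be removed with plain numpy. The following is a sketch I have not benchmarked; `csc_norm_vectorized` is a made-up name. It still writes into data in place, but it allocates temporary arrays with one entry per stored value, so it is not free of memory overhead.

```python
import numpy as np
from scipy import sparse

def csc_norm_vectorized(data, indptr):
    """Divide each CSC column by its sum, in place, without a Python loop."""
    counts = np.diff(indptr)                        # stored entries per column
    col_ids = np.repeat(np.arange(counts.size), counts)
    # Sum of the stored values of each column (0 for empty columns)
    sums = np.bincount(col_ids, weights=data, minlength=counts.size)
    data /= sums[col_ids]                           # empty columns own no entries

X = sparse.random(1000, 500, density=0.01, format='csc', random_state=0)
csc_norm_vectorized(X.data, X.indptr)
```

Empty columns are never touched, so they cause no division by zero. A column whose stored values happen to sum to 0 would still produce inf or nan.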

A Numba solution

import numba as nb

@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def Numba_csc_norm(data,indptr):
    for i in nb.prange(indptr.shape[0]-1):
        #Sum the stored values of column i
        acc=0.
        for j in range(indptr[i],indptr[i+1]):
            acc+=data[j]
        #Scale the column in place
        for j in range(indptr[i],indptr[i+1]):
            data[j]/=acc

Performance

#Create a not too small example matrix
shape=(50_000,10_000)
X=sparse.random(shape[0], shape[1], density=0.001, format='csc')

#Not in-place from hpaulj
def hpaulj(X):
    acc=X.sum(axis=0)
    return X.multiply(sparse.csr_matrix(1./acc))

%timeit X2=hpaulj(X)
#6.54 ms ± 67.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#Both variants are in-place,
#but this shouldn't have an influence on the timings

%timeit Numpy_csc_norm(X.data,X.indptr)
#79.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#parallel=False -> faster on tiny matrices
%timeit Numba_csc_norm(X.data,X.indptr)
#626 µs ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#parallel=True -> faster on larger matrices
%timeit Numba_csc_norm(X.data,X.indptr)
#185 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
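
Since the in-place variants overwrite X.data, any correctness check should run on a copy. A minimal sketch of such a check against an out-of-place reference follows; the helper `numpy_csc_norm` just repeats the Numpy loop so the block is self-contained, and the matrix size is arbitrary.

```python
import numpy as np
from scipy import sparse

def numpy_csc_norm(data, indptr):
    # Same loop as the Numpy solution, repeated here for a standalone check
    for i in range(indptr.shape[0] - 1):
        col = data[indptr[i]:indptr[i + 1]]   # view into data
        col /= np.sum(col)

X = sparse.random(2000, 300, density=0.01, format='csc', random_state=1)

# Out-of-place reference; zero sums are replaced by 1 to avoid dividing by 0
col_sums = np.asarray(X.sum(axis=0)).ravel()
safe_sums = np.where(col_sums != 0, col_sums, 1.0)
reference = X.multiply(sparse.csr_matrix(1.0 / safe_sums)).tocsc()

# Normalize a copy in place, so X itself stays untouched
X_inplace = X.copy()
numpy_csc_norm(X_inplace.data, X_inplace.indptr)

same = np.allclose(X_inplace.toarray(), reference.toarray())
```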
