Scipy Sparse Matrix Loop Taking Forever - Need to make more efficient
What is the most efficient way, time- and memory-wise, of writing this loop with a sparse matrix (currently a csc_matrix)?
for j in range(0, reducedsize):
    xs = sum(X[:, j])
    X[:, j] = X[:, j] / xs.data[0]
Example:
reducedsize (int) - 2500
X (csc_matrix) - 908x2500
The loop does iterate, but it takes a very long time compared to just using numpy.
In [388]: from scipy import sparse
Make a sample matrix:
In [390]: M = sparse.random(10,8,.2, 'csc')
Matrix sum:
In [393]: M.sum(axis=0)
Out[393]:
matrix([[1.95018736, 0.90924629, 1.93427113, 2.38816133, 1.08713479,
0. , 2.45435481, 0. ]])
Those 0s produce a warning when dividing - and nan in the results:
In [394]: M/_
/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py:599: RuntimeWarning: invalid value encountered in true_divide
return np.true_divide(self.todense(), other)
Out[394]:
matrix([[0. , 0. , 0. , 0. , 0.27079623,
nan, 0.13752665, nan],
[0. , 0. , 0. , 0. , 0. ,
nan, 0.32825122, nan],
[0. , 0. , 0. , 0. , 0. ,
nan, 0. , nan],
...
nan, 0. , nan]])
The 0s also give a problem with your approach:
In [395]: for i in range(8):
...: xs = sum(M[:,i])
...: M[:,i] = M[:,i]/xs.data[0]
...:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-395-0195298ead19> in <module>
1 for i in range(8):
2 xs = sum(M[:,i])
----> 3 M[:,i] = M[:,i]/xs.data[0]
4
IndexError: index 0 is out of bounds for axis 0 with size 0
But if we compare the columns whose sums are nonzero, the values match:
In [401]: Out[394][:,:5]
Out[401]:
matrix([[0. , 0. , 0. , 0. , 0.27079623],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.49648886, 0.25626608, 0. , 0.19162678, 0.72920377],
[0. , 0. , 0.30200765, 0. , 0. ],
[0.50351114, 0. , 0.30445113, 0.41129367, 0. ],
[0. , 0.74373392, 0. , 0. , 0. ],
[0. , 0. , 0.39354122, 0. , 0. ],
[0. , 0. , 0. , 0.39707955, 0. ]])
In [402]: M.A[:,:5]
Out[402]:
array([[0. , 0. , 0. , 0. , 0.27079623],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.49648886, 0.25626608, 0. , 0.19162678, 0.72920377],
[0. , 0. , 0.30200765, 0. , 0. ],
[0.50351114, 0. , 0.30445113, 0.41129367, 0. ],
[0. , 0.74373392, 0. , 0. , 0. ],
[0. , 0. , 0.39354122, 0. , 0. ],
[0. , 0. , 0. , 0.39707955, 0. ]])
Back in [394] I should have first converted the matrix sum to sparse, so the result would also be sparse. Sparse doesn't have elementwise divide, so I had to take the reciprocal of the dense sum first. The 0s are still a nuisance.
In [409]: M.multiply(sparse.csr_matrix(1/Out[393]))
...
Out[409]:
<10x8 sparse matrix of type '<class 'numpy.float64'>'
with 16 stored elements in Compressed Sparse Column format>
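If the zero-sum columns are a nuisance in practice, one way to sidestep them (a sketch of my own, not part of the answer above) is to replace the zero sums with 1 before taking the reciprocal; columns that were empty then simply stay empty:

```python
import numpy as np
from scipy import sparse

M = sparse.random(10, 8, 0.2, format='csc', random_state=0)
col_sums = np.asarray(M.sum(axis=0)).ravel()
# A column with sum 0 has no stored entries to scale, so dividing by 1 is harmless
safe_sums = np.where(col_sums == 0, 1.0, col_sums)
M_norm = M.multiply(1.0 / safe_sums).tocsc()
```

Each nonempty column of M_norm now sums to 1, and no RuntimeWarning is raised along the way.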
Always think about how the data is actually stored. A small example on a csc matrix:
import numpy as np
from scipy import sparse

shape = (5, 5)
X = sparse.random(shape[0], shape[1], density=0.5, format='csc')
print(X.todense())
[[0.12146814 0. 0. 0.04075121 0.28749552]
[0. 0.92208639 0. 0.44279661 0. ]
[0.63509196 0.42334964 0. 0. 0.99160443]
[0. 0. 0.25941113 0.44669367 0.00389409]
[0. 0. 0. 0. 0.83226886]]
i = 0  # first column
print(X.data[X.indptr[i]:X.indptr[i+1]])
[0.12146814 0.63509196]
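For completeness: the matching row positions of those stored values live in X.indices, sliced the same way. A small self-contained check (my own addition) that the (indices, data) pairs really reconstruct a column:

```python
import numpy as np
from scipy import sparse

X = sparse.random(5, 5, density=0.5, format='csc', random_state=0)
i = 0  # first column
vals = X.data[X.indptr[i]:X.indptr[i+1]]     # stored values of column i
rows = X.indices[X.indptr[i]:X.indptr[i+1]]  # their row positions
# Scatter the pairs back into a dense column and compare
dense_col = np.zeros(X.shape[0])
dense_col[rows] = vals
print(np.allclose(dense_col, X.toarray()[:, i]))  # True
```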
A Numpy solution
So the only thing we want to do here is to modify the nonzero entries column by column, in place. This can be done easily using a partly vectorized numpy solution. data is just the array which contains all nonzero values; indptr stores where each column begins and ends.
import numpy as np

def Numpy_csc_norm(data, indptr):
    for i in range(indptr.shape[0] - 1):
        xs = np.sum(data[indptr[i]:indptr[i+1]])
        # Modify the view in place
        data[indptr[i]:indptr[i+1]] /= xs
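A quick sanity check of the in-place version (my own test matrix; the definition is repeated so the snippet runs standalone, and it assumes no zero-sum columns):

```python
import numpy as np
from scipy import sparse

def Numpy_csc_norm(data, indptr):
    for i in range(indptr.shape[0] - 1):
        xs = np.sum(data[indptr[i]:indptr[i+1]])
        # Modify the view in place
        data[indptr[i]:indptr[i+1]] /= xs

X = sparse.random(100, 50, density=0.3, format='csc', random_state=0)
Numpy_csc_norm(X.data, X.indptr)
# X is modified in place; every nonempty column now sums to 1
```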
Regarding performance, this in-place solution is already not too bad. If you want to improve performance further, you could use Cython/Numba or some other compiled code which can be wrapped up in Python more or less easily.
A Numba solution
import numba as nb

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def Numba_csc_norm(data, indptr):
    for i in nb.prange(indptr.shape[0] - 1):
        acc = 0
        for j in range(indptr[i], indptr[i+1]):
            acc += data[j]
        for j in range(indptr[i], indptr[i+1]):
            data[j] /= acc
Performance
#Create a not too small example matrix
shape = (50_000, 10_000)
X = sparse.random(shape[0], shape[1], density=0.001, format='csc')

#Not in-place, from hpaulj
def hpaulj(X):
    acc = X.sum(axis=0)
    return X.multiply(sparse.csr_matrix(1./acc))
%timeit X2=hpaulj(X)
#6.54 ms ± 67.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Both variants are in-place,
#but this shouldn't have an influence on the timings
%timeit Numpy_csc_norm(X.data,X.indptr)
#79.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#parallel=False -> faster on tiny matrices
%timeit Numba_csc_norm(X.data,X.indptr)
#626 µs ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#parallel=True -> faster on larger matrices
%timeit Numba_csc_norm(X.data,X.indptr)
#185 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)