Split sparse matrix by rows
I have a scipy.sparse.csr.csr_matrix of dimensions (8723, 1741277).
How can I efficiently split it into n chunks by rows?
Ideally the chunks would have approximately the same number of rows.
I say approximately because it depends on whether (number of rows)/(number of chunks) leaves a remainder.
I think that you can easily do this with numpy.split for arrays, but it does not seem to work for sparse matrices.
Specifically, I get this error if I choose a number of chunks that does not evenly divide 8723:
ValueError: array split does not result in an equal division
and I get this error if I choose a number of chunks that evenly divides 8723:
AxisError: axis1: axis 0 is out of bounds for array of dimension 0
The reason I want to split the sparse matrix into chunks is that I want to convert it to a (dense) array, but I cannot do that directly because it is too big as a whole.
In [6]: from scipy import sparse
In [7]: M = sparse.random(12,3,.1,'csr')
In [8]: np.split?
In [9]: np.split(M,3)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
55 try:
---> 56 return getattr(obj, method)(*args, **kwds)
57
/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
687 else:
--> 688 raise AttributeError(attr + " not found")
689
AttributeError: swapaxes not found
During handling of the above exception, another exception occurred:
AxisError Traceback (most recent call last)
<ipython-input-9-11a4dcdd89af> in <module>
----> 1 np.split(M,3)
/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
848 raise ValueError(
849 'array split does not result in an equal division')
--> 850 res = array_split(ary, indices_or_sections, axis)
851 return res
852
/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in array_split(ary, indices_or_sections, axis)
760
761 sub_arys = []
--> 762 sary = _nx.swapaxes(ary, axis, 0)
763 for i in range(Nsections):
764 st = div_points[i]
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in swapaxes(a, axis1, axis2)
583
584 """
--> 585 return _wrapfunc(a, 'swapaxes', axis1, axis2)
586
587
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
64 # a downstream library like 'pandas'.
65 except (AttributeError, TypeError):
---> 66 return _wrapit(obj, method, *args, **kwds)
67
68
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
44 except AttributeError:
45 wrap = None
---> 46 result = getattr(asarray(obj), method)(*args, **kwds)
47 if wrap:
48 if not isinstance(result, mu.ndarray):
AxisError: axis1: axis 0 is out of bounds for array of dimension 0
If we apply np.array to M we get a 0d object array, just a naive wrapper around the sparse object.
In [10]: np.array(M)
Out[10]:
array(<12x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>, dtype=object)
In [11]: _.shape
Out[11]: ()
Splitting a correct dense equivalent works:
In [12]: np.split(M.A,3)
Out[12]:
[array([[0. , 0.61858517, 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ]]), array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]), array([[0. , 0.89573059, 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0.02334738],
[0. , 0. , 0. ]])]
and a direct sparse split:
In [13]: [M[i:j,:] for i,j in zip([0,4,8],[4,8,12])]
Out[13]:
[<4x3 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>,
<4x3 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>,
<4x3 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>]
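The hard-coded boundaries above can be generalized to any number of chunks. The following is only a sketch: split_rows is a hypothetical helper name, not a SciPy function, and it picks the chunk sizes the way np.array_split does for dense arrays (the first few chunks get one extra row when the division leaves a remainder).

```python
import numpy as np
from scipy import sparse

def split_rows(M, n_chunks):
    """Split sparse matrix M into n_chunks row blocks of near-equal size.

    Hypothetical helper: the first (n_rows % n_chunks) blocks get one
    extra row, mirroring what np.array_split does for dense arrays.
    """
    n_rows = M.shape[0]
    base, extra = divmod(n_rows, n_chunks)
    chunks = []
    start = 0
    for k in range(n_chunks):
        stop = start + base + (1 if k < extra else 0)
        chunks.append(M[start:stop, :])  # each slice is a copy, not a view
        start = stop
    return chunks

# 13 rows do not divide evenly by 3, so the sizes come out as 5, 4, 4.
M = sparse.random(13, 3, density=0.1, format='csr')
chunks = split_rows(M, 3)
print([c.shape for c in chunks])
```

With the 8723-row matrix from the question and 10 chunks, this would give three chunks of 873 rows followed by seven of 872.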
Slicing like this isn't as efficient with sparse matrices as with dense ones. Dense slices are views. Sparse ones must be copies. The only exception is the lil format, which has a getrowview method. While there are many functions for constructing sparse matrices from pieces, there isn't much need for functions that split them up.
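The view-versus-copy distinction can be checked on a small example. This is a sketch under the assumption that the installed SciPy exposes the method as lil_matrix.getrowview; the matrix and values are made up for illustration.

```python
from scipy import sparse

# Small made-up matrix in both formats.
L = sparse.lil_matrix((4, 3))
L[1, 2] = 5.0
C = L.tocsr()

# A CSR row slice is an independent copy: writing to the original
# afterwards does not change the slice.
csr_slice = C[1:2, :]
C.data[0] = 9.0
print(csr_slice[0, 2], C[1, 2])

# A LIL row view shares storage with its parent: writing through the
# view is visible in the original matrix.
row = L.getrowview(1)
row[0, 0] = 7.0
print(L[1, 0])
```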
It is possible that sklearn has some splitting functions. It has some sparse utility functions that address its own uses of sparse matrices.
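For the use case in the question (converting to dense without holding the whole dense array in memory), one option is to generate the dense blocks one at a time. This is a rough sketch: iter_dense_blocks and rows_per_block are hypothetical names, and the column sum is only a stand-in for whatever per-block processing is actually needed.

```python
import numpy as np
from scipy import sparse

def iter_dense_blocks(M, rows_per_block):
    """Yield consecutive row blocks of sparse matrix M as dense arrays.

    Hypothetical helper: only one dense block exists at a time, so peak
    memory is about rows_per_block * n_cols values.
    """
    M = M.tocsr()  # CSR is the format suited to row slicing
    for start in range(0, M.shape[0], rows_per_block):
        yield M[start:start + rows_per_block, :].toarray()

M = sparse.random(12, 3, density=0.1, format='csr')

# Stand-in workload: accumulate column sums block by block.
col_sums = np.zeros(M.shape[1])
block_rows = []
for block in iter_dense_blocks(M, 5):
    block_rows.append(block.shape[0])
    col_sums += block.sum(axis=0)
print(block_rows)
```

Here the block size is fixed and the last block is simply shorter, which is usually enough when the goal is to bound memory rather than to get equal chunks.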