
Split sparse matrix by rows

I have a scipy.sparse.csr.csr_matrix with shape (8723, 1741277).

How can I efficiently split it into n chunks by rows?

Ideally the chunks should be approximately equal in number of rows.

I say approximately because it depends on whether (number of rows)/(number of chunks) leaves a remainder.

I think you can easily do this with numpy.split for arrays, but it does not seem to work for sparse matrices.

Specifically, I get this error if I choose a number of chunks that does not divide 8723 evenly:

ValueError: array split does not result in an equal division

and I get this error if I choose a number of chunks that does divide 8723 evenly:

AxisError: axis1: axis 0 is out of bounds for array of dimension 0

The reason I want to split the sparse matrix into chunks is that I want to convert it to a (dense) array, but I cannot do that directly because the matrix as a whole is too big.

In [6]: from scipy import sparse                                                                     
In [7]: M = sparse.random(12,3,.1,'csr')                                                             
In [8]: np.split?                                                                                    
In [9]: np.split(M,3)                                                                                
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57 

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
    687         else:
--> 688             raise AttributeError(attr + " not found")
    689 

AttributeError: swapaxes not found

During handling of the above exception, another exception occurred:

AxisError                                 Traceback (most recent call last)
<ipython-input-9-11a4dcdd89af> in <module>
----> 1 np.split(M,3)

/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
    848             raise ValueError(
    849                 'array split does not result in an equal division')
--> 850     res = array_split(ary, indices_or_sections, axis)
    851     return res
    852 

/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in array_split(ary, indices_or_sections, axis)
    760 
    761     sub_arys = []
--> 762     sary = _nx.swapaxes(ary, axis, 0)
    763     for i in range(Nsections):
    764         st = div_points[i]

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in swapaxes(a, axis1, axis2)
    583 
    584     """
--> 585     return _wrapfunc(a, 'swapaxes', axis1, axis2)
    586 
    587 

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     64     # a downstream library like 'pandas'.
     65     except (AttributeError, TypeError):
---> 66         return _wrapit(obj, method, *args, **kwds)
     67 
     68 

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     44     except AttributeError:
     45         wrap = None
---> 46     result = getattr(asarray(obj), method)(*args, **kwds)
     47     if wrap:
     48         if not isinstance(result, mu.ndarray):

AxisError: axis1: axis 0 is out of bounds for array of dimension 0

If we apply np.array to M we get a 0d object array, which is just a naive wrapper around the sparse object. That is why np.split fails: it sees an array with no axis 0 to split along.

In [10]: np.array(M)                                                                                 
Out[10]: 
array(<12x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>, dtype=object)
In [11]: _.shape                                                                                     
Out[11]: ()

Splitting a correct dense equivalent works (M.A is shorthand for M.toarray()):

In [12]: np.split(M.A,3)                                                                             
Out[12]: 
[array([[0.        , 0.61858517, 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ]]), array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]), array([[0.        , 0.89573059, 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.02334738],
        [0.        , 0.        , 0.        ]])]

And a direct sparse split, using row slices:

In [13]: [M[i:j,:] for i,j in zip([0,4,8],[4,8,12])]                                                 
Out[13]: 
[<4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>]
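
The hard-coded boundaries above can be generalized to any number of chunks, including ones that leave a remainder. The following is a minimal sketch, not a SciPy or NumPy function: `split_rows` is a hypothetical helper name, and it assumes that applying `np.array_split` to the row indices (rather than to the sparse matrix itself) is an acceptable way to get near-equal blocks.

```python
import numpy as np
from scipy import sparse


def split_rows(mat, n_chunks):
    """Split a sparse matrix into n_chunks row blocks of near-equal size.

    Hypothetical helper, not part of SciPy or NumPy.
    """
    n_rows = mat.shape[0]
    # np.array_split on the row indices tolerates a remainder: the first
    # (n_rows % n_chunks) blocks get one extra row.
    index_blocks = np.array_split(np.arange(n_rows), n_chunks)
    # Each block is a contiguous range, so a start:stop slice is enough.
    # Empty blocks (n_chunks > n_rows) are skipped.
    return [mat[b[0]:b[-1] + 1, :] for b in index_blocks if len(b)]


M = sparse.random(13, 3, density=0.2, format='csr')
chunks = split_rows(M, 3)
print([c.shape for c in chunks])  # [(5, 3), (4, 3), (4, 3)]
```

Each element of `chunks` is still a sparse matrix, so nothing is densified at this point.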

Slicing like this isn't as efficient with sparse matrices as with dense ones. Dense slices are views. Sparse ones must be copies. The only exception is the lil format, which has a getrowview method. While there are many functions for constructing sparse matrices from pieces, there isn't much need for functions that split them up.

It is possible that sklearn has some splitting functions. It has some sparse utility functions that address its own uses of sparse matrices.
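
Since the stated goal is to convert the matrix to dense form piece by piece, one option is to densify only one row block at a time. The sketch below assumes that approach; `iter_dense_row_blocks` is a hypothetical name, and the column sum is only a stand-in for whatever work is actually done on each dense block.

```python
import numpy as np
from scipy import sparse


def iter_dense_row_blocks(mat, n_chunks):
    """Yield (start, stop, dense_block) for near-equal row blocks of mat.

    Hypothetical helper; only one dense block is created per iteration.
    """
    n_rows = mat.shape[0]
    # Integer boundaries from 0 to n_rows, n_chunks + 1 of them.
    bounds = np.linspace(0, n_rows, n_chunks + 1).astype(int)
    for start, stop in zip(bounds[:-1], bounds[1:]):
        if stop > start:  # skip empty blocks when n_chunks > n_rows
            yield start, stop, mat[start:stop, :].toarray()


M = sparse.random(10, 4, density=0.3, format='csr')
col_sums = np.zeros(M.shape[1])
for start, stop, block in iter_dense_row_blocks(M, 3):
    col_sums += block.sum(axis=0)  # stand-in for the real per-block work
```

For the matrix in the question, a single float64 row of 1,741,277 columns takes about 14 MB when dense, so the number of chunks has to be chosen so that one block of rows fits in memory.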


 