
Split sparse matrix by rows

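I have a scipy.sparse.csr.csr_matrix of shape (8723, 1741277).

How can I efficiently split it into n chunks by rows?

Preferably the chunks should be roughly equal in number of rows.

I say roughly because it depends on whether (number of rows) / (number of chunks) leaves a remainder.

I thought this could be done easily with numpy.split, as for arrays, but it does not seem to work for sparse matrices.

Specifically, if I choose a number of chunks that does not divide 8723 evenly, I get this error: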

ValueError: array split does not result in an equal division

If I choose a number of chunks that divides 8723 evenly, I get this error:

AxisError: axis1: axis 0 is out of bounds for array of dimension 0

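The reason I want to split the sparse matrix into chunks is that I want to convert it to a (dense) array, but I cannot do that directly because the matrix as a whole is too large.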

In [6]: from scipy import sparse                                                                     
In [7]: M = sparse.random(12,3,.1,'csr')                                                             
In [8]: np.split?                                                                                    
In [9]: np.split(M,3)                                                                                
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57 

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
    687         else:
--> 688             raise AttributeError(attr + " not found")
    689 

AttributeError: swapaxes not found

During handling of the above exception, another exception occurred:

AxisError                                 Traceback (most recent call last)
<ipython-input-9-11a4dcdd89af> in <module>
----> 1 np.split(M,3)

/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
    848             raise ValueError(
    849                 'array split does not result in an equal division')
--> 850     res = array_split(ary, indices_or_sections, axis)
    851     return res
    852 

/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in array_split(ary, indices_or_sections, axis)
    760 
    761     sub_arys = []
--> 762     sary = _nx.swapaxes(ary, axis, 0)
    763     for i in range(Nsections):
    764         st = div_points[i]

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in swapaxes(a, axis1, axis2)
    583 
    584     """
--> 585     return _wrapfunc(a, 'swapaxes', axis1, axis2)
    586 
    587 

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     64     # a downstream library like 'pandas'.
     65     except (AttributeError, TypeError):
---> 66         return _wrapit(obj, method, *args, **kwds)
     67 
     68 

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     44     except AttributeError:
     45         wrap = None
---> 46     result = getattr(asarray(obj), method)(*args, **kwds)
     47     if wrap:
     48         if not isinstance(result, mu.ndarray):

AxisError: axis1: axis 0 is out of bounds for array of dimension 0

If we apply np.array to M we get a 0d object array; it is just a naive wrapper around the sparse object.

In [10]: np.array(M)                                                                                 
Out[10]: 
array(<12x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>, dtype=object)
In [11]: _.shape                                                                                     
Out[11]: ()

Splitting the dense equivalent works correctly:

In [12]: np.split(M.A,3)                                                                             
Out[12]: 
[array([[0.        , 0.61858517, 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ]]), array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]), array([[0.        , 0.89573059, 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.02334738],
        [0.        , 0.        , 0.        ]])]
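For dense arrays, the "equal division" ValueError quoted in the question can be avoided with np.array_split, which accepts a section count that does not divide the row count. A minimal sketch, using a made-up 13-row array (the names `dense` and `parts` are illustrative only); it still does not apply to sparse matrices:

```python
import numpy as np

# A made-up dense array with 13 rows; 13 is not divisible by 3.
dense = np.arange(26).reshape(13, 2)

# np.split(dense, 3) would raise ValueError here.
# np.array_split gives the leading sections one extra row instead.
parts = np.array_split(dense, 3)
shapes = [p.shape for p in parts]
print(shapes)
```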

And a direct sparse split:

In [13]: [M[i:j,:] for i,j in zip([0,4,8],[4,8,12])]                                                 
Out[13]: 
[<4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>]

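Slicing like this is not as efficient for a sparse matrix as it is for a dense one. Dense slices are views. Sparse slices have to be copies. The only exception is the lil format, which has a getrowview method. While there are a number of functions for constructing sparse matrices from pieces, there is not as much need for functions that split them apart.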
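The hard-coded boundaries in the list comprehension above can be generalized to any number of chunks. The following is a sketch only, not a SciPy function: `row_bounds` and `iter_row_chunks` are hypothetical helper names, and the 13x3 random matrix is a stand-in for the real data. The first chunks get one extra row when the division leaves a remainder, the same rule np.array_split uses.

```python
import numpy as np
from scipy import sparse

def row_bounds(n_rows, n_chunks):
    """Return n_chunks + 1 row offsets; chunk sizes differ by at most one."""
    base, extra = divmod(n_rows, n_chunks)
    bounds = [0]
    for k in range(n_chunks):
        # The first `extra` chunks absorb the remainder, one row each.
        bounds.append(bounds[-1] + base + (1 if k < extra else 0))
    return bounds

def iter_row_chunks(mat, n_chunks):
    """Yield row slices of a sparse matrix one at a time (each is a copy)."""
    bounds = row_bounds(mat.shape[0], n_chunks)
    for start, stop in zip(bounds[:-1], bounds[1:]):
        yield mat[start:stop, :]

# Small stand-in matrix; 13 rows do not divide evenly into 3 chunks.
M = sparse.random(13, 3, density=0.1, format='csr')

chunks = list(iter_row_chunks(M, 3))

# Densify one chunk at a time so only one dense block is alive at once.
total = 0.0
for chunk in iter_row_chunks(M, 3):
    block = chunk.toarray()
    total += block.sum()
```

Because the helper is a generator, each dense block can be processed and discarded before the next slice is made, which is the point when the full dense array would not fit in memory.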

sklearn may have some splitting functions. It has some sparse utility functions that handle its own uses of sparse matrices.
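One candidate on the sklearn side is sklearn.utils.gen_even_slices, which yields slice objects of near-equal length. It does not split a matrix itself; this sketch assumes scikit-learn is installed and simply applies the slices to the rows of a small stand-in CSR matrix:

```python
from scipy import sparse
from sklearn.utils import gen_even_slices

# Small stand-in matrix with a row count not divisible by 3.
M = sparse.random(13, 3, density=0.1, format='csr')

# gen_even_slices(n, n_packs) yields n_packs slices covering range(n).
slices = list(gen_even_slices(M.shape[0], 3))

# Apply each slice to the rows; every result is a sparse copy.
chunks = [M[s, :] for s in slices]
shapes = [c.shape for c in chunks]
print(slices)
print(shapes)
```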

