
Get unique rows from a Scipy sparse matrix

I'm working with sparse matrices in Python, and I wonder if there is an efficient way to remove duplicate rows from a sparse matrix, keeping only the unique rows.

I did not find a function for this, and I'm not sure how to do it without converting the sparse matrix to dense and using numpy.unique.
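For context, the dense baseline the question refers to looks like this (a minimal sketch; the toy matrix is illustrative):

```python
import numpy as np
import scipy.sparse as sp

# Build a small sparse matrix with a duplicated row
A = sp.csr_matrix(np.array([[1, 0, 2],
                            [1, 0, 2],
                            [0, 3, 0]]))

# The naive approach: densify, then deduplicate with numpy.unique
unique_rows = np.unique(A.toarray(), axis=0)
```

This works, but it materializes the full dense array, which defeats the purpose of a sparse representation for large matrices.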

There is no quick way to do it, so I had to write a function. It returns a sparse matrix with the unique rows (axis=0) or columns (axis=1) of an input sparse matrix. Note that the unique rows or columns of the returned matrix are not lexicographically sorted (as they would be with np.unique).

import numpy as np
import scipy.sparse as sp

def sp_unique(sp_matrix, axis=0):
    ''' Returns a sparse matrix with the unique rows (axis=0)
    or columns (axis=1) of an input sparse matrix sp_matrix'''
    if axis == 1:
        sp_matrix = sp_matrix.T

    old_format = sp_matrix.format  # .getformat() is deprecated in recent SciPy
    dt = sp_matrix.dtype
    ncols = sp_matrix.shape[1]

    if old_format != 'lil':
        sp_matrix = sp_matrix.tolil()

    # For LIL, .data and .rows are object arrays of per-row lists; element-wise
    # "+" concatenates each row's values with its column indices, giving a
    # comparable per-row key for np.unique
    _, ind = np.unique(sp_matrix.data + sp_matrix.rows, return_index=True)
    rows = sp_matrix.rows[ind]
    data = sp_matrix.data[ind]
    nrows_uniq = data.shape[0]

    sp_matrix = sp.lil_matrix((nrows_uniq, ncols), dtype=dt)  #  or sp_matrix.resize(nrows_uniq, ncols)
    sp_matrix.data = data
    sp_matrix.rows = rows

    ret = sp_matrix.asformat(old_format)
    if axis == 1:
        ret = ret.T        
    return ret


def lexsort_row(A):
    ''' numpy lexsort of the rows, not used in sp_unique'''
    return A[np.lexsort(A.T[::-1])]

if __name__ == '__main__':    
    # Test
    # Create a large sparse matrix with elements in [0, 10]
    A = 10*sp.random(10000, 3, 0.5, format='csr')
    A = np.ceil(A).astype(int)

    # unique rows
    A_uniq = sp_unique(A, axis=0).toarray()
    A_uniq = lexsort_row(A_uniq)
    A_uniq_numpy = np.unique(A.toarray(), axis=0)
    assert (A_uniq == A_uniq_numpy).all()

    # unique columns
    A_uniq = sp_unique(A, axis=1).toarray()
    A_uniq = lexsort_row(A_uniq.T).T
    A_uniq_numpy = np.unique(A.toarray(), axis=1)
    assert (A_uniq == A_uniq_numpy).all()  

One could also use slicing:

def remove_duplicate_rows(data):
    '''Return a copy of a CSR matrix with duplicate rows removed.'''
    unique_row_indices = []
    seen = set()
    for row_idx, row in enumerate(data):
        # Key on both the column indices and the values, so that rows with the
        # same sparsity pattern but different values are not merged
        key = (tuple(row.indices), tuple(row.data))
        if key not in seen:
            seen.add(key)
            unique_row_indices.append(row_idx)
    return data[unique_row_indices]

I found this especially helpful in a supervised machine-learning setting, where the input to my function was data and labels. With this approach, I could easily return

labels[unique_row_indices]

as well, to make sure data and labels stay aligned after this clean-up.
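A self-contained sketch of that pattern (the toy `data`/`labels` arrays and the index-returning variant are illustrative, not from the original answer):

```python
import numpy as np
import scipy.sparse as sp

def unique_rows_with_indices(data):
    # Variant of remove_duplicate_rows that also returns the kept row indices,
    # so parallel arrays such as labels can be filtered the same way
    unique_row_indices, seen = [], set()
    for row_idx, row in enumerate(data):
        key = (tuple(row.indices), tuple(row.data))
        if key not in seen:
            seen.add(key)
            unique_row_indices.append(row_idx)
    return data[unique_row_indices], unique_row_indices

data = sp.csr_matrix(np.array([[1., 0.], [1., 0.], [0., 2.]]))
labels = np.array([0, 0, 1])

data_u, idx = unique_rows_with_indices(data)
labels_u = labels[idx]  # stays aligned with the deduplicated rows
```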
