scipy sparse matrix: remove the rows whose all elements are zero

I have a sparse matrix produced by sklearn's TfidfVectorizer. I believe that some of its rows are all-zero rows, and I want to remove them. However, as far as I know, the existing built-in methods, e.g. nonzero() and eliminate_zeros(), deal with zero entries rather than rows.

Is there an easy way to remove all-zero rows from a sparse matrix?

Example: what I have now (actually in sparse format):

[ [0, 0, 0]
  [1, 0, 2]
  [0, 0, 1] ]

What I want to get:

[ [1, 0, 2]
  [0, 0, 1] ]

Slicing + getnnz() does the trick:

M = M[M.getnnz(1)>0]

Works directly on csr_array. You can also remove all-zero columns without changing formats:

M = M[:,M.getnnz(0)>0]
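As a quick check, here is a minimal sketch of both slices applied to the matrix from the question. It assumes the matrix is held as a csr_matrix; the variable names are only illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The matrix from the question, stored in CSR format.
M = csr_matrix(np.array([[0, 0, 0],
                         [1, 0, 2],
                         [0, 0, 1]]))

row_counts = M.getnnz(axis=1)  # number of stored entries in each row
col_counts = M.getnnz(axis=0)  # number of stored entries in each column

rows_kept = M[row_counts > 0]     # drops row 0
cols_kept = M[:, col_counts > 0]  # drops column 1

print(rows_kept.toarray())
print(cols_kept.toarray())
```

Note that getnnz() counts stored entries, so a row that only holds explicitly stored zeros would still be kept.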

However, if you want to remove both, you need

M = M[M.getnnz(1)>0][:,M.getnnz(0)>0] #GOOD

I am not sure why, but

M = M[M.getnnz(1)>0, M.getnnz(0)>0] #BAD

does not work.
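One plausible explanation, offered as an assumption rather than a confirmed account of scipy's internals: if sparse indexing follows NumPy's fancy-indexing rules, two index arrays passed together are paired element by element instead of selecting a row/column grid. The sketch below shows that rule on a plain dense NumPy array, not on a sparse matrix.

```python
import numpy as np

A = np.array([[0, 0, 0],
              [1, 0, 2],
              [0, 0, 1]])

row_mask = (A != 0).any(axis=1)  # [False, True, True]
col_mask = (A != 0).any(axis=0)  # [True, False, True]

# Two index arrays at once are paired up: this picks A[1, 0] and A[2, 2].
paired = A[row_mask, col_mask]

# np.ix_ builds an open mesh, so every kept row meets every kept column.
outer = A[np.ix_(row_mask, col_mask)]

print(paired)  # 1-D array of two elements, not a submatrix
print(outer)   # the 2x2 submatrix
```

Chaining the two slices, as in the #GOOD line, sidesteps the pairing because each slice touches only one axis.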

There's no existing function for this, but it's not too hard to write your own:

import numpy as np
import scipy.sparse
def remove_zero_rows(M):
  M = scipy.sparse.csr_matrix(M)

First, convert the matrix to CSR (compressed sparse row) format. This is important because CSR matrices store their data as a triple of (data, indices, indptr), where data holds the nonzero values, indices stores column indices, and indptr holds row index information. The docs explain it better:

the column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]].
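To make that layout concrete, here is a small sketch that prints the three arrays for the matrix from the question. The variable names start, end, row1_cols and row1_vals are only illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

M = csr_matrix(np.array([[0, 0, 0],
                         [1, 0, 2],
                         [0, 0, 1]]))

print(M.data)     # stored values, row by row
print(M.indices)  # column index of each stored value
print(M.indptr)   # row i spans positions indptr[i]:indptr[i+1]

# Row 1 occupies positions indptr[1]:indptr[2] of data and indices.
start, end = M.indptr[1], M.indptr[2]
row1_cols = M.indices[start:end]
row1_vals = M.data[start:end]
```

Row 0 is empty, which shows up as indptr[0] == indptr[1].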

So, to find rows without any nonzero values, we can just look at successive values of M.indptr: row i is empty exactly when indptr[i] equals indptr[i+1]. Continuing our function from above:

  num_nonzeros = np.diff(M.indptr)
  return M[num_nonzeros != 0]

The second benefit of CSR format here is that it's relatively cheap to slice rows, which simplifies the creation of the resulting matrix.
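Put together, the two pieces of the function read as follows. This is just the code above assembled and run on the matrix from the question; the names example and result are illustrative.

```python
import numpy as np
import scipy.sparse

def remove_zero_rows(M):
    # Convert to CSR so that indptr describes where each row starts and ends.
    M = scipy.sparse.csr_matrix(M)
    # Differences of successive indptr values give stored entries per row.
    num_nonzeros = np.diff(M.indptr)
    # Keep only the rows that store at least one entry.
    return M[num_nonzeros != 0]

example = np.array([[0, 0, 0],
                    [1, 0, 2],
                    [0, 0, 1]])
result = remove_zero_rows(example)
print(result.toarray())
```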

Thanks for your reply, @perimosocordiae

I just found another solution myself. I am posting it here in case someone needs it in the future.

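import numpy

def remove_zero_rows(X):
    # X is a scipy sparse matrix that supports row indexing (e.g. CSR).
    # We want to remove all zero rows from it.
    nonzero_row_indices, _ = X.nonzero()
    # Each row index appears once per nonzero entry, so deduplicate.
    unique_nonzero_indices = numpy.unique(nonzero_row_indices)
    return X[unique_nonzero_indices]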
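One difference between the approaches is worth knowing: getnnz() and the indptr differences count stored entries, including explicitly stored zeros, while nonzero() only reports entries whose value is not zero. The sketch below uses a hypothetical 2x3 matrix, built by hand for illustration, in which row 0 stores a single explicit zero. Calling eliminate_zeros() first brings the two in line.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2x3 matrix whose row 0 holds one explicitly stored zero.
data = np.array([0.0, 1.0, 2.0])
indices = np.array([0, 0, 2])
indptr = np.array([0, 1, 3])
M = csr_matrix((data, indices, indptr), shape=(2, 3))

counts_before = M.getnnz(axis=1)              # the stored zero is counted
rows_with_values = np.unique(M.nonzero()[0])  # nonzero() skips stored zeros

M.eliminate_zeros()                           # drop stored zeros in place
counts_after = M.getnnz(axis=1)
cleaned = M[counts_after > 0]                 # row 0 is now removed

print(counts_before, rows_with_values, counts_after)
```

For output that comes straight from a vectorizer this may never matter, but it can after arithmetic that leaves zeros stored in the matrix.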
