How to get the indices of the diagonal elements in a sparse matrix's data array

Question

I have a sparse matrix in CSR format, e.g.:

>>> a = sp.random(3, 3, 0.6, format='csr')  # an example
>>> a.toarray()  # just to see what it looks like
array([[0.31975333, 0.88437035, 0.        ],
       [0.        , 0.        , 0.        ],
       [0.14013856, 0.56245834, 0.62107962]])
>>> a.data  # data array
array([0.31975333, 0.88437035, 0.14013856, 0.56245834, 0.62107962])

For this particular example, I want to get [0, 4], which are the data-array indices of the non-zero diagonal elements 0.31975333 and 0.62107962.

A simple way to do this is the following:

ind = []
seen = set()
for i, val in enumerate(a.data):
    if val in a.diagonal() and val not in seen:
        ind.append(i)
        seen.add(val)

But in practice the matrix is very big, so I don't want to use for loops or convert it to a numpy array with the toarray() method. Is there a more efficient way to do it?

Edit: I just realized that the above code gives an incorrect result when an off-diagonal element is equal to a diagonal element and precedes it in the data array: it returns the index of that off-diagonal element. It also doesn't return the indices of repeated diagonal values. For example:

a = np.array([[0.31975333, 0.88437035, 0.        ],
              [0.62107962, 0.31975333, 0.        ],
              [0.14013856, 0.56245834, 0.62107962]])
a = sp.csr_matrix(a)

>>> a.data
array([0.31975333, 0.88437035, 0.62107962, 0.31975333, 0.14013856,
       0.56245834, 0.62107962])

My code returns ind = [0, 2], but it should be [0, 3, 6]. The code provided by Andras Deak (his get_rowwise function) returns the correct result.

Answer 1

I've found a possibly more efficient solution, though it still loops. However, it loops over the rows of the matrix rather than over the elements themselves. Depending on the sparsity pattern of your matrix, this might or might not be faster. It costs exactly N iterations for a sparse matrix with N rows.

We just loop through each row, fetch the filled column indices via a.indices and a.indptr, and if the diagonal element for the given row is present among the filled entries, we compute its index:

import numpy as np
import scipy.sparse as sp

def orig_loopy(a):
    ind = []
    seen = set()
    for i, val in enumerate(a.data):
        if val in a.diagonal() and val not in seen:
            ind.append(i)
            seen.add(val)
    return ind

def get_rowwise(a):
    datainds = []
    indices = a.indices # column indices of filled values
    indptr = a.indptr   # auxiliary "pointer" to data indices
    for irow in range(a.shape[0]):
        rowinds = indices[indptr[irow]:indptr[irow+1]] # column indices of the row
        if irow in rowinds:
            # then we've got a diagonal in this row
            # so let's find its index
            datainds.append(indptr[irow] + np.flatnonzero(irow == rowinds)[0])
    return datainds

a = sp.random(300, 300, 0.6, format='csr')
orig_loopy(a) == get_rowwise(a) # True

For a (300, 300)-shaped random input with the same density, the original version runs in 3.7 seconds, while the new version runs in 5.5 milliseconds.
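To see why the row-wise lookup works, it may help to print how indptr slices indices and data for the small matrix from the question. This is only an illustrative sketch; the helper name row_slices is made up here and is not part of SciPy.

```python
import numpy as np
import scipy.sparse as sp

# The 3x3 example matrix from the question's edit.
a = sp.csr_matrix(np.array([[0.31975333, 0.88437035, 0.        ],
                            [0.62107962, 0.31975333, 0.        ],
                            [0.14013856, 0.56245834, 0.62107962]]))

def row_slices(a):
    """Hypothetical helper: list (row, start, stop, column indices) per row."""
    out = []
    for irow in range(a.shape[0]):
        start, stop = a.indptr[irow], a.indptr[irow + 1]
        # a.data[start:stop] holds the stored values of this row,
        # a.indices[start:stop] holds their column indices.
        out.append((irow, int(start), int(stop), a.indices[start:stop].tolist()))
    return out

for irow, start, stop, cols in row_slices(a):
    print(f"row {irow}: data[{start}:{stop}], columns {cols}")
```

A diagonal entry of row i is stored wherever column index i appears inside that row's slice, so its position in a.data is start plus the offset within the slice.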

Answer 2

Method 1

This is a vectorized approach that first generates all nonzero indices and then finds the positions where the row and column indices are equal. It is a bit slow and has high memory usage.

import numpy as np
import scipy.sparse as sp
import numba as nb

def get_diag_ind_vec(csr_array):
    # row and column indices of all nonzero entries
    inds = csr_array.nonzero()
    # positions where the row index equals the column index
    return np.where(inds[0] == inds[1])[0]
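One caveat worth keeping in mind: as far as I can tell, nonzero() drops explicitly stored zeros, so if the matrix holds any, the positions returned above no longer line up with the data array. A minimal sketch that sidesteps this, assuming the CSR-to-COO conversion keeps the stored entries in the same order as the data array, compares the COO row and column arrays directly. The function name get_diag_ind_coo is made up for this illustration.

```python
import numpy as np
import scipy.sparse as sp

def get_diag_ind_coo(csr_array):
    # tocoo() expands indptr into one row index per stored entry;
    # explicitly stored zeros are kept, unlike with nonzero().
    coo = csr_array.tocoo()
    return np.flatnonzero(coo.row == coo.col)

# Small 2x2 matrix with an explicitly stored zero at position (0, 1).
data = np.array([0.0, 5.0])
indices = np.array([1, 1])
indptr = np.array([0, 1, 2])
m = sp.csr_matrix((data, indices, indptr), shape=(2, 2))

print(get_diag_ind_coo(m))  # data-array position of the (1, 1) entry
```

Like Method 1, this still builds a full row-index array, so the memory cost is comparable.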

Method 2

Loopy approaches are generally not a performance problem, as long as you use a compiler such as Numba or Cython. I allocated memory for the maximum number of diagonal elements that could occur. If this method uses too much memory, it can easily be modified.

@nb.jit()
def get_diag_ind(csr_array):
    ind=np.empty(csr_array.shape[0],dtype=np.uint64)
    rowPtr=csr_array.indptr
    colInd=csr_array.indices

    ii=0
    for i in range(rowPtr.shape[0]-1):
      for j in range(rowPtr[i],rowPtr[i+1]):
        if (i==colInd[j]):
          ind[ii]=j
          ii+=1

    return ind[:ii]

Timings

csr_array = sp.random(1000, 1000, 0.5, format='csr')

get_diag_ind_vec(csr_array)   -> 8.25ms
get_diag_ind(csr_array)       -> 0.65ms (first call excluded)
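Numba's nopython mode has no built-in support for SciPy sparse matrix objects, so depending on the Numba version the decorated function in Method 2 may run in slower object mode or fail to compile. A sketch of one way around this, under the assumption that passing the plain indptr and indices arrays is acceptable, is to compile a kernel that only sees NumPy arrays. The name get_diag_ind_arrays and the pure-Python fallback for when Numba is not installed are additions for this illustration, not part of the original answer.

```python
import numpy as np
import scipy.sparse as sp

try:
    import numba as nb
    compile_loop = nb.njit          # compile in nopython mode
except ImportError:                 # assumption: fall back to plain Python
    def compile_loop(func):
        return func

@compile_loop
def get_diag_ind_arrays(indptr, indices):
    n_rows = indptr.shape[0] - 1
    # At most one diagonal entry per row, assuming no duplicate entries.
    ind = np.empty(n_rows, dtype=np.int64)
    count = 0
    for i in range(n_rows):
        for j in range(indptr[i], indptr[i + 1]):
            if indices[j] == i:
                ind[count] = j
                count += 1
                break               # stop scanning this row
    return ind[:count]

csr_array = sp.random(200, 200, 0.5, format='csr')
diag_positions = get_diag_ind_arrays(csr_array.indptr, csr_array.indices)
```

Because of the break, each row contributes at most one index, which keeps the preallocated buffer large enough even if the matrix were to contain duplicate entries.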

Answer 3

Here's my solution, which seems to be faster than get_rowwise (Andras Deak) and get_diag_ind_vec (max9111) (I do not consider the use of Numba or Cython).

The idea is to set the non-zero diagonal elements of the matrix (or its copy) to some unique value x that is not in the original matrix (I chose the max value + 1), and then simply use np.where(a.data == x) to return the desired indices.

def diag_ind(a):
    a = a.copy()
    i = a.diagonal() != 0  
    x = np.max(a.data) + 1
    a[i, i] = x
    return np.where(a.data == x)

Timing:

A = sp.random(1000, 1000, 0.5, format='csr')

>>> %timeit diag_ind(A)
6.32 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit get_diag_ind_vec(A)
14.6 ms ± 292 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit get_rowwise(A)
24.3 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit: copying the sparse matrix (in order to preserve the original matrix) is not memory efficient, so a better solution is to store the diagonal elements and later use them to restore the original matrix.

def diag_ind2(a):
    a_diag = a.diagonal()
    i = a_diag != 0  
    x = np.max(a.data) + 1
    a[i, i] = x
    ind = np.where(a.data == x)
    a[i, i] = a_diag[np.nonzero(a_diag)]
    return ind

This is even faster:

>>> %timeit diag_ind2(A)
2.83 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
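Note that np.where returns a tuple of arrays, so the index array itself is the first element of the result. The sketch below replays the mark-and-restore idea on the small matrix from the question, using integer index arrays instead of the boolean mask (a variation chosen here only to keep the indexing explicit) and checking that the data array is unchanged afterwards. The name diag_ind_restore is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def diag_ind_restore(a):
    a_diag = a.diagonal()
    idx = np.flatnonzero(a_diag)      # rows with a non-zero diagonal entry
    x = np.max(a.data) + 1            # sentinel larger than every stored value
    a[idx, idx] = x                   # mark the diagonal entries in place
    ind = np.where(a.data == x)[0]    # unpack the 1-tuple from np.where
    a[idx, idx] = a_diag[idx]         # put the original values back
    return ind

a = sp.csr_matrix(np.array([[0.31975333, 0.88437035, 0.        ],
                            [0.62107962, 0.31975333, 0.        ],
                            [0.14013856, 0.56245834, 0.62107962]]))
before = a.data.copy()
ind = diag_ind_restore(a)
print(ind)
print(np.array_equal(before, a.data))
```

The max + 1 sentinel assumes floating-point data; for integer matrices whose largest value is close to the dtype limit, the addition could overflow, so a different marker would be needed.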

Answer 1
2 2018-10-16 21:25:51

Answer 2
1 Accepted 2018-10-17 09:48:35

Answer 3
0 2018-10-17 14:30:59
