
How to get indices of diagonal elements of a sparse matrix data array

I have a sparse matrix in CSR format, e.g.:

>>> a = sp.random(3, 3, 0.6, format='csr')  # an example
>>> a.toarray()  # just to see what it looks like
array([[0.31975333, 0.88437035, 0.        ],
       [0.        , 0.        , 0.        ],
       [0.14013856, 0.56245834, 0.62107962]])
>>> a.data  # data array
array([0.31975333, 0.88437035, 0.14013856, 0.56245834, 0.62107962])

For this particular example, I want to get [0, 4], which are the data-array indices of the nonzero diagonal elements 0.31975333 and 0.62107962.

A simple way to do this is the following:

ind = []
seen = set()
for i, val in enumerate(a.data):
    if val in a.diagonal() and val not in seen:
        ind.append(i)
        seen.add(val)

But in practice the matrix is very big, so I want to avoid Python for loops and converting to a dense array with the toarray() method. Is there a more efficient way to do this?

Edit: I just realized that the above code gives an incorrect result when an off-diagonal element is equal to, and precedes, one of the diagonal elements: it returns the index of that off-diagonal element instead. It also misses the indices of repeated diagonal values. For example:

a = np.array([[0.31975333, 0.88437035, 0.        ],
              [0.62107962, 0.31975333, 0.        ],
              [0.14013856, 0.56245834, 0.62107962]])
a = sp.csr_matrix(a)

>>> a.data
array([0.31975333, 0.88437035, 0.62107962, 0.31975333, 0.14013856,
       0.56245834, 0.62107962])

My code returns ind = [0, 2], but it should be [0, 3, 6]. The code provided by Andras Deak (his get_rowwise function) returns the correct result.

I've found a possibly more efficient solution, though it still loops. However, it loops over the rows of the matrix rather than over the elements themselves. Depending on the sparsity pattern of your matrix this might or might not be faster. It is guaranteed to cost N iterations for a sparse matrix with N rows.

We just loop through each row, fetch the stored column indices via a.indices and a.indptr, and if the diagonal element for the given row is among the stored values we compute its index:

import numpy as np
import scipy.sparse as sp

def orig_loopy(a):
    ind = []
    seen = set()
    for i, val in enumerate(a.data):
        if val in a.diagonal() and val not in seen:
            ind.append(i)
            seen.add(val)
    return ind

def get_rowwise(a):
    datainds = []
    indices = a.indices # column indices of filled values
    indptr = a.indptr   # auxiliary "pointer" to data indices
    for irow in range(a.shape[0]):
        rowinds = indices[indptr[irow]:indptr[irow+1]] # column indices of the row
        if irow in rowinds:
            # then we've got a diagonal in this row
            # so let's find its index
            datainds.append(indptr[irow] + np.flatnonzero(irow == rowinds)[0])
    return datainds

a = sp.random(300, 300, 0.6, format='csr')
orig_loopy(a) == get_rowwise(a) # True

For a (300, 300)-shaped random input with the same density, the original version runs in 3.7 seconds, the new version in 5.5 milliseconds.
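For reference, the same row-wise idea can also be vectorized completely: expanding indptr into one row index per stored element gives the row of every entry in a.data, and the diagonal entries are simply the positions where that row equals the stored column index. The function get_rowwise_vec below is my own sketch, not part of the original answer:

```python
import numpy as np
import scipy.sparse as sp

def get_rowwise_vec(a):
    # one row index per stored element, expanded from the CSR row pointer
    rows = np.repeat(np.arange(a.shape[0]), np.diff(a.indptr))
    # positions in a.data where the column index equals the row index
    return np.flatnonzero(a.indices == rows)

a = sp.csr_matrix(np.array([[0.31975333, 0.88437035, 0.        ],
                            [0.        , 0.        , 0.        ],
                            [0.14013856, 0.56245834, 0.62107962]]))
print(get_rowwise_vec(a))  # [0 4]
```

This avoids the Python-level loop entirely at the cost of materializing one integer array the size of a.data.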

Method 1

This is a vectorized approach, which generates all nonzero indices first and then gets the positions where the row and column indices are equal. It is a bit slower and has high memory usage.

import numpy as np
import scipy.sparse as sp
import numba as nb

def get_diag_ind_vec(csr_array):
    # note: nonzero() skips explicitly stored zeros, so this assumes the
    # returned positions line up with csr_array.data (no explicit zeros stored)
    inds = csr_array.nonzero()
    return np.where(inds[0] == inds[1])[0]
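As a self-contained sanity check, here it is on the edge-case matrix from the question's edit (this relies on nonzero() visiting the stored entries in the same row-major order as csr_array.data, which holds for a canonical CSR matrix with no explicitly stored zeros):

```python
import numpy as np
import scipy.sparse as sp

def get_diag_ind_vec(csr_array):
    # positions in the data array where row index == column index
    inds = csr_array.nonzero()
    return np.where(inds[0] == inds[1])[0]

a = sp.csr_matrix(np.array([[0.31975333, 0.88437035, 0.        ],
                            [0.62107962, 0.31975333, 0.        ],
                            [0.14013856, 0.56245834, 0.62107962]]))
print(get_diag_ind_vec(a))  # [0 3 6]
```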

Method 2

Loopy approaches are in general no problem performance-wise, as long as you make use of a compiler, e.g. Numba or Cython. I allocated memory for the maximum number of diagonal elements that could occur. If this method uses too much memory, it can easily be modified.

@nb.jit()
def get_diag_ind(csr_array):
    ind = np.empty(csr_array.shape[0], dtype=np.uint64)
    rowPtr = csr_array.indptr
    colInd = csr_array.indices

    ii = 0
    for i in range(rowPtr.shape[0] - 1):
        for j in range(rowPtr[i], rowPtr[i + 1]):
            if i == colInd[j]:
                ind[ii] = j
                ii += 1

    return ind[:ii]

Timings

csr_array = sp.random(1000, 1000, 0.5, format='csr')

get_diag_ind_vec(csr_array)   -> 8.25ms
get_diag_ind(csr_array)       -> 0.65ms (first call excluded)

Here's my solution, which seems to be faster than get_rowwise (Andras Deak) and get_diag_ind_vec (max9111); I am not considering the Numba or Cython versions.

The idea is to set the non-zero diagonal elements of the matrix (or a copy of it) to some unique value x that does not occur in the original matrix (I chose the maximum value + 1), and then simply use np.where(a.data == x) to get the desired indices.

def diag_ind(a):
    a = a.copy()
    i = a.diagonal() != 0  
    x = np.max(a.data) + 1
    a[i, i] = x
    return np.where(a.data == x)
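As a quick check on the tricky example from my edit above, here is the same idea in a standalone snippet (I use integer indices from np.flatnonzero, which select the same diagonal entries as the boolean mask in diag_ind):

```python
import numpy as np
import scipy.sparse as sp

def diag_ind_demo(a):
    # same sentinel-value idea as diag_ind, with integer instead of boolean indices
    a = a.copy()
    idx = np.flatnonzero(a.diagonal())  # rows with a nonzero diagonal entry
    x = np.max(a.data) + 1              # sentinel value not present in a.data
    a[idx, idx] = x                     # mark the diagonal entries
    return np.where(a.data == x)

a = sp.csr_matrix(np.array([[0.31975333, 0.88437035, 0.        ],
                            [0.62107962, 0.31975333, 0.        ],
                            [0.14013856, 0.56245834, 0.62107962]]))
print(diag_ind_demo(a)[0])  # [0 3 6]
```

Unlike the value-matching loop from the question, the sentinel is unique by construction, so repeated values and equal off-diagonal entries cannot confuse it.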

Timing:

A = sp.random(1000, 1000, 0.5, format='csr')

>>> %timeit diag_ind(A)
6.32 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit get_diag_ind_vec(A)
14.6 ms ± 292 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit get_rowwise(A)
24.3 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit: copying the sparse matrix (to preserve the original) is not memory efficient, so a better solution is to store the diagonal elements and later use them to restore the original matrix.

def diag_ind2(a):
    a_diag = a.diagonal()
    i = a_diag != 0  
    x = np.max(a.data) + 1
    a[i, i] = x
    ind = np.where(a.data == x)
    a[i, i] = a_diag[np.nonzero(a_diag)]
    return ind
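Here is a self-contained check that the in-place version really leaves the matrix unchanged (the function is repeated so the snippet runs on its own; as before, the integer indices from np.flatnonzero pick out the same entries as the boolean mask):

```python
import numpy as np
import scipy.sparse as sp

def diag_ind2(a):
    a_diag = a.diagonal()
    idx = np.flatnonzero(a_diag)  # rows with a nonzero diagonal entry
    x = np.max(a.data) + 1        # sentinel value not present in the matrix
    a[idx, idx] = x               # mark the diagonal entries in place
    ind = np.where(a.data == x)
    a[idx, idx] = a_diag[idx]     # restore the original diagonal values
    return ind

A = sp.random(100, 100, 0.5, format='csr')
before = A.toarray()
ind = diag_ind2(A)
print(np.allclose(A.toarray(), before))  # True: A is restored exactly
```

Since only already-stored diagonal entries are overwritten and then restored, the sparsity structure of A never changes, which is why no copy is needed.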

This is even faster:

>>> %timeit diag_ind2(A)
2.83 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
