
Most efficient way to index into a numpy array from a scipy CSR matrix?

I have a numpy ndarray X with shape (4000, 3), where each sample in X is a 3D coordinate (x, y, z).

I have a scipy CSR matrix nn_rad_csr of shape (4000, 4000), which is the nearest-neighbors graph generated from sklearn.neighbors.radius_neighbors_graph(X, 0.01, include_self=True).

nn_rad_csr.toarray()[i] is a shape-(4000,) dense binary vector (0 or 1) encoding the edges of the nearest-neighbors graph incident to node X[i].

For instance, if nn_rad_csr.toarray()[i][j] == 1, then X[j] is within the nearest-neighbor radius of X[i], whereas a value of 0 means it is not a neighbor.
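To make the CSR layout used below concrete, here is a minimal toy example (a hypothetical 4-node graph standing in for the 4000×4000 one) showing how a row's neighbor indices are recovered from the indptr/indices arrays:

```python
import numpy as np
from scipy import sparse

# tiny 4-node adjacency matrix (toy stand-in for the real graph)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=np.uint8)
AS = sparse.csr_matrix(A)

# row i's neighbor column indices live in indices[indptr[i]:indptr[i+1]]
i = 1
neighbors = AS.indices[AS.indptr[i]:AS.indptr[i + 1]]
print(neighbors)  # [0 1 2]
```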

What I'd like is a function radius_graph_conv(X, rad) that returns an array Y in which each row of X is replaced by the average of its neighbors' values. I'm not sure how to exploit the sparsity of the CSR matrix to perform radius_graph_conv efficiently. Two naive implementations of the graph convolution are below.

import numpy as np
from sklearn.neighbors import radius_neighbors_graph, KDTree

def radius_graph_conv(X, rad):
    nn_rad_csr = radius_neighbors_graph(X, rad, include_self=True)
    csr_indices = nn_rad_csr.indices
    csr_indptr  = nn_rad_csr.indptr
    Y = np.copy(X)
    for i in range(X.shape[0]):
        j, k = csr_indptr[i], csr_indptr[i+1]
        neighbor_idx = csr_indices[j:k]
        rad_neighborhood = X[neighbor_idx] # ndim always 2
        Y[i] = np.mean(rad_neighborhood, axis=0)
    return Y

def radius_graph_conv_matmul(X, rad):
    nn_rad_arr = radius_neighbors_graph(X, rad, include_self=True).toarray()
    # row sums count each point's neighbors; keepdims makes the division
    # broadcast per row rather than per column
    neighbor_counts = np.sum(nn_rad_arr, axis=-1, keepdims=True)
    return np.matmul(nn_rad_arr / neighbor_counts, X)

Is there a better way to do this? With a knn graph it's a very simple function, since the number of neighbors is fixed and you can index directly into X; but with a radius- or density-based nearest-neighbors graph, you have to work with the CSR structure (or an array of arrays if you are using a k-d tree).
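For comparison, the knn case mentioned above can be sketched as follows (knn_graph_conv is a hypothetical name; the assumption is that kneighbors_graph with include_self=True yields exactly k nonzeros per row, so the indices reshape into a rectangular array):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_graph_conv(X, k):
    # each row has exactly k neighbors, so the CSR indices array
    # reshapes into an (n, k) index matrix
    nn = kneighbors_graph(X, k, include_self=True)
    neighbor_idx = nn.indices.reshape(X.shape[0], k)
    # fancy indexing gives shape (n, k, d); average over the k neighbors
    return X[neighbor_idx].mean(axis=1)
```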

Here is the direct way of exploiting the CSR format. Your matmul solution probably does something similar under the hood, but we save one lookup (into the .data attribute) by also exploiting the fact that it is an adjacency matrix; likewise, diffing .indptr should be more efficient than summing the equivalent number of ones.

>>> import numpy as np
>>> from scipy import sparse
>>> 
# create mock data
>>> A = np.random.random((100, 100)) < 0.1
>>> A = (A | A.T).view(np.uint8)
>>> AS = sparse.csr_matrix(A)
>>> X = np.random.random((100, 3))
>>> 
# dense solution for reference
>>> Xa = A @ X / A.sum(axis=-1, keepdims=True)
# sparse solution
>>> XaS = np.add.reduceat(X[AS.indices], AS.indptr[:-1], axis=0) / np.diff(AS.indptr)[:, None]
>>> 
# check they are the same
>>> np.allclose(Xa, XaS)
True
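Alternatively, you can stay entirely in sparse land: row-normalize the adjacency matrix with a diagonal scaling and let scipy's sparse matmul do the summation, never densifying the graph. This is a minimal sketch (radius_graph_conv_sparse is a hypothetical name; it assumes every row has at least one neighbor, which include_self=True guarantees):

```python
import numpy as np
from scipy import sparse

def radius_graph_conv_sparse(AS, X):
    # AS: CSR adjacency matrix (with self-loops), X: (n, d) coordinates
    # computes D^-1 A X without ever calling toarray()
    deg = np.diff(AS.indptr)          # neighbor count per row
    Dinv = sparse.diags(1.0 / deg)    # sparse diagonal of inverse degrees
    return Dinv @ AS @ X              # sparse @ sparse @ dense -> dense
```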
