How to find indices of non-zero elements in a large sparse matrix?

I have two square matrices (a, b), each on the order of 100000 x 100000. I need to take the difference of these two matrices (c = a - b). The resulting matrix c is sparse, and I want to find the indices of all its non-zero elements. I have to do this operation many times (>100).

The simplest way is to use two for loops, but that is computationally intensive. Can you suggest an algorithm or a package/library, preferably in R, Python, or C, that does this as quickly as possible?

Since you have two dense matrices, the double for loop is the only option you have. You don't need a sparse matrix class at all, since you only want the list of indices (i,j) for which a[i,j] != b[i,j].

In languages like R and Python, the double for loop will perform poorly. I'd probably write the double for loop in native code and append the indices to a list object. But no doubt the wizards of interpreted code (i.e. R, Python, etc.) know efficient ways to do it without resorting to native coding.
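One way to avoid an interpreted double loop in Python is to let NumPy do the comparison a block of rows at a time. The sketch below is only an illustration under assumptions: the function name diff_indices_blocked and the block_rows parameter are made up for this example, and the tiny arrays stand in for the real 100000 x 100000 matrices, which would have to fit in memory or be memory-mapped.

```python
import numpy as np

def diff_indices_blocked(a, b, block_rows=1024):
    """Return (rows, cols) where a != b, scanning block_rows rows at a time."""
    if a.shape != b.shape:
        raise ValueError("a and b must have the same shape")
    row_parts, col_parts = [], []
    for start in range(0, a.shape[0], block_rows):
        stop = min(start + block_rows, a.shape[0])
        # Build the boolean mask for this block only, so the temporary
        # mask never covers the whole matrix.
        r, c = np.nonzero(a[start:stop] != b[start:stop])
        row_parts.append(r + start)  # shift back to global row numbers
        col_parts.append(c)
    if not row_parts:  # matrices with zero rows
        return np.empty(0, dtype=np.intp), np.empty(0, dtype=np.intp)
    return np.concatenate(row_parts), np.concatenate(col_parts)

# Small demonstration; the real matrices would be far larger.
a = np.arange(20).reshape(4, 5)
b = a.copy()
b[1, 3] = -1
b[3, 0] = -1
rows, cols = diff_indices_blocked(a, b, block_rows=2)
```

Here rows and cols hold the row and column numbers (0-based) of the two entries that were changed in b. The block size trades temporary memory against the number of Python-level iterations.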

In R, if you use the Matrix package, and sparseMatrix for the conversion from the coordinate list to the sparse matrix, then you can convert back to the three-column form via:

library(Matrix)
TmpX <- as(M, "dgTMatrix")
# The values live in slot x; slots i and j hold 0-based row and column indices.
X3col <- matrix(c(TmpX@i, TmpX@j, TmpX@x), ncol = 3)

This will give you the coordinates and values in the sparse matrix.

Depending on the locations of non-zero entries in A and B, you may find it much better to work with the coordinate list than with the sparse matrix representation (there are, by the way, dozens of sparse matrix representations), as you can take direct advantage of vectorized operations rather than rely upon your sparse matrix package to perform optimally. I tend to alternate between using the COO format and the sparse matrix support in different languages, depending on which gives the fastest performance for the algorithm of interest.
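For readers working in Python rather than R, the same coordinate-list (COO) view is available through SciPy. This is a minimal sketch, assuming SciPy is installed; the 3 x 4 matrix and the variable names are invented for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x4 example matrix with three non-zero entries.
dense = np.array([[0, 5, 0, 0],
                  [0, 0, 0, 7],
                  [2, 0, 0, 0]])
m = sparse.csr_matrix(dense)

coo = m.tocoo()  # coordinate (triplet) representation
# One row per stored entry: (row index, column index, value), 0-based.
triplets = np.column_stack((coo.row, coo.col, coo.data))

# The triplet arrays support ordinary vectorized operations,
# for example selecting the entries whose value exceeds 4.
large = triplets[coo.data > 4]
```

Unlike R, the indices here are 0-based, so add 1 when comparing against R output.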


Update 1: I was unaware that your two matrices, A and B, are dense. As such, the easiest solution for finding non-zero entries in C is quite simply to not even subtract at first: just compare the entries of A and B. A logical comparison should be faster than a subtraction. First, find the entries of A and B where A != B, then subtract just those entries. Next, you simply need to convert from the vectorized (linear) indices in A and B to their (row, col) representation. This is similar to ind2sub and sub2ind of Matlab - take a look at this R reference for the calculations.
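The compare-first approach can be sketched in NumPy, where np.unravel_index and np.ravel_multi_index play roughly the roles of ind2sub and sub2ind. The arrays below are made-up examples; note that NumPy linear indices are 0-based and row-major by default, whereas Matlab and R use 1-based, column-major indexing.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([[1.0, 0.0, 3.0],
              [4.0, 5.0, 9.0]])

# Linear (flattened) indices of the entries that differ.
flat = np.flatnonzero(a != b)
# Convert linear indices to (row, col), similar to ind2sub.
rows, cols = np.unravel_index(flat, a.shape)
# Subtract only at the differing positions.
diff = a[rows, cols] - b[rows, cols]
# Convert (row, col) back to linear indices, similar to sub2ind.
back = np.ravel_multi_index((rows, cols), a.shape)
```

Only the differing entries are ever subtracted, so the full matrix c is never materialized.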

Have a look at NumPy; it has everything you ask for and more!

See this for sparse matrix support.

You could use the c.nonzero() method:

>>> from scipy.sparse import lil_eye
>>> c = lil_eye((4, 10)) # as an example
>>> c
<4x10 sparse matrix of type '<type 'numpy.float64'>'
        with 4 stored elements in LInked List format>
>>> c.nonzero()
(array([0, 1, 2, 3], dtype=int32), array([0, 1, 2, 3], dtype=int32))
>>> import numpy as np
>>> np.ascontiguousarray(c)
array([  (0, 0) 1.0
  (1, 1)        1.0
  (2, 2)        1.0
  (3, 3)        1.0], dtype=object)
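The session above was recorded with an old SciPy release, and lil_eye does not appear to be available in current ones. A rough modern equivalent, assuming a recent SciPy, builds the same 4 x 10 matrix with scipy.sparse.eye and the format argument:

```python
from scipy import sparse

# 4x10 matrix with ones on the main diagonal, stored in LIL format.
c = sparse.eye(4, 10, format="lil")

# nonzero() returns two arrays: row indices and column indices.
rows, cols = c.nonzero()
```

Both rows and cols contain 0, 1, 2, 3 here, matching the result shown in the original session.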

You don't need to compute the matrix c to find the indices of the non-zero elements of c = a - b; you could do (a != b).nonzero():

>>> a = np.random.random_integers(2, size=(4,4))
>>> b = np.random.random_integers(2, size=(4,4))
>>> (a != b).nonzero()
(array([0, 0, 1, 1, 1, 2, 3]), array([1, 2, 1, 2, 3, 2, 0]))
>>> a - b
array([[ 0,  1,  1,  0],
       [ 0,  1, -1, -1],
       [ 0,  0,  1,  0],
       [-1,  0,  0,  0]])
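If one (row, col) pair per differing element is more convenient than two parallel arrays, np.argwhere returns them directly. In this sketch the values of a and b are fixed by hand (an assumption, since the originals were random) so that a - b reproduces the difference matrix shown above.

```python
import numpy as np

# Hand-picked values whose difference matches the matrix shown above.
a = np.array([[1, 2, 2, 1],
              [1, 2, 1, 1],
              [1, 1, 2, 1],
              [1, 1, 1, 1]])
b = np.array([[1, 1, 1, 1],
              [1, 1, 2, 2],
              [1, 1, 1, 1],
              [2, 1, 1, 1]])

# One (row, col) pair per position where a and b differ.
pairs = np.argwhere(a != b)
```

Each row of pairs is one index pair, listed in row-major order.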

This code takes less than 0.1 s.

m <- matrix(rpois(1000000,0.01),ncol=1000)
m0 <- lapply(seq(NCOL(m)),function(x) which(m[,x] != 0))

EDIT: For sparse matrices of any size (that fit in memory).

DATA

library(data.table)

N <- 1e+5
n <- 1e+6

ta <- data.table(r=sample(seq(N), n,replace=TRUE),
                 c=sample(seq(N), n,replace=TRUE),
                 a=sample(1:20,n,replace=TRUE))
tb <- data.table(r=sample(seq(N), n,replace=TRUE),
                 c=sample(seq(N), n,replace=TRUE),
                 b=sample(1:20,n,replace=TRUE))
setkey(ta,r,c)
setkey(tb,r,c)

CODE

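# Full outer join on (r, c), so entries present in only one table are kept too.
system.time(tw <- merge(ta, tb, by = c("r", "c"), all = TRUE)[is.na(a) | is.na(b) | (a - b != 0), list(r = r, c = c)])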
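A comparable approach in Python, when a and b are themselves stored as sparse matrices, is to subtract them with SciPy and read the coordinates of the result. This is a sketch under assumptions: the small matrices A and B are invented for illustration, and explicitly stored zeros in the difference are filtered out by hand.

```python
from scipy import sparse

# Hypothetical 3x3 inputs given as (values, (rows, cols)) triplets.
A = sparse.csr_matrix(([3, 4, 5], ([0, 1, 2], [0, 2, 1])), shape=(3, 3))
B = sparse.csr_matrix(([3, 9, 6], ([0, 1, 2], [0, 2, 2])), shape=(3, 3))

C = (A - B).tocoo()  # sparse difference in coordinate form
# Entries that cancel exactly may be stored as explicit zeros; drop them.
mask = C.data != 0
rows, cols = C.row[mask], C.col[mask]

# Sorted list of (row, col) pairs where A and B differ.
pairs = sorted(zip(rows.tolist(), cols.tolist()))
```

The entry at (0, 0) is equal in both matrices and therefore does not appear in pairs, while entries present in only one matrix do.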

I haven't timed it, but the simplest code is

# Use != 0 rather than > 0 so that negative differences are found as well.
all.indices <- which(C != 0, arr.ind = TRUE)
