Consecutive elements in a sparse matrix row

Question

I am working on a sparse matrix stored in COO format. What would be the fastest way to get the number of groups of consecutive nonzero elements in each row?

For example, consider the following matrix:

a = [[0,1,2,0],[1,0,0,2],[0,0,0,0],[1,0,1,0]]

Its COO representation would be:

  (0, 1)    1
  (0, 2)    2
  (1, 0)    1
  (1, 3)    2
  (3, 0)    1
  (3, 2)    1

I need the result to be [1,2,0,2]. The first row contains two nonzero elements that are adjacent, so they form one group. In the second row there are two nonzero elements, but they are not adjacent, so they form two groups. The third row has no nonzero elements and hence no groups. The fourth row again has two nonzero elements, but they are separated by a zero, so we count them as two groups. In other words, I want the number of clusters per row. Iterating through the rows is an option, but only if there is no faster solution. Any help in this regard is appreciated.

Another simple example: consider the following row:

[1,2,3,0,0,0,2,0,0,8,7,6,0,0]

The above row should return [3], since there are three groups of nonzero elements separated by zeros.

Answer 1

Convert it to a dense array, and apply your logic row by row.

You want the number of groups per row.
Zeros count when defining groups.
Row iteration is faster with arrays.

In coo format your matrix looks like:
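
As a rough illustration of the dense, row-by-row route, here is a minimal sketch. The helper name `count_groups_dense_row` is made up for this example, and it assumes the matrix is small enough to hold in memory as a dense array.

```python
import numpy as np
from scipy import sparse

def count_groups_dense_row(row):
    """Count runs of consecutive nonzeros in one dense row (hypothetical helper)."""
    mask = np.asarray(row) != 0
    # A group starts at a nonzero that is preceded by a zero or by the row start.
    previous = np.concatenate(([False], mask[:-1]))
    return int(np.count_nonzero(mask & ~previous))

a = [[0, 1, 2, 0], [1, 0, 0, 2], [0, 0, 0, 0], [1, 0, 1, 0]]
M = sparse.coo_matrix(a)
# Convert to dense once, then apply the single-row logic to each row.
result = [count_groups_dense_row(r) for r in M.toarray()]
print(result)
```

For the matrix in the question this gives one count per row, including 0 for the empty third row.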

In [623]: M=sparse.coo_matrix(a)
In [624]: M.data
Out[624]: array([1, 2, 1, 2, 1, 1])
In [625]: M.row
Out[625]: array([0, 0, 1, 1, 3, 3], dtype=int32)
In [626]: M.col
Out[626]: array([1, 2, 0, 3, 0, 2], dtype=int32)

This format does not implement row indexing; csr and lil do:

In [627]: M.tolil().data
Out[627]: array([[1, 2], [1, 2], [], [1, 1]], dtype=object)
In [628]: M.tolil().rows
Out[628]: array([[1, 2], [0, 3], [], [0, 2]], dtype=object)

So the sparse information for the 1st row is a list of nonzero data values, [1,2], and a list of their column numbers, [1,2]. Compare that with the row of the dense array, [0, 1, 2, 0]. Which is easier to analyze?

Your first task is to write a function that analyzes one row. I haven't studied your logic enough to say whether the dense form is better than the sparse one or not. It is easy to get the column information from the dense form with MA[0,:].nonzero(), where MA is the dense array.

In your last example, I can get the nonzero indices:

In [631]: np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])
Out[631]: (array([ 0,  1,  2,  6,  9, 10, 11], dtype=int32),)
In [632]: idx=np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])[0]
In [633]: idx
Out[633]: array([ 0,  1,  2,  6,  9, 10, 11], dtype=int32)
In [634]: np.diff(idx)
Out[634]: array([1, 1, 4, 3, 1, 1], dtype=int32)

We may be able to get the desired count from the number of diff values >1, though I'd have to look at more examples to define the details.

Extending the analysis to multiple rows depends on first thoroughly understanding the single-row case.
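
One way to turn the diff idea into a single-row function is sketched below. The name `count_groups` is invented for this example. The rule it assumes is that every gap larger than 1 between neighbouring nonzero positions starts a new group, plus one for the first group when the row has any nonzeros at all.

```python
import numpy as np

def count_groups(row):
    """Number of runs of consecutive nonzero entries in a 1-D sequence."""
    idx = np.nonzero(row)[0]            # column positions of the nonzeros
    if idx.size == 0:
        return 0                        # no nonzeros, so no groups
    # Each gap larger than 1 starts a new group; add 1 for the first group.
    return int(np.count_nonzero(np.diff(idx) > 1)) + 1

example = [1, 2, 3, 0, 0, 0, 2, 0, 0, 8, 7, 6, 0, 0]
print(count_groups(example))  # 3
```

The empty-row check matters: without it an all-zero row would be reported as one group instead of zero.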
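
Once the single-row rule is settled, one possible multi-row extension works directly on the COO row and col arrays and avoids a Python loop. This is only a sketch: `groups_per_row_coo` is a hypothetical name, and it assumes the matrix stores no explicit zeros.

```python
import numpy as np

def groups_per_row_coo(row, col, n_rows):
    """Count runs of adjacent column indices per row from COO coordinates."""
    row = np.asarray(row)
    col = np.asarray(col)
    if row.size == 0:
        return np.zeros(n_rows, dtype=int)
    # Sort entries by row, then by column within each row.
    order = np.lexsort((col, row))
    row, col = row[order], col[order]
    # An entry starts a new group if it is the first entry of its row,
    # or if it is not directly to the right of the previous entry.
    new_group = np.ones(row.size, dtype=bool)
    new_group[1:] = (row[1:] != row[:-1]) | (col[1:] - col[:-1] > 1)
    # Count the group starts that fall in each row.
    return np.bincount(row[new_group], minlength=n_rows)

# Coordinates of the example matrix, as shown by M.row and M.col above.
row = np.array([0, 0, 1, 1, 3, 3])
col = np.array([1, 2, 0, 3, 0, 2])
print(groups_per_row_coo(row, col, 4))
```

With a scipy coo_matrix M, the call would be groups_per_row_coo(M.row, M.col, M.shape[0]). The sort is included because COO entries are not guaranteed to be in row-major order.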

Answer 2

With the help of @hpaulj's comment, I came up with the following snippet to do this:

import numpy as np

M = m.tolil()  # m is the sparse matrix from the question
r = []
for i in range(M.shape[0]):
    idx = M.rows[i]  # column indices of the nonzeros in row i
    if len(idx) == 0:
        counts = 0  # empty row: no groups
    else:
        # every gap larger than 1 between neighbouring columns starts a new group
        counts = int(np.count_nonzero(np.diff(idx) > 1)) + 1
    r.append(counts)
tempcluster = np.sum(r) / float(M.shape[0])  # mean number of groups per row
cluster.append(tempcluster)  # cluster is a list defined elsewhere (not shown)

Answer 1
1, accepted, 2016-10-17 16:47:07

Answer 2
0, 2016-11-03 03:51:20
