Iterating through a scipy.sparse vector (or matrix)

Question

I'm wondering what the best way is to iterate over the nonzero entries of a sparse matrix with scipy.sparse. For example, if I do the following:

from scipy.sparse import lil_matrix

x = lil_matrix( (20,1) )
x[13,0] = 1
x[15,0] = 2

c = 0
for i in x:
  print c, i
  c = c+1

the output is

0 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13   (0, 0) 1.0
14 
15   (0, 0) 2.0
16 
17 
18 
19

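So it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API

http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html

and searched around a bit, but I can't seem to find a solution that works.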

Answer 1

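Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, which used nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. The current fastest is using_tocoo_izip: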

import scipy.sparse
import random
import itertools

def using_nonzero(x):
    rows,cols = x.nonzero()
    for row,col in zip(rows,cols):
        ((row,col), x[row,col])

def using_coo(x):
    cx = scipy.sparse.coo_matrix(x)    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo(x):
    cx = x.tocoo()    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo_izip(x):
    cx = x.tocoo()    
    for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
        (i,j,v)

N=200
x = scipy.sparse.lil_matrix( (N,N) )
for _ in xrange(N):
    x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)

which yields these timeit results:

% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop
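The functions above are written for Python 2 (xrange, itertools.izip). As a rough sketch for Python 3, where the built-in zip is already lazy, the two extremes of the benchmark might be ported as follows. Returning lists instead of discarding the tuples is a change made here so the results can be compared; it is not part of the original benchmark.

```python
import random

import scipy.sparse


def using_nonzero(x):
    """Look up every stored entry by index: one x[row, col] access per entry."""
    rows, cols = x.nonzero()
    return [(row, col, x[row, col]) for row, col in zip(rows, cols)]


def using_tocoo(x):
    """Convert once to COO format, then walk its three parallel arrays."""
    cx = x.tocoo()
    return list(zip(cx.row, cx.col, cx.data))


# Same test matrix as above, with range in place of xrange.
N = 200
x = scipy.sparse.lil_matrix((N, N))
for _ in range(N):
    x[random.randint(0, N - 1), random.randint(0, N - 1)] = random.randint(1, 100)
```

Both functions should report the same set of (row, column, value) triples; only the cost of producing them differs. Timings will depend on the SciPy version and machine, so the figures above should not be expected to carry over.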

Answer 2

The fastest way should be to convert to a coo_matrix:

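import scipy.sparse

cx = scipy.sparse.coo_matrix(x)

# cx.row, cx.col and cx.data are parallel arrays, one element per stored entry
for i,j,v in zip(cx.row, cx.col, cx.data):
    print("(%d, %d), %s" % (i,j,v))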
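For a concrete check, here is a small self-contained sketch that applies the COO conversion to the 20x1 vector from the question. The `entries` list is a name introduced here for illustration; the indices and values are converted to plain Python types only to make them easy to inspect.

```python
from scipy.sparse import lil_matrix

# The vector from the question: two nonzero entries out of twenty.
x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

cx = x.tocoo()

# Only the stored entries are visited, not all twenty rows.
entries = [(int(i), int(j), float(v)) for i, j, v in zip(cx.row, cx.col, cx.data)]
for i, j, v in entries:
    print("(%d, %d), %s" % (i, j, v))
```

This prints one line for (13, 0) with value 1.0 and one for (15, 0) with value 2.0, instead of the twenty iterations seen in the question.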

Answer 3

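To loop over the various kinds of sparse matrices in scipy.sparse, I would use this small wrapper function (note that on Python 2 you are encouraged to use xrange and izip for better performance on large matrices):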

from scipy.sparse import isspmatrix_coo, isspmatrix_csc, isspmatrix_csr, isspmatrix_lil
def iter_spmatrix(matrix):
    """ Iterator for iterating the elements in a ``scipy.sparse.*_matrix`` 

    This will always return:
    >>> (row, column, matrix-element)

    Currently this can iterate `coo`, `csc`, `lil` and `csr`, others may easily be added.

    Parameters
    ----------
    matrix : ``scipy.sparse.spmatrix``
      the sparse matrix to iterate non-zero elements
    """
    if isspmatrix_coo(matrix):
        for r, c, m in zip(matrix.row, matrix.col, matrix.data):
            yield r, c, m

    elif isspmatrix_csc(matrix):
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                yield matrix.indices[ind], c, matrix.data[ind]

    elif isspmatrix_csr(matrix):
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                yield r, matrix.indices[ind], matrix.data[ind]

    elif isspmatrix_lil(matrix):
        for r in range(matrix.shape[0]):
            for c, d in zip(matrix.rows[r], matrix.data[r]):
                yield r, c, d

    else:
        raise NotImplementedError("The iterator for this sparse matrix has not been implemented")

Answer 4

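tocoo() materializes the entire matrix into a different structure, which is not the preferred M.O. for Python 3. You can also consider this iterator, which is especially useful for large matrices.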

from itertools import chain, repeat
def iter_csr(matrix):
  for (row, col, val) in zip(
    chain(*(
          repeat(i, r)
          for (i,r) in enumerate(matrix.indptr[1:] - matrix.indptr[:-1])
    )),
    matrix.indices,
    matrix.data
  ):
    yield (row, col, val)

I have to admit that I'm using a lot of Python constructs that should probably be replaced by NumPy constructs (especially enumerate).
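One possible NumPy-based variant, offered as a sketch rather than a drop-in replacement: np.diff on indptr gives the number of stored entries per row, and np.repeat expands that into one row index per stored entry. The function name iter_csr_numpy and the small test matrix are made up for this example. Note that it allocates one extra index array with an element per stored entry, so whether it fits in memory depends on the matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix


def iter_csr_numpy(matrix):
    """Return an iterator of (row, col, value) over the stored entries of a CSR matrix."""
    # np.diff(indptr) is the number of stored entries in each row;
    # np.repeat turns that into one row index per stored entry.
    rows = np.repeat(np.arange(matrix.shape[0]), np.diff(matrix.indptr))
    return zip(rows, matrix.indices, matrix.data)


# Small hypothetical matrix: row 1 is empty, row 2 has two entries.
m = csr_matrix(np.array([[0, 5, 0],
                         [0, 0, 0],
                         [7, 0, 9]]))
triples = [(int(r), int(c), int(v)) for r, c, v in iter_csr_numpy(m)]
```

The Python-level zip loop remains, so this removes only the chain/repeat/enumerate overhead, not the per-element iteration cost.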

NB:

In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
52.48686504364014
In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
70.19013023376465
In [45]: rather_dense_sparse_matrix
<99829x99829 sparse matrix of type '<class 'numpy.float16'>'
with 757622819 stored elements in Compressed Sparse Row format>

So yes, enumerate is a bit slow(ish).

For the iterator:

In [47]: it = iter_csr(rather_dense_sparse_matrix)
In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
113.something something

So you decide whether this overhead is acceptable; in my case, tocoo caused MemoryOverflows.

IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)

Answer 5

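I had the same problem, and actually, if your only concern is speed, the fastest way (more than an order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements of the dense matrix. (Of course, this approach requires a lot more memory.)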
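A minimal sketch of the dense approach, under the assumption that the whole matrix fits in memory. It uses toarray() rather than the todense() named above, since toarray() returns a plain ndarray that works directly with np.nonzero; the small matrix is made up for illustration.

```python
import numpy as np
from scipy.sparse import lil_matrix

x = lil_matrix((4, 3))
x[1, 2] = 3
x[3, 0] = 4

# Materialize the full (4, 3) array, including all the zeros.
dense = x.toarray()

# np.nonzero returns the row and column indices of the nonzero cells.
rows, cols = np.nonzero(dense)
entries = [(int(r), int(c), float(dense[r, c])) for r, c in zip(rows, cols)]
```

The memory cost grows with the full shape rather than the number of stored entries, so this is only practical for matrices that are small or not very sparse.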

Answer 6

Try filter(lambda x:x, x) instead of x.


Answer 1: score 61, accepted, 2010-11-30 21:57:09
Answer 2: score 32, 2010-11-30 22:05:22
Answer 3: score 2, 2017-03-06 12:29:43
Answer 4: score 1, 2015-07-06 11:11:50
Answer 5: score 1, 2010-12-28 16:18:14
Answer 6: score 0, 2010-11-30 21:56:20
