Fastest way from logic matrix to list of sets
I need to convert a sparse logic matrix into a list of sets, where each list[i] contains the set of rows with nonzero values for column[i]. The following code works, but I'm wondering if there's a faster way to do this. The actual data I'm using is approximately 6000x6000 and much sparser than this example.
import numpy as np

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])
rows, cols = A.shape
# C is a pair of arrays: (row indices, column indices) of the nonzero entries
C = np.nonzero(A)
# One empty set per column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
    D[C[1][i]].add(C[0][i])
print(D)
If you represent the sparse array as a csc_matrix, you can use the indices and indptr attributes to create the sets.
For example,
In [93]: A
Out[93]:
array([[1, 0, 0, 0, 0, 1],
       [0, 1, 1, 1, 1, 0],
       [1, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 1],
       [1, 1, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0],
       [0, 0, 1, 0, 1, 0]])
In [94]: from scipy.sparse import csc_matrix
In [95]: C = csc_matrix(A)
In [96]: C.indptr
Out[96]: array([ 0, 5, 8, 12, 16, 20, 23], dtype=int32)
In [97]: C.indices
Out[97]: array([0, 2, 3, 4, 5, 1, 3, 4, 1, 2, 6, 7, 1, 3, 4, 6, 1, 2, 6, 7, 0, 2, 3], dtype=int32)
In [98]: D = [set(C.indices[C.indptr[i]:C.indptr[i+1]]) for i in range(C.shape[1])]
In [99]: D
Out[99]:
[{0, 2, 3, 4, 5},
{1, 3, 4},
{1, 2, 6, 7},
{1, 3, 4, 6},
{1, 2, 6, 7},
{0, 2, 3}]
For a list of arrays instead of sets, just don't call set():
In [100]: [C.indices[C.indptr[i]:C.indptr[i+1]] for i in range(len(C.indptr)-1)]
Out[100]:
[array([0, 2, 3, 4, 5], dtype=int32),
array([1, 3, 4], dtype=int32),
array([1, 2, 6, 7], dtype=int32),
array([1, 3, 4, 6], dtype=int32),
array([1, 2, 6, 7], dtype=int32),
array([0, 2, 3], dtype=int32)]
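The comprehension above can be wrapped in a small helper. This is only a minimal sketch: the function name columns_to_sets is my own (it is not part of SciPy), and it assumes the input is a dense 2-D array. A column with no nonzero entries comes out as an empty set, so position i in the result always matches column i.

```python
import numpy as np
from scipy.sparse import csc_matrix


def columns_to_sets(A):
    """Return a list whose i-th entry is the set of row indices that are
    nonzero in column i of the dense 2-D array A."""
    C = csc_matrix(A)
    # indptr[i]:indptr[i+1] delimits the row indices stored for column i
    return [set(C.indices[C.indptr[i]:C.indptr[i + 1]].tolist())
            for i in range(C.shape[1])]


# Small example; the middle column has no nonzero entries.
A = np.array([[1, 0, 0],
              [0, 0, 1],
              [1, 0, 1]])
D = columns_to_sets(A)
```

The .tolist() call converts the NumPy integers to plain Python ints, which is optional but makes the printed sets easier to read.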
Since you already called np.nonzero on A, see if this works faster:
>>> from itertools import groupby
>>> C = np.transpose(np.nonzero(A.T))
>>> [{i[1] for i in g} for _, g in groupby(C, key=lambda x: x[0])]
[{0, 2, 3, 4, 5}, {1, 3, 4}, {1, 2, 6, 7}, {1, 3, 4, 6}, {1, 2, 6, 7}, {0, 2, 3}]
Some timings:
In [4]: %%timeit
...: C = np.transpose(np.nonzero(A.T))
...: [{i[1] for i in g} for _, g in groupby(C, key=lambda x: x[0])]
...:
10000 loops, best of 3: 39 µs per loop
In [7]: %%timeit
...: C=csc_matrix(A)
...: [set(C.indices[C.indptr[i]:C.indptr[i+1]]) for i in range(C.shape[1])]
...:
1000 loops, best of 3: 317 µs per loop
I don't know if this increases speed much, but your iteration can be streamlined with:
for i,j in zip(*C):
    D[j].add(i)
A defaultdict could add a nice touch to this task:
In [58]: from collections import defaultdict
In [59]: D=defaultdict(set)
In [60]: for i,j in zip(*C):
   ....:     D[j].add(i)
In [61]: D
Out[61]: defaultdict(<class 'set'>, {0: {0, 2, 3, 4, 5}, 1: {1, 3, 4}, 2: {1, 2, 6, 7}, 3: {1, 3, 4, 6}, 4: {1, 2, 6, 7}, 5: {0, 2, 3}})
In [62]: dict(D)
Out[62]:
{0: {0, 2, 3, 4, 5},
1: {1, 3, 4},
2: {1, 2, 6, 7},
3: {1, 3, 4, 6},
4: {1, 2, 6, 7},
5: {0, 2, 3}}
An alternative with sparse matrices is the lil format, which saves the data as a list of lists. Since you want to collect data by column, make the matrix from A.T (the transpose). Here sparse is the scipy.sparse module (from scipy import sparse):
In [70]: M=sparse.lil_matrix(A.T)
In [71]: M.rows
Out[71]:
array([[0, 2, 3, 4, 5], [1, 3, 4], [1, 2, 6, 7], [1, 3, 4, 6],
       [1, 2, 6, 7], [0, 2, 3]], dtype=object)
These are the same lists.
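If sets rather than lists are needed, the rows attribute can be converted directly. A minimal sketch follows; it uses a smaller array than the one in the question to stay self-contained, and the name D simply mirrors the question's variable.

```python
import numpy as np
from scipy import sparse

A = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 0, 0]])

# lil_matrix stores one Python list of column indices per row, so building
# it from A.T gives one list of row indices per column of A.
M = sparse.lil_matrix(A.T)
D = [set(r) for r in M.rows]
```

An all-zero column (the middle one here) yields an empty list in M.rows and therefore an empty set in D.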
For this small case, direct iteration is faster than the sparse approaches:
In [72]: %%timeit
....: D=defaultdict(set)
....: for i,j in zip(*C):
   ....:     D[j].add(i)
....:
10000 loops, best of 3: 24.4 µs per loop
In [73]: %%timeit
....: D=[set() for j in range(A.shape[1])]
....: for i,j in zip(*C):
   ....:     D[j].add(i)
....:
10000 loops, best of 3: 22.9 µs per loop
In [74]: %%timeit
....: M=sparse.lil_matrix(A.T)
....: M.rows
....:
1000 loops, best of 3: 588 µs per loop
In [75]: %%timeit
....: C=sparse.csc_matrix(A)
....: D = [set(C.indices[C.indptr[i]:C.indptr[i+1]]) for i in range(C.shape[1])]
....:
1000 loops, best of 3: 476 µs per loop
For a large array, the setup time for the sparse matrix will be less significant.
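To check that claim on data closer to the real size, something like the following could be used. It is a rough sketch, not a careful benchmark: the function names, the random test matrix, and the density value are all assumptions of mine, since the question only says the data is "much more sparse" than the example.

```python
import timeit

import numpy as np
from scipy.sparse import csc_matrix


def sets_by_iteration(A):
    """Direct iteration over the nonzero coordinates."""
    D = [set() for _ in range(A.shape[1])]
    for i, j in zip(*np.nonzero(A)):
        D[j].add(i)
    return D


def sets_by_csc(A):
    """Slice the CSC index array column by column."""
    C = csc_matrix(A)
    return [set(C.indices[C.indptr[i]:C.indptr[i + 1]])
            for i in range(C.shape[1])]


def compare(n, density, number=3, seed=0):
    """Time both approaches on a random n x n boolean matrix.

    Returns a dict mapping function name to the best time in seconds
    out of `number` single runs."""
    rng = np.random.default_rng(seed)
    A = rng.random((n, n)) < density  # True with probability `density`
    return {
        f.__name__: min(timeit.repeat(lambda: f(A), number=1, repeat=number))
        for f in (sets_by_iteration, sets_by_csc)
    }
```

Calling compare(6000, 0.001) would approximate the size mentioned in the question. Note that generating the random matrix this way temporarily allocates a dense float array of roughly 290 MB for n = 6000.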
==========================
Do we really need set? A variation on the lil approach is to start with the nonzero on the transpose, i.e. by column:
In [90]: C=np.nonzero(A.T)
# (array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5], dtype=int32),
# array([0, 2, 3, 4, 5, 1, 3, 4, 1, 2, 6, 7, 1, 3, 4, 6, 1, 2, 6, 7, 0, 2, 3], dtype=int32))
The numbers are all there; we just have to split the 2nd list into pieces corresponding to the first:
In [91]: i=np.nonzero(np.diff(C[0]))[0]+1
In [92]: np.split(C[1],i)
Out[92]:
[array([0, 2, 3, 4, 5], dtype=int32),
array([1, 3, 4], dtype=int32),
array([1, 2, 6, 7], dtype=int32),
array([1, 3, 4, 6], dtype=int32),
array([1, 2, 6, 7], dtype=int32),
array([0, 2, 3], dtype=int32)]
This is slower than the direct iteration, but I suspect it scales better, possibly as well as any of the sparse alternatives:
In [96]: %%timeit
   ....: C=np.nonzero(A.T)
....: i=np.nonzero(np.diff(C[0]))[0]+1
....: np.split(C[1],i)
....:
10000 loops, best of 3: 55.2 µs per loop
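One caveat with the np.diff split: a column with no nonzero entries never appears in C[0], so it produces no piece at all and every later piece shifts to a lower position. With data much sparser than the example, that may matter. A sketch of one way around it, assuming list positions must line up with column numbers and the matrix has at least one column, is to compute the split points from per-column counts (np.bincount) instead. The function name rows_per_column is my own.

```python
import numpy as np


def rows_per_column(A):
    """Return a list of arrays; entry i holds the row indices that are
    nonzero in column i of A (an empty array for an all-zero column)."""
    A = np.asarray(A)
    # nonzero on the transpose yields coordinates sorted by column of A
    cols, rows = np.nonzero(A.T)
    # number of nonzero entries in each column, including empty columns
    counts = np.bincount(cols, minlength=A.shape[1])
    # cumulative counts (minus the last) are the split points
    return np.split(rows, np.cumsum(counts)[:-1])


# Column 1 is empty; it still gets its own (empty) entry.
A = np.array([[1, 0, 0],
              [0, 0, 1]])
pieces = rows_per_column(A)
```

Wrapping each piece in set(...) gives the list of sets asked for in the question.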