
Scipy: Sparse indicator matrix from array(s)

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a, b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:

I = a[:,None]==b

The following is slow and still memory-inefficient during creation:

I = csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

The following at least gives the rows and cols for a better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:

z = np.argwhere((a[:,None]==b))

Any ideas?

One way to do it is to first identify all distinct values that a and b have in common, using sets. This should work well if a and b do not contain very many distinct values. One then only has to loop over the common values (the variable values below) and use np.argwhere to find the indices in a and b where each value occurs. The 2D indices of the sparse matrix can then be constructed with np.repeat and np.tile:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))

##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []

##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
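    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    ##every index in x is paired with every index in y, so x is repeated
    ##len(y) times and y is tiled len(x) times (both have len(x)*len(y) entries)
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))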

##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)

##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )

##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)

The syntax for generating the CSR matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
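The loop over common values can also be wrapped in a reusable function. The following is only a minimal sketch: the function name indicator_from_common_values is made up for illustration, np.intersect1d stands in for the set intersection, and np.flatnonzero stands in for np.argwhere(...).ravel(). It also covers the case where a and b share no values, which would otherwise leave nothing to concatenate.

```python
import numpy as np
from scipy import sparse


def indicator_from_common_values(a, b):
    """Sparse boolean matrix I with I[i, j] == True where a[i] == b[j].

    Hypothetical helper name; a sketch of the loop-over-common-values idea.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    rows, cols = [], []
    # np.intersect1d returns the sorted unique values present in both arrays
    for value in np.intersect1d(a, b):
        x = np.flatnonzero(a == value)  # positions of value in a
        y = np.flatnonzero(b == value)  # positions of value in b
        # pair every position in x with every position in y
        rows.append(np.repeat(x, len(y)))
        cols.append(np.tile(y, len(x)))
    if rows:
        rows = np.concatenate(rows)
        cols = np.concatenate(cols)
    else:
        # no common values: build an empty matrix of the right shape
        rows = np.empty(0, dtype=np.intp)
        cols = np.empty(0, dtype=np.intp)
    data = np.ones(len(rows), dtype=bool)
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))


a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
I = indicator_from_common_values(a, b)
# compare against the dense construction from the question
print(np.array_equal(I.toarray(), a[:, None] == b))
```

The number of Python-level iterations equals the number of common values, so this is likely to pay off mainly when that number is small compared to len(a) and len(b).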

Old Answer:

I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here is some code that uses two 1D arrays of random integers to first generate the sparse matrix the way the OP posted, and then uses a generator expression to test all pairs of elements for equality:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0)  ## --> True

I think there is no way around the double loop, and ideally it would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...
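One possible way to push the pairing into compiled code, offered here as a sketch and not as something taken from the answers above, is to encode a and b as sparse one-hot matrices over a shared set of values and multiply them. The function name indicator_via_onehot is hypothetical, and the sketch assumes that a and b are non-empty 1D arrays whose values can be sorted together by np.unique.

```python
import numpy as np
from scipy import sparse


def indicator_via_onehot(a, b):
    """Sparse boolean matrix I with I[i, j] == True where a[i] == b[j].

    Hypothetical helper; assumes non-empty 1D inputs with comparable values.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # Map every element of a and b to an integer code in a shared vocabulary
    _, codes = np.unique(np.concatenate([a, b]), return_inverse=True)
    codes = codes.ravel()
    n_values = int(codes.max()) + 1
    codes_a, codes_b = codes[:len(a)], codes[len(a):]
    # One-hot encodings: row i has a single 1 in the column of its value code
    A = sparse.csr_matrix(
        (np.ones(len(a), dtype=np.int8), (np.arange(len(a)), codes_a)),
        shape=(len(a), n_values),
    )
    B = sparse.csr_matrix(
        (np.ones(len(b), dtype=np.int8), (np.arange(len(b)), codes_b)),
        shape=(len(b), n_values),
    )
    # (A @ B.T)[i, j] is 1 exactly when a[i] and b[j] share a code, else 0
    return (A @ B.T).astype(bool)


a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
I4 = indicator_via_onehot(a, b)
# compare against the dense construction from the question
print(np.array_equal(I4.toarray(), a[:, None] == b))
```

Because each one-hot row holds exactly one nonzero entry, every entry of the product is 0 or 1, so the small int8 dtype cannot overflow. Whether this beats the loop over common values depends on the data and would need to be timed.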

You could use numpy.isclose with a small tolerance:

np.isclose(a,b)

Or pandas.DataFrame.eq, if a and b are pandas objects:

a.eq(b)

Note that this returns an array of True/False values.
