Efficient way to set elements to zero where mask is True on scipy sparse matrix

I have two scipy.sparse CSR matrices: 'a', and a boolean matrix 'mask'. I want to set the elements of 'a' to zero wherever the corresponding element of 'mask' is True.

For example:

>>>a
<3x3 sparse matrix of type '<type 'numpy.int32'>'
    with 4 stored elements in Compressed Sparse Row format>
>>>a.todense()
matrix([[0, 0, 3],
        [0, 1, 5],
        [7, 0, 0]])

>>>mask
<3x3 sparse matrix of type '<type 'numpy.bool_'>'
    with 4 stored elements in Compressed Sparse Row format>
>>>mask.todense()
matrix([[ True, False,  True],
        [False, False,  True],
        [False,  True, False]], dtype=bool)

Then I want to obtain the following result:

>>>result
<3x3 sparse matrix of type '<type 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>
>>>result.todense()
matrix([[0, 0, 0],
        [0, 1, 0],
        [7, 0, 0]])

I can do it with an operation like

result = a - a.multiply(mask)

or

a -= a.multiply(mask)  # Either in-place or a copy is fine.

But I think the operations above are inefficient. The actual shape of 'a' and 'mask' is 67,108,864 × 2,000,000, and these operations take several seconds on a high-spec server (64-core Xeon CPU, 512 GB memory). For example, when 'a' has about 30,000,000 non-zero elements and 'mask' has about 1,800,000 non-zero (True) elements, the operation above takes about 2 seconds.

Is there a more efficient way to do this?

The conditions are as follows.

  1. a.getnnz() != mask.getnnz()
  2. a.shape == mask.shape

Thanks!

Another approach (tried)

a.data *= ~np.array(mask[a.astype(bool)]).flatten(); a.eliminate_zeros()  # This takes about twice as long as the method above.

My initial impression is that this multiply-and-subtract approach is a reasonable one. Quite often sparse code implements operations as some sort of multiplication, even if the dense equivalents use more direct methods. The sparse sum over rows or columns uses a matrix multiplication with the appropriate row or column matrix of 1s. Even row or column indexing uses matrix multiplication (at least in the csr format).
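To make the point about sums concrete, here is a small sketch showing that a row or column sum gives the same numbers as a product with a vector of 1s. It only demonstrates the equivalence on the example matrix; it is not a claim about the exact code path inside scipy, and the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.array([[0, 0, 3],
                                [0, 1, 5],
                                [7, 0, 0]], dtype=np.int32))

# Row sums, computed directly and as a product with a vector of ones.
row_sums_direct = np.asarray(a.sum(axis=1)).ravel()
row_sums_matmul = a @ np.ones(a.shape[1], dtype=a.dtype)

# Column sums, the same way (using the transpose for the product).
col_sums_direct = np.asarray(a.sum(axis=0)).ravel()
col_sums_matmul = a.T @ np.ones(a.shape[0], dtype=a.dtype)

print(row_sums_direct, row_sums_matmul)
print(col_sums_direct, col_sums_matmul)
```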

Sometimes we can improve on operations by working directly with the matrix attributes (data, indices, indptr). But that requires a lot more thought and experimentation.
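As a rough illustration of working with the CSR attributes directly, here is a minimal sketch. The helper name zero_where_mask is made up for this example. It encodes each stored (row, column) position as one int64 key, which assumes row * n_cols + col fits in int64 (it does for the shapes in the question). It has not been timed on realistic data, so treat it as something to experiment with rather than a known improvement.

```python
import numpy as np
from scipy import sparse

def zero_where_mask(a, mask):
    """Return a copy of CSR matrix `a` with entries zeroed where `mask` is True.

    Hypothetical helper for illustration only.
    """
    a = a.tocsr(copy=True)
    mask = mask.tocsr()
    n_rows, n_cols = a.shape

    # Row index of every stored entry, recovered from indptr.
    rows = np.arange(n_rows, dtype=np.int64)
    a_rows = np.repeat(rows, np.diff(a.indptr))
    m_rows = np.repeat(rows, np.diff(mask.indptr))

    # Encode each (row, col) pair as a single int64 key.
    a_keys = a_rows * n_cols + a.indices
    m_keys = m_rows * n_cols + mask.indices
    # Ignore any explicitly stored False values in the mask.
    m_keys = m_keys[mask.data.astype(bool)]

    # Zero the stored values of `a` whose position is True in `mask`,
    # then drop them from the sparsity structure in one pass.
    a.data[np.isin(a_keys, m_keys)] = 0
    a.eliminate_zeros()
    return a

a = sparse.csr_matrix(np.array([[0, 0, 3],
                                [0, 1, 5],
                                [7, 0, 0]], dtype=np.int32))
mask = sparse.csr_matrix(np.array([[True, False, True],
                                   [False, False, True],
                                   [False, True, False]]))
result = zero_where_mask(a, mask)
print(result.toarray())
```

The original a is left untouched because the helper works on a copy; dropping the copy would make it an in-place operation.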

For dense arrays my first try would be

In [611]: a.A*~(mask.A)
Out[611]: 
array([[0, 0, 0],
       [0, 1, 0],
       [7, 0, 0]], dtype=int32)

But there isn't a direct way of applying not to a sparse matrix. If mask is indeed sparse, ~mask will not be. In your example mask has 4 True terms and 5 False, so a dense version would work just as well:

In [612]: nmask=sparse.csr_matrix(~(mask.A))
In [615]: a.multiply(nmask)
Out[615]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>

The earlier question "CSR scipy matrix does not update after updating its values" explores setting the diagonal of a sparse matrix to 0. It is possible to set values of the data attribute to 0, and then call eliminate_zeros once at the end.
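A small sketch of that pattern, using the example matrix from the question: zeroing entries of data leaves the stored-element count unchanged until eliminate_zeros is called. The positions zeroed here are picked by hand for illustration.

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[0, 0, 3],
                                [0, 1, 5],
                                [7, 0, 0]], dtype=np.int32))

# Stored values in CSR order are [3, 1, 5, 7]; overwrite the 3 and the 5.
m.data[[0, 2]] = 0
nnz_before = m.nnz   # explicit zeros still count as stored elements

# One cleanup pass removes the explicit zeros from the structure.
m.eliminate_zeros()
nnz_after = m.nnz

print(nnz_before, nnz_after)
print(m.toarray())
```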

The other dense method is

In [618]: a1=a.A
In [619]: a1[mask.A]=0

This also works with sparse matrices, sort of:

In [622]: a2=a.copy()
In [624]: a2[mask]
Out[624]: matrix([[0, 3, 5, 0]], dtype=int32)
In [625]: a2[mask]=0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
In [626]: a2
Out[626]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 6 stored elements in Compressed Sparse Row format>

As noted in the previous question, we need to eliminate the zeros:

In [628]: a2.eliminate_zeros()
In [629]: a2
Out[629]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>

Taking a hint from the sparsity warning, let's try the lil format:

In [638]: al=a.tolil()
In [639]: al[mask]
Out[639]: 
<1x4 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in LInked List format>
In [640]: al[mask]=0
In [641]: al
Out[641]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in LInked List format>

It's interesting that al[mask] is still sparse, whereas a[mask] is dense. The two formats use different indexing methods.

At some low level of sparsity, it might be worth iterating over the True (nonzero) elements of mask, setting the corresponding terms of a to zero directly.
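One way that iteration might look is sketched below. The helper name zero_by_iterating is hypothetical, and the sketch assumes a is in canonical CSR form (no duplicate entries). It sorts the column indices so each row can be searched with np.searchsorted. A Python-level loop like this is only plausible when mask has few True entries.

```python
import numpy as np
from scipy import sparse

def zero_by_iterating(a, mask):
    """Zero entries of `a` at the True positions of `mask`, one at a time.

    Hypothetical helper for illustration only.
    """
    a = a.tocsr(copy=True)
    a.sort_indices()  # searchsorted below needs sorted columns within each row
    coo = mask.tocoo()
    for r, c, flag in zip(coo.row, coo.col, coo.data):
        if not flag:
            continue  # skip explicitly stored False values
        start, end = a.indptr[r], a.indptr[r + 1]
        pos = start + np.searchsorted(a.indices[start:end], c)
        # Only touch `a` if it actually stores a value at (r, c).
        if pos < end and a.indices[pos] == c:
            a.data[pos] = 0
    a.eliminate_zeros()
    return a

a = sparse.csr_matrix(np.array([[0, 0, 3],
                                [0, 1, 5],
                                [7, 0, 0]], dtype=np.int32))
mask = sparse.csr_matrix(np.array([[True, False, True],
                                   [False, False, True],
                                   [False, True, False]]))
iterated = zero_by_iterating(a, mask)
print(iterated.toarray())
```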

I'm not going to guess at the relative speeds of these methods. That needs to be tested on realistic data.
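A timing harness for such a test might look like the sketch below. The shape and densities are arbitrary stand-ins, far smaller than the matrices in the question, and the function names are made up for this example. It compares the multiply-and-subtract approach with one variant that edits the data attribute directly, and the numbers it prints say nothing about behaviour at the real scale.

```python
import timeit
import numpy as np
from scipy import sparse

# Small stand-in data; the sizes and densities here are arbitrary assumptions.
shape = (2000, 500)
a = sparse.random(*shape, density=0.05, format="csr", random_state=0)
mask = sparse.random(*shape, density=0.05, format="csr", random_state=1).astype(bool)

def multiply_subtract(a, mask):
    # The approach from the question.
    return a - a.multiply(mask)

def zero_via_keys(a, mask):
    # Hypothetical alternative: match stored positions through int64 keys.
    out = a.copy()
    n_rows, n_cols = out.shape
    rows = np.arange(n_rows, dtype=np.int64)
    a_keys = np.repeat(rows, np.diff(out.indptr)) * n_cols + out.indices
    m_keys = np.repeat(rows, np.diff(mask.indptr)) * n_cols + mask.indices
    out.data[np.isin(a_keys, m_keys[mask.data])] = 0
    out.eliminate_zeros()
    return out

candidates = {"multiply_subtract": multiply_subtract, "zero_via_keys": zero_via_keys}

# Check that the candidates agree before comparing their speed.
results = {name: fn(a, mask) for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(lambda: fn(a, mask), number=5) / 5
    print(f"{name}: {seconds:.4f} s per call")
```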
