
SciPy/numpy: Only keep maximum value of a sparse matrix block

I am trying to operate on a large sparse matrix (currently 12000 x 12000). What I want to do is set blocks of it to zero while keeping the largest value within each block. I already have a working solution for dense matrices:

import numpy as np
from scipy.sparse import random

np.set_printoptions(precision=2)
#x = random(10,10,density=0.5)
x = np.random.random((10,10))
x = x.T * x
print(x)

def keep_only_max(a,b,c,d):
  sub = x[a:b,c:d]
  z = np.max(sub)
  sub[sub < z] = 0


sizes = np.asarray([0,1,5,4])
sizes_sum = np.cumsum(sizes)

for i in range(1,len(sizes)):
  current_i_min = sizes_sum[i-1]
  current_i_max = sizes_sum[i]
  for j in range(1,len(sizes)):
    if i >= j:
      continue
    current_j_min = sizes_sum[j-1]
    current_j_max = sizes_sum[j]

    keep_only_max(current_i_min, current_i_max, current_j_min, current_j_max)
    keep_only_max(current_j_min, current_j_max, current_i_min, current_i_max)

print(x)

This, however, doesn't work for sparse matrices (try uncommenting the x = random(...) line near the top). Any ideas on how I could implement this efficiently without calling todense()?

def keep_only_max(a,b,c,d):
  sub = x[a:b,c:d]
  z = np.max(sub)
  sub[sub < z] = 0

For a sparse x, the sub slicing works for csr format. It won't be as fast as the equivalent dense slice, and it creates a copy of that part of x rather than a view, so assigning into sub does not change x.
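To illustrate the copy behaviour, here is a minimal sketch. The small 3 x 3 matrix and the variable names are made up for the demonstration; the point is only that zeroing the stored values of a partial slice leaves the original matrix untouched.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small illustrative matrix (not from the question).
dense = np.array([[1.0, 2.0, 0.0],
                  [0.0, 3.0, 4.0],
                  [5.0, 0.0, 6.0]])
x = csr_matrix(dense)

sub = x[0:2, 1:3]   # slicing part of a CSR matrix builds a new matrix
sub.data[:] = 0.0   # overwrite the stored values of that new matrix

x_unchanged = x.toarray()
print(x_unchanged)  # x still holds its original values
```

This is why the in-place sub[sub < z] = 0 trick from the dense version cannot carry over directly: the modified block has to be written back or assembled into a new matrix.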

I'd have to check the sparse max functions. But I can imagine converting sub to coo format, using np.argmax on the .data attribute, and, with the corresponding row and col values, constructing a new matrix of the same shape with just one nonzero value.
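A rough sketch of that idea follows. The helper name block_max_only is hypothetical, and the sketch assumes the stored values are non-negative (otherwise an implicit zero could be the true maximum of the block, which looking only at .data would miss).

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def block_max_only(sub):
    """Return a matrix shaped like `sub` that keeps only its largest stored value.

    Hypothetical helper; assumes non-negative entries.
    """
    coo = sub.tocoo()
    if coo.nnz == 0:
        # Nothing stored in this block: return an empty matrix of the same shape.
        return coo_matrix(sub.shape)
    k = np.argmax(coo.data)  # position of the largest stored value
    return coo_matrix(([coo.data[k]], ([coo.row[k]], [coo.col[k]])),
                      shape=sub.shape)

# Usage on a small made-up block.
example = csr_matrix(np.array([[0.0, 2.0, 0.0],
                               [7.0, 0.0, 1.0]]))
result = block_max_only(example)
print(result.toarray())
```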

If your blocks covered x in a regular, nonoverlapping manner, I'd suggest constructing a new matrix with sparse.bmat. That basically collects the coo attributes of all the components, joins them into one set of arrays with the appropriate offsets, and makes a new coo matrix.
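As a small illustration of how sparse.bmat assembles pieces, here is a sketch with made-up blocks a, b and c. A None entry stands for an all-zero block whose shape is inferred from the other blocks in its row and column.

```python
from scipy.sparse import bmat, csr_matrix

# Made-up blocks of compatible shapes.
a = csr_matrix([[1, 2], [3, 4]])   # 2 x 2
b = csr_matrix([[5], [6]])         # 2 x 1
c = csr_matrix([[7, 8]])           # 1 x 2

# None is treated as a zero block (here 1 x 1).
m = bmat([[a, b], [c, None]], format="csr")
print(m.toarray())
# [[1 2 5]
#  [3 4 6]
#  [7 8 0]]
```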

If the blocks are scattered or overlap, you might have to generate them and insert them back into x one by one. csr format should work for that, but it will issue a sparse efficiency warning. lil is supposed to be faster for changing values. I think it will accept blocks.
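For the insert-back route, something along these lines might work. It is only a sketch on a made-up 4 x 4 matrix: it converts to lil, assigns a small dense replacement block to a slice, and converts back to csr. For very large blocks the dense temporary could be costly, so treat this as an illustration rather than a tuned solution.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up matrix with values 0..15.
original = np.arange(16, dtype=float).reshape(4, 4)
x = csr_matrix(original)

x_lil = x.tolil()                 # lil is meant for changing values

sub = x[0:2, 2:4]                 # the block to reduce (a copy)
i, j = np.unravel_index(sub.argmax(), sub.shape)

block = np.zeros(sub.shape)       # replacement block: all zero ...
block[i, j] = sub[i, j]           # ... except for the block maximum

x_lil[0:2, 2:4] = block           # write the block back into the lil matrix
x = x_lil.tocsr()
print(x.toarray())
```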

I can imagine doing this with sparse matrices, but it will take time to set up a test case and debug the process.

Thanks to hpaulj I managed to implement a solution using scipy.sparse.bmat:

from scipy.sparse import bmat, csr_matrix, rand
import numpy as np


np.set_printoptions(precision=2)

# my matrices are symmetric, so generate random symmetric matrix
x = rand(10,10,density=0.4)
x = x.T * x
x = x.tocsr()  # make sure x is in CSR format so the slicing below works


def keep_only_max(a,b,c,d):
    sub = x[a:b,c:d]
    z = np.unravel_index(sub.argmax(),sub.shape)
    i1 = z[0]
    j1 = z[1]
    new = csr_matrix(([sub[i1,j1]],([i1],[j1])),shape=(b-a,d-c))
    return new

def keep_all(a,b,c,d):
    return x[a:b,c:d].copy()

# we want to create a chessboard pattern where the first central block is 1x1, the second 5x5 and the last 4x4
sizes = np.asarray([0,1,5,4])
sizes_sum = np.cumsum(sizes)

# allocate a 2D list to hold the chessboard blocks (every entry is filled below)
r = range(len(sizes)-1)
blocks = [[None for _ in r] for _ in r]


for i in range(1,len(sizes)):
    current_i_min = sizes_sum[i-1]
    current_i_max = sizes_sum[i]

    for j in range(i,len(sizes)):

        current_j_min = sizes_sum[j-1]
        current_j_max = sizes_sum[j]

        if i == j:
            # keep the blocks at the diagonal completely
            sub = keep_all(current_i_min, current_i_max, current_j_min, current_j_max)
            blocks[i-1][j-1] = sub
        else:
            # the blocks not on the diagonal only keep their maximum value
            # we can leverage the matrix symmetry and only calculate one new matrix.
            m1 = keep_only_max(current_i_min, current_i_max, current_j_min, current_j_max)
            m2 = m1.T

            blocks[i-1][j-1] = m1
            blocks[j-1][i-1] = m2


z = bmat(blocks)
print(z.todense())
