
Finding N random zero elements from a scipy sparse matrix

I have a large sparse matrix in the scipy lil_matrix format (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html). Its size is 281903x281903, and it is an adjacency matrix.

I need a reliable way to get N indices whose entries are zero. I can't just enumerate all zero indices and then choose random ones, since that makes my computer run out of memory. Is there a way of identifying N random zero indices without having to trawl through the entire data structure?

I currently get 10% of the nonzero indices the following way (Y is my sparse matrix):

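import math
import random

import numpy as np

percent = 0.1

# Y.nonzero() returns a (rows, cols) pair of arrays for the nonzero entries
oneIdx = Y.nonzero()
numberOfOnes = len(oneIdx[0])
maskLength = int(math.floor(numberOfOnes * percent))
idxOne = np.array(random.sample(range(0,numberOfOnes), maskLength))

maskOne = tuple(np.asarray(oneIdx)[:,idxOne])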

I am looking for a way to get a mask with the same length as the nonzero mask, but made up of indices of zero entries.

Here is an approach based on rejection sampling. Based on the numbers in your example, an index chosen uniformly at random is very likely to point at a zero entry, so this will be a relatively efficient approach.

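import numpy as np
from scipy import sparse

dims = (281903, 281903)

# np.int was removed in NumPy 1.24; the builtin int works as the dtype
mat = sparse.lil_matrix(dims, dtype=int)

# populate a few random entries so the example has some nonzeros
for _ in range(1000):
    x, y = np.random.randint(0, dims[0], 2)
    mat[x, y] = 1


def sample_zero_forever(mat):
    # start from the nonzero coordinates, then also record every sample yielded
    nonzero_or_sampled = set(zip(*mat.nonzero()))
    while True:
        # draw the row and column separately so non-square matrices work too
        t = (np.random.randint(0, mat.shape[0]), np.random.randint(0, mat.shape[1]))
        if t not in nonzero_or_sampled:
            yield t
            nonzero_or_sampled.add(t)


def sample_zero_n(mat, n=100):
    itr = sample_zero_forever(mat)
    return [next(itr) for _ in range(n)]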
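
The generator above yields one (row, col) tuple at a time. If you want the result in the same shape as maskOne from the question (a tuple of a row array and a column array), the same rejection-sampling idea can be done in batches with NumPy. This is only a sketch under a few assumptions: the name sample_zero_mask and its rng parameter are made up for this illustration, and it encodes each cell as a single linear index row * n_cols + col so that nonzero cells can be filtered with np.isin.

```python
import numpy as np
from scipy import sparse


def sample_zero_mask(mat, n, rng=None):
    """Return (rows, cols) arrays addressing n distinct zero entries of mat.

    Hypothetical helper for illustration; not part of scipy.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_rows, n_cols = mat.shape
    total = n_rows * n_cols

    # Encode the nonzero coordinates as linear indices for membership tests.
    rows, cols = mat.nonzero()
    occupied = rows.astype(np.int64) * n_cols + cols.astype(np.int64)
    if n > total - occupied.size:
        raise ValueError("matrix does not contain n zero entries")

    chosen = np.empty(0, dtype=np.int64)
    while chosen.size < n:
        # Draw a batch, reject candidates that hit a nonzero cell,
        # then merge with earlier draws and drop duplicates.
        cand = rng.integers(0, total, size=2 * (n - chosen.size), dtype=np.int64)
        cand = cand[~np.isin(cand, occupied)]
        chosen = np.unique(np.concatenate([chosen, cand]))

    # np.unique sorts its output, so shuffle before truncating to n.
    chosen = rng.permutation(chosen)[:n]
    # Decode linear indices back into (row, col) arrays.
    return np.divmod(chosen, n_cols)


# Small demo matrix so the example runs quickly.
demo = sparse.lil_matrix((1000, 1000), dtype=np.int8)
demo[0, 1] = 1
demo[5, 7] = 1
mask_zero = sample_zero_mask(demo, n=10, rng=np.random.default_rng(0))
```

Memory use here scales with the number of nonzeros plus n, not with the number of zero cells. Like the generator version, it slows down when the matrix is dense or when n is close to the total number of zero entries, because most candidates are then rejected.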
