Building a (big) sparse matrix out of many (big) dicts

I have a series of dicts of the form:

dict_k:{(i,j):d_ij}

where i and j are integers corresponding to the indices in the sparse matrix to be built, and d_ij is a float. Each dictionary can contain up to O(1 million) values.

I have about 160 such dictionaries, each about 16 megabytes. Dictionaries may or may not contain duplicate keys with duplicate values; for example, the (0,0):1.25 key/value pair could appear in two different dictionaries.

I want to build a sparse matrix out of these dictionaries. The entries of the matrix are given by all of the {(i,j):d_ij} pairs across all of the dictionaries.

My naive approach is to build a huge dictionary out of all dictionaries like so:

import pickle

bigDict = {}
for i in range(160):
    # dictionnary_path: path of the i-th pickled dictionary
    with open(dictionnary_path, "rb") as fp:
        bigDict.update(pickle.load(fp))

Then retrieve the row/column indices and their corresponding coefficients to build a scipy COO-format sparse matrix with coo_matrix((coefficients, (rows, columns)), shape=(M, N)).
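For reference, a minimal sketch of that construction step (assuming bigDict has been merged as above; the 5-million dimension is taken from later in the question, and the variable names are illustrative):

import numpy as np
from scipy.sparse import coo_matrix

n = len(bigDict)
# Split the merged dict into coordinate arrays and a value array.
rows = np.fromiter((i for i, j in bigDict), dtype=np.int64, count=n)
columns = np.fromiter((j for i, j in bigDict), dtype=np.int64, count=n)
coefficients = np.fromiter(bigDict.values(), dtype=np.float64, count=n)

N = 5_000_000  # approximate dimension stated later in the question
A = coo_matrix((coefficients, (rows, columns)), shape=(N, N))

Note that coo_matrix sums duplicate (row, column) entries, but merging through a dict already deduplicates the keys, so each coordinate appears only once here.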

But this makes my computer freeze when building the huge dict. Do you have any smarter way of doing it? My end goal is to use this sparse matrix to perform matrix-vector multiplications.
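As a side note on that end goal: a COO matrix is convenient to assemble but is normally converted to CSR before repeated matrix-vector products. A minimal sketch, reusing the hypothetical A and N from the snippet above:

import numpy as np

A_csr = A.tocsr()  # CSR supports efficient row-wise matvec
x = np.ones(N)     # example dense vector
y = A_csr @ x      # sparse matrix-vector product, returns a dense ndarray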

An example dict:

 {(0, 1704510): 0.125,
 (0, 1704511): 0.089,
 (0, 1704512): 0.044,
 (0, 1704513): 0.021,
 (0, 1704514): 0.037,
 (0, 1704515): 0.032,
 (0, 1704516): 0.021,
 (0, 1704517): 0.013,
 (0, 502593): 0.089,
 (0, 502594): 0.125,
 (0, 502595): 0.089,
 (0, 502596): 0.044,
 (0, 502597): 0.032,
 (0, 502598): 0.037,
 (0, 502599): 0.032,
 (0, 502600): 0.021,
 (0, 129844): 0.044,
 (0, 129845): 0.089,
 (0, 129846): 0.125,
 (0, 129847): 0.089,
 (0, 129848): 0.021,
 (0, 129849): 0.032,
 (0, 129850): 0.037,
 (0, 129851): 0.032,
 (0, 28314): 0.021,
 (0, 28315): 0.044,
 (0, 28316): 0.089,
 (0, 28317): 0.125,
 (0, 28318): 0.013,
 (0, 28319): 0.021,
 (0, 28320): 0.032,
 (0, 28321): 0.037,
 (0, 4917): 1.0,
 (0, 4918): 0.354,
 (0, 4919): 0.089,
 (0, 4920): 0.032,
 (0, 4921): 0.125,
 (0, 4922): 0.089,
 (0, 4923): 0.044,
 (0, 4924): 0.021,
 (0, 615): 0.354,
 (0, 616): 1.0,
 (0, 617): 0.354,
 (0, 618): 0.089,
 (0, 619): 0.089,
 (0, 620): 0.125,
 (0, 621): 0.089,
 (0, 622): 0.044,
 (0, 45): 0.089,
 (0, 46): 0.354,
 (0, 47): 1.0,
 (0, 48): 0.354,
 (0, 49): 0.044,
 (0, 50): 0.089,
 (0, 51): 0.125,
...
 (13, 675): 0.354,
 (13, 680): 0.089,
 (13, 684): 0.089,
 (13, 685): 0.125,
 ...}

The matrix is approximately 5 million x 5 million.

pickle seems to create different objects for equal ints/floats, so probably most of your memory usage comes from the keys and values, not from the dictionaries themselves. You could try to deduplicate them, for example:

import pickle

# Interning table: setdefault returns the first object stored for a given
# value, so equal ints/floats end up sharing a single object.
dedup = {}.setdefault
def d(x):
    return dedup(x, x)

bigDict = {}
for i in range(160):
    # dictionnary_path: path of the i-th pickled dictionary
    with open(dictionnary_path, "rb") as fp:
        for (x, y), z in pickle.load(fp).items():
            bigDict[d(x), d(y)] = d(z)
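The trick works because dict.setdefault(x, x) stores x the first time an equal value is seen and returns the already-stored object on every later call, so equal ints/floats loaded from different pickles collapse into one shared object:

dedup = {}.setdefault
a = dedup(float("1.25"), float("1.25"))  # first call: stores and returns this object
b = dedup(float("1.25"), float("1.25"))  # equal key: returns the stored object
assert a is b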
