
Corroborating 2 large Python dictionaries

Say I have 2 dictionaries, each with around 100000 entries (each can be of a different length):

dict1 = {"a": ["w", "x"], "b":["y"], "c":["z"] ...}
dict2 = {"x": ["a", "b"], "y":["b", "d"], "z":["d"] ...}

I need to perform an operation using these two dictionaries:

  • Treat each dict item as a set of mappings (i.e. the list of all mappings in dict1 would be "a"->"w", "a"->"x", "b"->"y" and "c"->"z").
  • Only keep a mapping in dict1 if the reverse mapping exists in dict2.

The resulting dictionary would be: {"a": ["x"], "b": ["y"]}
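To make the first bullet concrete, here is a small sketch that flattens a dictionary of lists into explicit (key, value) pairs. The helper name `iter_mappings` is made up for illustration, and the data is only the visible part of the truncated example above.

```python
def iter_mappings(d):
    """Yield one (key, value) pair per list element of a dict of lists."""
    for key, values in d.items():
        for value in values:
            yield (key, value)

# Visible part of the sample dict1 from the question.
dict1 = {"a": ["w", "x"], "b": ["y"], "c": ["z"]}

pairs = list(iter_mappings(dict1))
print(pairs)  # [('a', 'w'), ('a', 'x'), ('b', 'y'), ('c', 'z')]
```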

My current solution uses two m*n all-zeros dataframes, where m and n are the lengths of dict1 and dict2 respectively, the index labels are the keys in dict1, and the column labels are the keys in dict2.

For the first dataframe, I insert a 1 at each cell where index label -> column label represents a mapping in dict1. For the second dataframe, I insert a 1 at each cell where column label -> index label represents a mapping in dict2.

I then perform an element-wise product between the two dataframes, which only leaves the cells that have a mapping "a1"->"x1" in dict1 and "x1"->"a1" in dict2.

However, this takes up way too much memory and is very expensive. Is there an alternative algorithm I can use?
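A rough back-of-the-envelope estimate shows why the dense approach is so heavy. The sizes come from the question (about 100000 keys per dictionary); the 8 bytes per cell is an assumption (a 64-bit dtype such as int64 or float64), so treat the result as an order of magnitude only.

```python
m = 100_000         # approximate number of keys in dict1 (from the question)
n = 100_000         # approximate number of keys in dict2 (from the question)
bytes_per_cell = 8  # assumption: 64-bit cells, e.g. int64 or float64

cells = m * n
gigabytes_per_frame = cells * bytes_per_cell / 1e9

print(f"{cells:,} cells per frame")               # 10,000,000,000 cells per frame
print(f"{gigabytes_per_frame:.0f} GB per frame")  # 80 GB per frame
```

With two such frames, the total is roughly twice that, even though almost every cell is zero.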

How about using the same idea, but replacing the sparse matrix you're using with a set of key pairs? Something like:

import collections
def fn(dict1, dict2):
    mapping_set = set()
    for k, vv in dict2.items():
        for v in vv:
            mapping_set.add((k, v))
    result_dict = collections.defaultdict(list)
    for k, vv in dict1.items():
        for v in vv:
            if (v, k) in mapping_set:  # Note reverse order of k and v
                result_dict[k].append(v)
    return result_dict

Update: It will use O(total number of values in dict2) memory and O(total number of values in dict1) + O(total number of values in dict2) time; both are linear. It's not possible to solve the problem asymptotically faster, as every value in every dict has to be visited at least once.
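As a quick check, the set-of-pairs idea can be run on the sample data from the question. This is a self-contained sketch: the function name `keep_reciprocal` is made up for illustration, and the dictionaries are only the visible part of the truncated example.

```python
import collections

def keep_reciprocal(dict1, dict2):
    """Keep k -> v from dict1 only if dict2 contains v -> k."""
    # Every (key, value) pair present in dict2.
    pairs_in_dict2 = {(k, v) for k, vv in dict2.items() for v in vv}
    result = collections.defaultdict(list)
    for k, vv in dict1.items():
        for v in vv:
            if (v, k) in pairs_in_dict2:  # reversed order: look up v -> k
                result[k].append(v)
    return dict(result)  # plain dict, so later lookups cannot add keys

dict1 = {"a": ["w", "x"], "b": ["y"], "c": ["z"]}
dict2 = {"x": ["a", "b"], "y": ["b", "d"], "z": ["d"]}

print(keep_reciprocal(dict1, dict2))  # {'a': ['x'], 'b': ['y']}
```

Converting the defaultdict to a plain dict before returning is a design choice, not a requirement: it avoids silently creating empty entries when the caller later indexes a missing key.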

Given that you have Python objects to begin with, you may want to stay in the Python domain. If you need to iterate through the entire dict to create your matrix anyway, you may find that filtering the dict directly doesn't take much longer.

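default = ()
# Keep each value v of dict1[k] only if dict2 maps v back to k.
filtered = {k: [v for v in vv if k in dict2.get(v, default)] for k, vv in dict1.items()}
# Drop keys whose lists ended up empty.
result = {k: vv for k, vv in filtered.items() if vv}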

If your lists are short, this will be totally fine. If they contain many elements, the cost of the linear search will start to exceed the overhead of a set lookup. In that case, you may want to preprocess dict2 to contain sets rather than lists:

dict2 = {k: set(v) for k, v in dict2.items()}

or in-place:

for k, v in dict2.items():
    dict2[k] = set(v)
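Putting the set preprocessing and the filter together might look like the sketch below. It assumes the sample data from the question (only the visible part), and the names `dict2_sets`, `empty` and `kept` are illustrative rather than part of the original answer.

```python
dict1 = {"a": ["w", "x"], "b": ["y"], "c": ["z"]}
dict2 = {"x": ["a", "b"], "y": ["b", "d"], "z": ["d"]}

# Preprocess: sets give average O(1) membership tests instead of a linear scan.
dict2_sets = {k: set(v) for k, v in dict2.items()}

empty = frozenset()  # shared default for values that are not keys of dict2
result = {}
for k, vv in dict1.items():
    # Keep v only if dict2 maps v back to k.
    kept = [v for v in vv if k in dict2_sets.get(v, empty)]
    if kept:  # skip keys with no surviving mappings
        result[k] = kept

print(result)  # {'a': ['x'], 'b': ['y']}
```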
