How can I find dict keys for matching values in two dicts?

I have two dictionaries mapping IDs to values. For simplicity, let's say these are the dictionaries:

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

As the names suggest, the dictionaries are not symmetrical. I would like to get a dictionary of keys from d_source and d_target whose values match. The resulting dictionary would have the d_source keys as its own keys, and the matching d_target keys as each key's value (in either a list, tuple or set format).

The expected return value for the above example is the following dictionary:

{'a': ('1', 'A'),
 'b': ('B',),
 'c': ('C',),
 '3': ('C',)}
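
To pin down that expected result, here is a deliberately naive reference sketch. It is quadratic, so it is not a candidate answer to the performance question. The name reference_matches is made up for illustration, and two details are assumptions that the question does not state: the tuples are sorted only to make the output deterministic, and source keys without any match are left out.

```python
d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

def reference_matches(source, target):
    """Naive O(len(source) * len(target)) reference (hypothetical helper)."""
    result = {}
    for s_key, s_val in source.items():
        # Collect every target key whose value equals this source value.
        matches = tuple(sorted(t_key for t_key, t_val in target.items()
                               if t_val == s_val))
        if matches:  # assumption: unmatched source keys are omitted
            result[s_key] = matches
    return result

expected = reference_matches(d_source, d_target)
```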

There are two somewhat similar questions, but their solutions can't easily be applied to mine.

Some characteristics of the data:

  1. Source would usually be smaller than target: roughly a few thousand sources (at most) and an order of magnitude more targets.
  2. Duplicate values within the same dict (either d_source or d_target) are not very likely.
  3. Matches are expected to be found for (as a rough estimate) no more than 50% of the d_source items.
  4. All keys are integers.

What is the best (performance-wise) solution to this problem? Modeling the data into other datatypes for improved performance is totally OK, even if that means using third-party libraries (I'm thinking numpy).

All the other answers have O(n^2) efficiency, which isn't very good, so I thought I'd answer this myself.

I use 2(source_len) + 2(dict_count)(dict_len) memory and run in O(2n), i.e. linear, time, which I believe is the best you can get here.

Here you go:

from collections import defaultdict

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

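def merge_dicts(source_dict, *rest):
    # Invert the target dicts: value -> list of keys that carry that value.
    flipped_rest = defaultdict(list)
    for d in rest:
        # popitem() empties d as we go, so the target dicts are consumed.
        while d:
            k, v = d.popitem()
            flipped_rest[v].append(k)
    # Look up each source value in the inverted mapping.
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}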

new_dict = merge_dicts(d_source, d_target)

By the way, I'm using a tuple so that the resulting lists are not linked together: source keys that share a value would otherwise all point at the same list object.
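
To illustrate the remark about tuples, the sketch below builds the result both ways for two source keys that share a value. The variable names (flipped, separate, shared) are chosen for this example only and do not appear in the answer.

```python
from collections import defaultdict

d_source = {'c': 3, '3': 3}   # two source keys sharing the value 3
d_target = {'C': 3}

# Invert the target: value -> list of keys.
flipped = defaultdict(list)
for k, v in d_target.items():
    flipped[v].append(k)

# With tuple(): each source key gets its own immutable copy.
separate = {k: tuple(flipped.get(v, ())) for k, v in d_source.items()}

# Without tuple(): both source keys refer to the very same list object.
shared = {k: flipped.get(v, []) for k, v in d_source.items()}
shared['c'].append('oops')   # this also changes shared['3']
```

After the append, shared['3'] contains 'oops' as well, while the tuples in separate are unaffected.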


As you've added specifications for the data, here's a more closely matching solution:

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

def second_merge_dicts(source_dict, *rest):
    """Optimized for ~50% source match due to if statement addition.

    Also uses less memory.
    """
    unique_values = set(source_dict.values())
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            if v in unique_values:
                flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

new_dict = second_merge_dicts(d_source, d_target)

from collections import defaultdict
from pprint import pprint

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

d_result = defaultdict(list)
# Compare every source key with every target key (quadratic).
for a in d_source:
    for b in d_target:
        if d_source[a] == d_target[b]:
            d_result[a].append(b)

pprint(d_result)

Output:

{'3': ['C'],
 'a': ['A', '1'],
 'b': ['B'],
 'c': ['C']}

Timing results:

from collections import defaultdict
from copy import deepcopy
from random import randint
from timeit import timeit


def Craig_match(source, target):
    result = defaultdict(list)
    {result[a].append(b) for a in source for b in target if source[a] == target[b]}
    return result

def Bharel_match(source_dict, *rest):
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

def modified_Bharel_match(source_dict, *rest):
    """Optimized for ~50% source match due to if statement addition.

    Also uses less memory.
    """
    unique_values = set(source_dict.values())
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            if v in unique_values:
                flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

# generate source, target such that:
# a) ~10% duplicate values in source and target
# b) 2000 unique source keys, 20000 unique target keys
# c) a little less than 50% matches source value to target value
# d) numeric keys and values
source = {}
for k in range(2000):
    source[k] = randint(0, 1800)
target = {}
for k in range(20000):
    if k < 1000:
        target[k] = randint(0, 2000)
    else:
        target[k] = randint(2000, 19000)

best_time = {}
approaches = ('Craig', 'Bharel', 'modified_Bharel')
for a in approaches:
    best_time[a] = None

for _ in range(3):
    for approach in approaches:
        test_source = deepcopy(source)
        test_target = deepcopy(target)

        statement = 'd=' + approach + '_match(test_source,test_target)'
        setup = 'from __main__ import test_source, test_target, ' + approach + '_match'
        t = timeit(stmt=statement, setup=setup, number=1)
        if not best_time[approach] or (t < best_time[approach]):
            best_time[approach] = t

for approach in approaches:
    print(approach, ':', '%0.5f' % best_time[approach])

Output:

Craig : 7.29259
Bharel : 0.01587
modified_Bharel : 0.00682
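
Both Bharel variants empty the target dicts through popitem(), which is why the timing script deep-copies its inputs before every run. If the targets have to survive the call, a non-destructive variant could look like the sketch below. The name match_keys is hypothetical, and this variant was not part of the timings above; the likely trade-off is a higher peak memory use, because the targets are not released while they are being processed.

```python
from collections import defaultdict

def match_keys(source, *targets):
    """Non-destructive variant of modified_Bharel_match (hypothetical name)."""
    wanted = set(source.values())
    flipped = defaultdict(list)
    for target in targets:
        # Iterate instead of calling popitem(), so the targets stay intact.
        for key, value in target.items():
            if value in wanted:
                flipped[value].append(key)
    # Source keys without a match map to an empty tuple, as in the original.
    return {key: tuple(flipped.get(value, ())) for key, value in source.items()}

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}
result = match_keys(d_source, d_target)
```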

Here is another solution. There are a lot of ways to do this.

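d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'A': 1, 'B': 2}

for key1 in d1:
    for key2 in d2:
        if d1[key1] == d2[key2]:
            print(key1, key2)  # placeholder: do whatever you need with the match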

Note that you can use any name for key1 and key2.

This may be "cheating" in some regards, but if you are looking for the matching values of the keys regardless of case, then you might be able to do:

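aa = {'a': 1, 'b': 2, 'c': 3}
bb = {'A': 1, 'B': 2, 'd': 3}

# Lowercase the keys of bb so they can be compared with the keys of aa.
bbl = {k.lower(): v for k, v in bb.items()}

# Keep only the (key, value) pairs that appear in both dicts.
result = {k: k.upper() for k, v in aa.items() & bbl.items()}
print(result)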

Output:

{'a': 'A', 'b': 'B'}

The bbl declaration changes the bb keys into lowercase (it could be done on either aa or bb).

* I only tested this on my phone, so I'm just throwing the idea out there, I suppose... Also, you've changed your question radically since I began composing my answer, so you get what you get.

It is up to you to determine the best solution. Here is a solution:

def dicts_to_tuples(*dicts):
    # Group the keys of all dicts by the value they map to.
    result = {}
    for d in dicts:
        for k, v in d.items():
            result.setdefault(v, []).append(k)
    # Keep only the values that are shared by more than one key.
    return [tuple(v) for v in result.values() if len(v) > 1]

d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'A': 1, 'B': 2}
print(dicts_to_tuples(d1, d2))
