
Python: Combine all dict key values based on one particular key being the same

I know there are a million questions like this; I just can't find an answer that works for me.

I have this:

list1 =   [{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']}, {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']}, {'assembly_id':2,'asym_id_list':['D,C'],'auth_id_list':['C','V']}]

If the assembly_ids are the same, I want to combine the values of the other matching keys in the dicts.

In this example, assembly_id '1' appears twice, so the input above would turn into:

[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}]

In theory there can be n assembly_ids (i.e. assembly 1 could appear in the list 10 or 20 times, not just 2), and there can be up to two other lists to combine (asym_id_list and auth_id_list).

I was looking at this method:

new_dict = {}
assembly_list = [] #to keep track of assemblies already seen
for dict_name in list1: #for each dict in the list
        if dict_name['assembly_id'] not in assembly_list: #if the assembly id is new
                new_dict['assembly_id'] = dict_name #this line is wrong, add the entry to new_dict
                assembly_list.append(new_dict['assembly_id']) #append the id to 'assembly_list'
        else:
                new_dict['assembly_id'].append(dict_name) #else if it's already seen, append the dictionaries together, this is wrong
print(new_dict)

The output is wrong:

{'assembly_id': {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}}

But I think the idea is right: I should open a new list and dict, and if the id has not been seen before, append; whereas if it has been seen before... combine? It's just the specifics I'm not getting.

Use a dict keyed on assembly_id to collect all the data for a given key; you can then go back and generate a list of dicts in the original format if needed.

>>> from collections import defaultdict
>>> from typing import Dict, List
>>> id_lists: Dict[str, List[str]] = defaultdict(list)
>>> for d in list1:
...     id_lists[d['assembly_id']].extend(d['asym_id_list'])
...
>>> combined_list = [{
...     'assembly_id': id, 'asym_id_list': id_list
... } for id, id_list in id_lists.items()]
>>> combined_list
[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C']}]
>>>

(edit) I didn't see the bit about auth_id_list at first because it's hidden in the scroll in the original code. The same strategy applies: either use two dicts in the first step, or make it a dict of some collection of lists (e.g. a dict of dicts of lists, with the outer dict keyed on assembly_id values and the inner dict keyed on the original field name).
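The dict-of-dicts-of-lists idea from the edit above could be sketched roughly as follows. This is only one possible reading of it: the function name `combine_by_assembly` and its `key` parameter are made up for illustration, and the sketch assumes every field other than `assembly_id` holds a list.

```python
from collections import defaultdict


def combine_by_assembly(records, key="assembly_id"):
    """Group records by `key`, concatenating every other list-valued field."""
    # Outer dict: assembly_id -> inner dict; inner dict: field name -> merged list.
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        for field, values in record.items():
            if field != key:
                grouped[record[key]][field].extend(values)
    # Rebuild the original list-of-dicts shape.
    return [{key: k, **fields} for k, fields in grouped.items()]


list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]
combined = combine_by_assembly(list1)
print(combined)
```

Because the inner dict is keyed on the field name, the same loop handles asym_id_list, auth_id_list, or any other list field without naming them explicitly.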

Your reasoning is on the right track. We can use a dictionary m that maps each assembly_id to its corresponding dictionary, which also keeps track of the assembly_ids already visited. Whenever a new assembly_id is encountered, we add it to m; if m already contains that assembly_id, we just extend its asym_id_list and auth_id_list:

def merge(dicts):
    m = {} # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id'] # assembly_id is used as merge/grouping key
        if key in m:
            # two independent checks, so an entry carrying both lists merges both
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', []) + d['asym_id_list']
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', []) + d['auth_id_list']
        else:
            m[key] = dict(d) # shallow copy, so the input dicts are not modified
    return list(m.values())

Result:

# merge(list1)
[
    {
        'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']
    },
    {
        'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']
    }
]

@Samwise has provided a good answer to the question you asked, and this is not intended to replace that. However, I am going to make a suggestion about the way you keep the data after the merge. I would put this in a comment, but there is no way to keep code formatting in a comment, and it is a bit too big as well.

Before that, I think you have a typo in your example data. I think you meant the 'D,C' in 'assembly_id':2,'asym_id_list':['D,C'] to be separate strings, like this: 'assembly_id':2,'asym_id_list':['D', 'C']. I am going to assume that below, but if not, it does not change any of the code or comments.

Instead of the merged structure being a list of dictionaries like this:

merge_l = [
            {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          ]

I would recommend using a dictionary keyed by the value of assembly_id as the top-level structure rather than a list. So it would be a dictionary whose values are dictionaries. Like this:

merge_d = { '1': {'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            2: {'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          }

or, if you want to keep the 'assembly_id' as well, like this:

merge_d = { '1': {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            2: {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          }

That last one can be achieved by changing the return statement of @Samwise's merge() function to return m instead of converting the dict to a list.
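To illustrate what the keyed structure buys you, here is a small sketch that builds it from an already merged list and looks an assembly up directly. The helper name `index_by_assembly` is hypothetical and not part of any answer above. The sketch keeps the mixed key types from the question's data (the string '1' and the integer 2), so a lookup has to use the matching type.

```python
def index_by_assembly(merged_list, key="assembly_id"):
    """Turn a merged list of dicts into a dict keyed by the assembly_id value."""
    return {d[key]: d for d in merged_list}


merge_l = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']},
]
merge_d = index_by_assembly(merge_l)

# Direct lookup instead of scanning the list for a matching assembly_id.
auth_ids = merge_d[2]['auth_id_list']
print(auth_ids)
```

With the list form, finding one assembly means looping over every entry; with the dict form it is a single key lookup.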

One other comment on @Samwise's code, just so you are aware of it, is that the combined lists can contain duplicates. So if the original data had 'asym_id_list': ['A', 'B'] in one entry and 'asym_id_list': ['B', 'C'] in another, the combined list would contain 'asym_id_list': ['A', 'B', 'B', 'C']. That could be what you want, but if you want to avoid it, you could use sets instead of lists as the internal containers for the asym_id and auth_id values.

In @Samwise's answer, change it to something like this:

def merge(dicts):
    m = {} # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id'] # assembly_id is used as merge/grouping key
        if key in m:
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', set()) | set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', set()) | set(d['auth_id_list'])
        else:
            m[key] = {'assembly_id': d['assembly_id']}
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = set(d['auth_id_list'])
    return m

If you go this way, you might want to reconsider the key names 'asym_id_list' and 'auth_id_list', since the values are sets, not lists. But that may be constrained by the other code around this and what it expects.
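If the surrounding code does expect lists, or if the order of the ids matters (sets do not preserve order), one alternative is to drop duplicates while keeping first-seen order with dict.fromkeys. The following is a sketch under those assumptions; `merge_unique` is a hypothetical name, and it treats every field other than the grouping key as a list.

```python
def merge_unique(dicts, key="assembly_id"):
    """Merge dicts sharing `key`; list fields are concatenated without duplicates."""
    merged = {}
    for d in dicts:
        # Create the entry on first sight of this id, otherwise reuse it.
        entry = merged.setdefault(d[key], {key: d[key]})
        for field, values in d.items():
            if field == key:
                continue
            combined = entry.get(field, []) + list(values)
            # dict.fromkeys keeps the first occurrence of each item, in order.
            entry[field] = list(dict.fromkeys(combined))
    return merged


data = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B']},
    {'assembly_id': '1', 'asym_id_list': ['B', 'C']},
]
result = merge_unique(data)
print(result)
```

The values stay plain lists, so the existing key names remain accurate and the result can be serialized (for example to JSON) without converting sets first.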
