Python: Combine all dict key values based on one particular key being the same

Question

I know there are a million questions like this, but I can't find an answer that works for me.

I have this:

list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]

If the assembly_ids are the same, I want to combine the other matching keys in the dicts.

In this example, assembly_id 1 appears twice, so the input above would turn into:

[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}]

In theory an assembly_id can occur n times (i.e. assembly 1 could appear in the list 10 or 20 times, not just 2), and there can be up to two other lists to combine (asym_id_list and auth_id_list).

I was looking at this method:

new_dict = {}
assembly_list = [] #to keep track of assemblies already seen
for dict_name in list1: #for each dict in the list
        if dict_name['assembly_id'] not in assembly_list: #if the assembly id is new
                new_dict['assembly_id'] = dict_name #this line is wrong, add the entry to new_dict
                assembly_list.append(new_dict['assembly_id']) #append the id to 'assembly_list'
        else:
                new_dict['assembly_id'].append(dict_name) #else if it's already seen, append the dictionaries together, this is wrong
print(new_dict)

The output is wrong:

{'assembly_id': {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}}

But I think the idea is right: I should start a new list and dict and, if the id has not been seen before, append; whereas if it has been seen before... combine? It's just the specifics I'm not getting.

Answer 1

Use a dict keyed on assembly_id to collect all the data for a given key; you can then go back and generate a list of dicts in the original format if needed.

>>> from collections import defaultdict
>>> from typing import Dict, List
>>> id_lists: Dict[str, List[str]] = defaultdict(list)
>>> for d in list1:
...     id_lists[d['assembly_id']].extend(d['asym_id_list'])
...
>>> combined_list = [{
...     'assembly_id': id, 'asym_id_list': id_list
... } for id, id_list in id_lists.items()]
>>> combined_list
[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C']}]
>>>

(edit) I didn't see the bit about auth_id_list because it's hidden in the scroll in the original code. The same strategy applies: either use two dicts in the first step, or make it a dict of some collection of lists (e.g. a dict of dicts of lists, with the outer dict keyed on assembly_id values and the inner dict keyed on the original field name).
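
The dict-of-dicts-of-lists variant could look roughly like the sketch below. It assumes that every field other than assembly_id holds a list; the names combine and grouped are made up for illustration and are not part of the original answer.

```python
from collections import defaultdict

list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]

def combine(dicts):
    # Outer dict keyed on assembly_id, inner dict keyed on the original field name.
    grouped = defaultdict(lambda: defaultdict(list))
    for d in dicts:
        for field, values in d.items():
            if field != 'assembly_id':
                # Assumption: every other field holds a list.
                grouped[d['assembly_id']][field].extend(values)
    # Rebuild the original list-of-dicts format.
    return [{'assembly_id': key, **fields} for key, fields in grouped.items()]

combined_list = combine(list1)
```

Because the values are extended into fresh lists, the dicts in list1 are left unchanged, and any additional list-valued field would be picked up without further changes.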

Answer 2

Your logic is on the right track. We can use a dictionary m that maps each assembly_id to its corresponding dictionary, which keeps track of the assembly_ids already visited. Whenever a new assembly_id is encountered we add it to m; if m already contains that assembly_id, we just extend the asym_id_list and auth_id_list for it:

def merge(dicts):
    m = {}  # maps each visited assembly_id to its merged dict
    for d in dicts:
        key = d['assembly_id']  # assembly_id is used as merge/grouping key
        if key in m:
            # Two separate ifs (not if/elif), so both lists are merged when both are present
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', []) + d['asym_id_list']
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', []) + d['auth_id_list']
        else:
            m[key] = dict(d)  # shallow copy, so later merges do not modify the input dict
    return list(m.values())

Result:

# merge(list1)
[
    {
        'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']
    },
    {
        'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']
    }
]
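
A note on storing the first dict seen for each id: putting a dict into m does not copy it, so updating a key on the stored entry also updates the corresponding element of the input list. The following minimal sketch shows that aliasing behaviour and how a shallow copy avoids it; all names in it are illustrative only.

```python
original = {'assembly_id': '1', 'asym_id_list': ['A', 'B']}

# Storing the dict itself: both names refer to the same object.
aliased = {}
aliased['1'] = original
aliased['1']['asym_id_list'] = aliased['1']['asym_id_list'] + ['C']
# original['asym_id_list'] has changed too, because it is the same dict.

fresh = {'assembly_id': '1', 'asym_id_list': ['A', 'B']}

# Storing a shallow copy: rebinding a key on the copy leaves the input alone.
copied = {}
copied['1'] = dict(fresh)
copied['1']['asym_id_list'] = copied['1']['asym_id_list'] + ['C']
# fresh['asym_id_list'] is unchanged.
```

Whether this matters depends on whether list1 is used again after the merge.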

Answer 3

@Samwise has provided a good answer to the question you asked, and this is not intended to replace it. However, I am going to suggest a change to the way you keep the data after the merge. I would put this in a comment, but there is no way to keep code formatting in a comment, and it is a bit too big as well.

Before that, I think you have a typo in your example data. I think you meant the 'D,C' in 'assembly_id':2,'asym_id_list':['D,C'] to be separate strings, like this: 'assembly_id':2,'asym_id_list':['D', 'C']. I am going to assume that below, but if not, it does not change any of the code or comments.

Currently the merged structure is a list of dictionaries like this:

merge_l = [
            {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          ]

Instead, I would recommend using a dictionary keyed by the value of assembly_id as the top-level structure, rather than a list. So it would be a dictionary whose values are dictionaries, like this:

merge_d = { '1': {'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            '2': {'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          }

or, if you want to keep the 'assembly_id' as well, like this:

merge_d = { '1': {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            '2': {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          }

That last one can be achieved by changing the return statement of @Samwise's merge() function to return m instead of converting the dict to a list.

One other comment on @Samwise's code, just so you are aware of it: the combined lists can contain duplicates. So if the original data had 'asym_id_list': ['A', 'B'] in one entry and 'asym_id_list': ['B', 'C'] in another, the combined list would contain 'asym_id_list': ['A', 'B', 'B', 'C']. That could be what you want, but if you want to avoid it you could use sets instead of lists as the internal containers for the asym_id and auth_id values.

In @Samwise's answer, change it to something like this:
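
To illustrate why the keyed form is convenient, here is a small sketch that re-keys an already merged list and looks up one assembly directly. The name by_id is hypothetical. Note that the keys keep whatever type the ids had, so with the sample data the second key would be the integer 2 rather than the string '2' unless the ids are normalised first.

```python
merge_l = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']},
]

# Re-key the list by assembly_id (by_id is an illustrative name).
by_id = {entry['assembly_id']: entry for entry in merge_l}

# Direct lookup instead of scanning the list for a matching assembly_id.
first = by_id['1']['asym_id_list']

# Keys keep their original types: the second id is the int 2, not the string '2'.
has_int_key = 2 in by_id
has_str_key = '2' in by_id
```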

def merge(dicts):
    m = {} # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id'] # assembly_id is used as merge/grouping key
        if key in m:
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', set()) | set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', set()) | set(d['auth_id_list'])
        else:
            m[key] = {'assembly_id': d['assembly_id']}
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = set(d['auth_id_list'])
    return m

If you go this way, you might want to reconsider the key names 'asym_id_list' and 'auth_id_list', since the values are sets, not lists. But that may be constrained by the surrounding code and what it expects.
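
If the order of the ids matters, or the values need to stay lists, one possible alternative to sets is to de-duplicate with dict.fromkeys, which keeps the first occurrence of each item in insertion order (dicts preserve insertion order from Python 3.7 on). This is only a sketch under the assumption that every field other than the key holds a list of hashable items; merge_unique is a made-up name.

```python
def merge_unique(dicts, key='assembly_id'):
    merged = {}
    for d in dicts:
        # First sighting of an id creates a fresh entry; later ones reuse it.
        entry = merged.setdefault(d[key], {key: d[key]})
        for field, values in d.items():
            if field != key:
                # dict.fromkeys drops repeats while keeping first-seen order.
                entry[field] = list(dict.fromkeys(entry.get(field, []) + list(values)))
    return merged

sample = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B']},
    {'assembly_id': '1', 'asym_id_list': ['B', 'C']},
]
result = merge_unique(sample)
```

With this variant the key names ending in _list remain accurate, at the cost of rebuilding each list on every merge.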

Answer 1
1 2020-06-04 15:53:06

Answer 2
1 Accepted 2020-06-04 15:58:03

Answer 3
0 2020-06-04 17:29:14
