Python: Combine all dict key values based on one particular key being the same
I know there are a million questions like this, but I can't find an answer that works for me.
I have this:
list1 = [{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']}, {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']}, {'assembly_id':2,'asym_id_list':['D,C'],'auth_id_list':['C','V']}]
If the assembly_ids are the same, I want to combine the other matching keys in the dicts.
In this example, assembly_id 1 appears twice, so the input above would turn into:
[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H','C', 'D', 'F', 'I', 'J']},{'assembly_id':2,'asym_id_list':['D,C'],'auth_id_list':['C','V']}]
In theory there can be n assembly_ids (i.e. assembly 1 could appear in the list 10 or 20 times, not just 2) and there can be up to two other lists to combine (asym_id_list and auth_id_list).
I was looking at this method:
new_dict = {}
assembly_list = []  # to keep track of assemblies already seen
for dict_name in list1:  # for each dict in the list
    if dict_name['assembly_id'] not in assembly_list:  # if the assembly id is new
        new_dict['assembly_id'] = dict_name  # this line is wrong, add the entry to new_dict
        assembly_list.append(new_dict['assembly_id'])  # append the id to 'assembly_list'
    else:
        new_dict['assembly_id'].append(dict_name)  # else if it's already seen, append the dictionaries together, this is wrong
print(new_dict)
The output is wrong:
{'assembly_id': {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}}
But I think the idea is right: I should create a new list and dict and, if the assembly_id has not been seen before, append; whereas if it has been seen before... combine? It's just the specifics I'm not getting.
Use a dict keyed on assembly_id to collect all the data for a given key; you can then go back and generate a list of dicts in the original format if needed.
>>> from collections import defaultdict
>>> from typing import Dict, List
>>> id_lists: Dict[str, List[str]] = defaultdict(list)
>>> for d in list1:
...     id_lists[d['assembly_id']].extend(d['asym_id_list'])
...
>>> combined_list = [{
...     'assembly_id': assembly_id, 'asym_id_list': id_list
... } for assembly_id, id_list in id_lists.items()]
>>> combined_list
[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C']}]
(edit) I didn't see the bit about auth_id_list because it was hidden behind the scroll in the original code. The same strategy applies: either use two dicts in the first step, or make it a dict of some collection of lists (e.g. a dict of dicts of lists, with the outer dict keyed on assembly_id values and the inner dict keyed on the original field name).
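The dict-of-dicts-of-lists variant mentioned in the edit could look roughly like this. It is only a sketch: the function name merge_by_assembly_id is made up for illustration, and it assumes every key other than assembly_id holds a list.

```python
from collections import defaultdict


def merge_by_assembly_id(dicts):
    # Outer dict keyed on assembly_id, inner dict keyed on the original field name.
    grouped = defaultdict(lambda: defaultdict(list))
    for d in dicts:
        fields = grouped[d['assembly_id']]  # creates the entry even if d has no lists
        for field, values in d.items():
            if field != 'assembly_id':
                # Assumption: every other field holds a list.
                fields[field].extend(values)
    # Rebuild the original list-of-dicts format.
    return [{'assembly_id': key, **fields} for key, fields in grouped.items()]


list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]
print(merge_by_assembly_id(list1))
```

Because the inner loop does not name the fields, this picks up auth_id_list (or any other list-valued key) without extra code.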
Your logic is on the right track. We can use a dictionary m that maps each assembly_id to its corresponding dictionary, to keep track of the assembly_ids already visited. Whenever a new assembly_id is encountered we add it to m; otherwise, if m already contains the assembly_id, we just extend the asym_id_list and auth_id_list for that assembly_id:
def merge(dicts):
    m = {}  # maps each visited assembly_id to its merged dict
    for d in dicts:
        key = d['assembly_id']  # assembly_id is used as merge/grouping key
        if key in m:
            # Two separate ifs (not if/elif) so both lists are merged when a dict has both.
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', []) + d['asym_id_list']
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', []) + d['auth_id_list']
        else:
            m[key] = dict(d)  # shallow copy so the input dicts are not modified
    return list(m.values())
Result:
# merge(list1)
[
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}
]
@Samwise has provided a good answer to the question you asked, and this is not intended to replace it. However, I am going to make a suggestion about the way you keep the data after the merge. I would put this in a comment, but there is no way to keep code formatting in a comment and it is a bit too big as well.
Before that, I think that you have a typo in your example data. I think that you meant the 'D,C' in 'assembly_id':2,'asym_id_list':['D,C'] to be separate strings, like this: 'assembly_id':2,'asym_id_list':['D', 'C']. I am going to assume that below, but if not, it does not change any of the code or comments.
Currently the merged structure is a list of dictionaries like this:
merge_l = [
{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
{'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
]
Instead, I would recommend not using a list as the top-level structure, but a dictionary keyed by the value of the assembly_id. So it would be a dictionary whose values are dictionaries. Like this:
merge_d = { '1': {'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
'2': {'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
}
or, if you want to keep the 'assembly_id' as well, like this:
merge_d = { '1': {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
'2': {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
}
That last one can be achieved by just changing the return from @Samwise's merge() method to return m instead of converting the dict to a list.
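To illustrate why the keyed form can be convenient, here is a small sketch comparing lookups in the two structures. The variable names are only illustrative. Note that the keys keep whatever type assembly_id had in the input, so with the original data one key is the string '1' and the other is the integer 2.

```python
merge_l = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']},
]

# Re-key the list form by assembly_id (the same shape as returning m from merge()).
merge_d = {entry['assembly_id']: entry for entry in merge_l}

# List form: scan until the matching entry is found.
found = next(e for e in merge_l if e['assembly_id'] == 2)

# Dict form: direct lookup by key.
print(merge_d[2] is found)
print(merge_d['1']['asym_id_list'])
```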
One other comment on @Samwise's code, just so you are aware of it, is that the combined lists can contain duplicates. So if the original data had 'asym_id_list': ['A', 'B'] in one entry and 'asym_id_list': ['B', 'C'] in another, the combined list would contain 'asym_id_list': ['A', 'B', 'B', 'C']. That could be what you want, but if you want to avoid it you could use sets instead of lists as the internal containers for asym_id and auth_id.
In @Samwise's answer, change it to something like this:
def merge(dicts):
    m = {}  # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id']  # assembly_id is used as merge/grouping key
        if key in m:
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', set()) | set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', set()) | set(d['auth_id_list'])
        else:
            m[key] = {'assembly_id': d['assembly_id']}
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = set(d['auth_id_list'])
    return m
If you go this way, you might want to reconsider the key names 'asym_id_list' and 'auth_id_list', since they are sets, not lists. But that may be constrained by the other code around this and what it expects.
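One caveat with sets is that they do not preserve the order in which items were first seen. If order matters to you (an assumption about your requirements, not something stated in the question), one alternative is to keep lists and drop duplicates with dict.fromkeys, which preserves insertion order in Python 3.7+. The function name merge_unique below is hypothetical.

```python
def merge_unique(dicts):
    merged = {}
    for d in dicts:
        # Create the entry on first sight of this assembly_id, reuse it afterwards.
        entry = merged.setdefault(d['assembly_id'], {'assembly_id': d['assembly_id']})
        for field in ('asym_id_list', 'auth_id_list'):
            if field in d:
                # dict.fromkeys drops duplicates but keeps first-seen order.
                entry[field] = list(dict.fromkeys(entry.get(field, []) + d[field]))
    return merged


example = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B']},
    {'assembly_id': '1', 'asym_id_list': ['B', 'C']},
]
print(merge_unique(example))
# {'1': {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'C']}}
```

This keeps the values as lists, so the key names 'asym_id_list' and 'auth_id_list' stay accurate.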