Accelerate comparing dictionary keys and values to strings in a list in Python
Sorry if this is trivial, I'm still learning. I have a list of dictionaries that looks as follows:
[{'1102': ['00576', '00577', '00578', '00579', '00580', '00581']},
{'1102': ['00582', '00583', '00584', '00585', '00586', '00587']},
{'1102': ['00588', '00589', '00590', '00591', '00592', '00593']},
{'1102': ['00594', '00595', '00596', '00597', '00598', '00599']},
{'1102': ['00600', '00601', '00602', '00603', '00604', '00605']}
...]
It contains ~89000 dictionaries. I also have a list containing 4473208 paths. Example:
['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv',
'/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv',
...]
What I want to do is, for each dictionary, collect into one group every path whose folder name contains the dictionary's key and whose file name starts with one of the values listed under that key.
I tried using for loops like this:
grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for file in ct_paths:
        for key, val in elem.items():
            if (file[16:20] == key) and (any(x in file[21:26] for x in val)):
                temp1.append(file)
    grpd_cts.append(temp1)
But this takes around 30 hours. Is there a way to make it more efficient? Any itertools function or something?
Thanks a lot!
ct_paths is iterated in full in your inner loop, once for every dictionary, and you're only interested in a small part of each path for testing purposes. Pull that part out once and use it to index the rest of your data, as a dictionary.
What makes your problem complicated is that you want to end up with the original list of filenames, so you need to construct a two-level dictionary where the values are lists of all the original paths grouped under those two keys.
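from tqdm import tqdm

# Build the index once: folder key -> file prefix -> list of original paths.
ct_path_index = {}
for f in ct_paths:
    ct_path_index.setdefault(f[16:20], {}).setdefault(f[21:26], []).append(f)

grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for key, val in elem.items():
        # All paths whose folder matches this key, grouped by file prefix.
        d2 = ct_path_index.get(key)
        if d2:
            for v in val:
                # Paths for this exact prefix, if any exist.
                v2 = d2.get(v)
                if v2:
                    temp1 += v2
    grpd_cts.append(temp1)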
ct_path_index looks like this, using your data:
{'1102': {'00575': ['/****/**/******_1102/00575***...**0CT.csv',
                    '/****/**/******_1102/00575***...**1CT.csv',
                    '/****/**/******_1102/00575***...**2CT.csv',
                    '/****/**/******_1102/00575***...**3CT.csv',
                    '/****/**/******_1102/00575***...**4CT.csv'],
          '00578': ['/****/**/******_1102/00578***...**1CT.csv',
                    '/****/**/******_1102/00578***...**2CT.csv',
                    '/****/**/******_1102/00578***...**3CT.csv']}}
The use of setdefault (which can be a little hard to understand the first time you see it) is important when building up collections of collections, and is very common in these kinds of cases: it makes sure that the sub-collections are created on demand and then re-used for a given key.
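To make the setdefault pattern concrete, here is a minimal sketch on made-up miniature data. The `records` triples and their path strings are hypothetical stand-ins for the sliced pieces of each path. The second half shows `collections.defaultdict` as an alternative way to get the same structure; it is not what the code above uses.

```python
from collections import defaultdict

# Hypothetical miniature data: (folder key, file prefix, path) triples.
records = [
    ("1102", "00575", "a_0CT.csv"),
    ("1102", "00575", "a_1CT.csv"),
    ("1102", "00578", "b_1CT.csv"),
]

index = {}
for folder, prefix, path in records:
    # First setdefault: fetch (or create) the inner dict for this folder.
    # Second setdefault: fetch (or create) the list for this prefix.
    index.setdefault(folder, {}).setdefault(prefix, []).append(path)

# Alternative: nested defaultdicts create the sub-collections implicitly.
index_dd = defaultdict(lambda: defaultdict(list))
for folder, prefix, path in records:
    index_dd[folder][prefix].append(path)
```

One thing to watch with the defaultdict variant: reading a missing key with square brackets silently creates an empty entry, so lookups are safer with `.get()`, as in the grouping loop above.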
Now the expensive pass over ct_paths happens only once, while building the index, instead of once per dictionary; the inner checks are done using dictionary lookups, which are close to O(1).
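Before running the indexed version on the full data, it may be worth checking on a small sample that it produces the same groups as the original loops. The sketch below does that. The `make_path` helper and its path layout are hypothetical, chosen only so that the key lands at `[16:20]` and the prefix at `[21:26]` as in the question.

```python
def make_path(key, val, n):
    # Hypothetical layout: 16 characters, then the 4-char key, "/", 5-char prefix.
    return f"/aaaa/bb/cccccc_{key}/{val}_{n}CT.csv"

ct_paths = [make_path("1102", v, n)
            for v in ("00575", "00576", "00578", "00582")
            for n in range(2)]
ct_paths.append(make_path("1103", "00576", 0))

dict_list = [
    {"1102": ["00576", "00577", "00578"]},
    {"1102": ["00582", "00583"]},
    {"1104": ["00576"]},
]

def group_brute_force(dict_list, ct_paths):
    # The question's approach: scan every path for every dictionary.
    groups = []
    for elem in dict_list:
        group = []
        for file in ct_paths:
            for key, val in elem.items():
                if file[16:20] == key and any(x in file[21:26] for x in val):
                    group.append(file)
        groups.append(group)
    return groups

def group_indexed(dict_list, ct_paths):
    # The answer's approach: build the two-level index once, then look up.
    index = {}
    for f in ct_paths:
        index.setdefault(f[16:20], {}).setdefault(f[21:26], []).append(f)
    groups = []
    for elem in dict_list:
        group = []
        for key, val in elem.items():
            d2 = index.get(key, {})
            for v in val:
                group += d2.get(v, [])
        groups.append(group)
    return groups

brute = group_brute_force(dict_list, ct_paths)
indexed = group_indexed(dict_list, ct_paths)

# Sort within each group: the two versions may order paths differently.
same = [sorted(g) for g in brute] == [sorted(g) for g in indexed]
print(same, [len(g) for g in indexed])
```

Note that the indexed version matches the prefix exactly, while the original tests `x in file[21:26]`. The two agree here because every value is exactly five characters long, the same length as the slice.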
Other optimizations would include turning the lists in dict_list into sets, which would be worthwhile if you made more than one pass through dict_list.
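One possible reading of the set suggestion, sketched under assumptions: convert each value list to a frozenset once, then intersect it with the keys of the inner index dictionary so that only prefixes that actually exist are visited. The tiny `ct_path_index` and the `p0`, `p1`, `p2` path names are hypothetical placeholders, and `set_list` is a name introduced here for illustration.

```python
# Hypothetical, already-built index: folder key -> prefix -> paths.
ct_path_index = {"1102": {"00575": ["p0", "p1"], "00578": ["p2"]}}

dict_list = [
    {"1102": ["00576", "00577", "00578"]},
    {"1102": ["00575", "00578"]},
]

# Convert the value lists to frozensets once, up front.
set_list = [{key: frozenset(val) for key, val in elem.items()}
            for elem in dict_list]

grpd_cts = []
for elem in set_list:
    group = []
    for key, wanted in elem.items():
        d2 = ct_path_index.get(key, {})
        # Dictionary key views support set operations such as intersection.
        # sorted() keeps the output order deterministic.
        for v in sorted(d2.keys() & wanted):
            group.extend(d2[v])
    grpd_cts.append(group)

print(grpd_cts)
```

Whether this is faster than the plain lookups depends on the data; for a single pass with six values per dictionary the difference is likely to be small.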