Quick way to compare two big lists of dictionaries?
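I have two fairly large lists of dictionaries, each about 1 million items long. I want to compare the two lists and find any dictionaries in list1 that are not present in list2. I am using the following code to do this: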
def compare_lists(list1, list2):
    new_items = [i for i in list1 if i not in list2]
    return new_items
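It works as expected, but it is very slow: because of the length of the two lists, the comparison takes more than an hour to run.
Is there a way to make it run faster? I have to compare the complete dictionaries, not just certain items, because any key:value pair may differ between the two lists.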
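Approach
Use the idea from this answer to convert each list of dictionaries into a set. Dictionaries are not hashable, so each one is first serialized to a JSON string with sorted keys; the set difference then gives the items that appear only in the first list.
Code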
from json import dumps, loads

def find_difference(lst1, lst2):
    # Returns elements in lst1 which are not in lst2
    set1 = dics_to_set(lst1)
    set2 = dics_to_set(lst2)
    # Deserialize elements in set1 that are not in set2
    return [loads(x) for x in set1.difference(set2)]  # items in set1 that are not in set2

def dics_to_set(lst):
    '''
    Convert list of dicts to set of JSON strings
    '''
    return set(dumps(x, sort_keys=True) for x in lst)  # sort_keys to control order of keys
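Note that find_difference returns its result in arbitrary order and collapses duplicates, since it goes through a set. The sketch below is a variant of the same technique, not part of the original answer: the hypothetical find_difference_ordered serializes only the second list into a set and filters the first list against it, so the order and duplicates of lst1 are kept. The sample data is made up, and the sketch assumes every value is JSON-serializable.

```python
from json import dumps

def find_difference_ordered(lst1, lst2):
    # Hypothetical helper: serialize lst2 once; set lookups are O(1) on average
    seen = {dumps(d, sort_keys=True) for d in lst2}
    # Keep the original dict objects from lst1, in their original order
    return [d for d in lst1 if dumps(d, sort_keys=True) not in seen]

# Made-up sample data; key order differs between the two lists on purpose
list1 = [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"b": 6, "a": 5}]
list2 = [{"b": 2, "a": 1}, {"a": 5, "b": 6}]

new_items = find_difference_ordered(list1, list2)
print(new_items)  # [{'a': 3, 'b': 4}]
```

Because keys are sorted before serialization, dictionaries with the same pairs in a different insertion order count as equal. One difference from the original `not in` test: values are compared by their JSON text, so 1 and 1.0 are treated as different here even though Python's == treats them as equal.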
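Performance
Summary
With lists of about 100,000 dictionaries, the proposed code is roughly 260 times faster than the posted code (2.06 s versus 9 min 3 s).
Test setup:
lst2: 100,000 random dictionaries with 5 keys each
lst1: a copy of lst2 plus one additional random dictionary
Test code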
from random import randrange

def rand_dicts(n):
    '''
    Create random dictionary of n elements
    '''
    mydict = {}
    for i in range(n):
        mydict[f'key{i}'] = randrange(100)
    return mydict

# List of random dictionaries with 5 elements each
lst2 = [rand_dicts(5) for _ in range(100000)]
# Copy of list 2 with one more random dictionary added
lst1 = lst2 + [rand_dicts(1)]
Timing with the timeit module (via the %timeit magic in IPython)
# Test of Posted Code
%timeit [x for x in lst1 if x not in lst2]
# Output: 9min 3s ± 13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Test of proposed code
%timeit find_difference(lst1, lst2)
# Output: 2.06 s ± 90.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
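The %timeit lines above only work inside IPython or Jupyter. As a rough sketch of how the same comparison could be reproduced in a plain Python script, the block below uses the standard library's timeit.timeit. The helper names (rand_dict, slow_diff, fast_diff) and the much smaller list size are assumptions chosen so the quadratic version finishes quickly; absolute timings will differ from the figures quoted above.

```python
from json import dumps
from random import randrange
from timeit import timeit

def rand_dict(n):
    # Random dictionary with n keys, mirroring the test setup above
    return {f"key{i}": randrange(100) for i in range(n)}

def slow_diff(lst1, lst2):
    # Posted approach: a linear scan of lst2 for every item of lst1
    return [x for x in lst1 if x not in lst2]

def fast_diff(lst1, lst2):
    # Set-based approach: serialize lst2 once, then do hash lookups
    seen = {dumps(d, sort_keys=True) for d in lst2}
    return [d for d in lst1 if dumps(d, sort_keys=True) not in seen]

# Assumed size: 2,000 items instead of 100,000 to keep the slow run short
lst2 = [rand_dict(5) for _ in range(2000)]
lst1 = lst2 + [{"extra": -1}]  # one dictionary that cannot occur in lst2

slow_time = timeit(lambda: slow_diff(lst1, lst2), number=1)
fast_time = timeit(lambda: fast_diff(lst1, lst2), number=1)
print(f"list scan: {slow_time:.3f} s, set lookup: {fast_time:.3f} s")
```

The gap widens as the lists grow, because the list scan does work proportional to len(lst1) * len(lst2) while the set-based version is roughly proportional to len(lst1) + len(lst2).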