How to efficiently check if an element is in a list of lists in Python
I have a list of lists as follows.
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
I also have a list of concepts, as shown below.
myconcepts = ["method", "standing"]
I want to see how many records in `mylist` contain each concept in `myconcepts`, i.e.:
"method" = 2 times in records (i.e. in `52749` and `5274923`)
"standing" = 2 times in records
My current code is as follows.
mycounting = 0
for concept in myconcepts:
    for item in mylist:
        if concept in item[1]:
            mycounting = mycounting + 1
print(mycounting)
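However, my actual `mylist` is very large, with about 5 million records. The `myconcepts` list has about 10,000 concepts.
With my current code, it takes almost 1 minute to get the count for a single concept, which is very slow.
What is the most efficient way to do this in Python?
For testing purposes, I have attached a small portion of my dataset at: https://drive.google.com/file/d/1z6FsBtLyDZClod9hK8nK4syivZToa7ps/view?usp=sharing
I am happy to provide more details if needed.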
You can flatten the input and then use collections.Counter:
import collections
myconcepts = ["method", "standing"]
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [5274921, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"]]]
def flatten(d):
    for i in d:
        yield from [i] if not isinstance(i, list) else flatten(i)
r = collections.Counter(flatten(mylist))
result = {i:r.get(i, 0) for i in myconcepts}
Output:
{'method': 2, 'standing': 2}
Edit: to count records rather than occurrences:
result = {i:sum(i in b for _, b in mylist) for i in myconcepts}
Output:
{'method': 2, 'standing': 2}
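Note that the two snippets above measure different things once a word repeats inside a single record. The following is a minimal sketch of that difference, using the question's data (where "method" appears twice in the last record). The names total_occurrences and records_containing are hypothetical, chosen only for this illustration.

```python
import collections

mylist = [
    [5274919, ["report", "porcelain", "firing", "technic"]],
    [5274920, ["implantology", "dentistry"]],
    [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
    [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]],
]
myconcepts = ["method", "standing"]

# Counts every occurrence, so repeats inside one record are counted again.
all_words = collections.Counter(word for _, words in mylist for word in words)
total_occurrences = {c: all_words[c] for c in myconcepts}

# Counts each record at most once per concept.
records_containing = {c: sum(c in words for _, words in mylist) for c in myconcepts}

print(total_occurrences)   # {'method': 3, 'standing': 2}
print(records_containing)  # {'method': 2, 'standing': 2}
```

If the goal is "how many records mention the concept", as the question describes, the second form is the one that matches.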
Adapted from Method 3 at https://www.geeksforgeeks.org/python-count-the-sublists-contains-given-element-in-a-list/
from itertools import chain
from collections import Counter
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
myconcepts = ["method", "standing"]
def countList(lst, x):
    " Counts number of times item x appears in sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))[x]
# Use dictionary comprehension to apply countList to concept list
result = {x:countList(mylist, x) for x in myconcepts}
print(result) # {'method':2, 'standing':2}
*Revised current method (computes the counts only once)*
def count_occurences(lst):
    " Number of counts of each item in all sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))
cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
print(result) # {'method':2, 'standing':2}
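With about 5 million records, the Counter above also stores a count for every word that is not a concept. One possible variation, sketched here under the assumption that only the words in myconcepts matter, is to filter against a set of concepts while counting. The names count_records_with_concepts and concept_set are hypothetical.

```python
from collections import Counter

def count_records_with_concepts(records, concepts):
    """Return {concept: number of records whose word list contains it}."""
    concept_set = set(concepts)  # average O(1) membership tests
    counts = Counter()
    for _, words in records:
        # set(words) collapses repeats, so a record counts once per concept
        counts.update(w for w in set(words) if w in concept_set)
    # Counter returns 0 for concepts that never appeared
    return {c: counts[c] for c in concepts}

mylist = [
    [5274919, ["report", "porcelain", "firing", "technic"]],
    [5274920, ["implantology", "dentistry"]],
    [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
    [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]],
]
myconcepts = ["method", "standing", "missing"]

result = count_records_with_concepts(mylist, myconcepts)
print(result)  # {'method': 2, 'standing': 2, 'missing': 0}
```

This still makes a single pass over the records, but the Counter never grows beyond the number of concepts.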
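Performance (comparing the posted methods in a Jupyter Notebook)
The results show that this method is close to the method posted by Barmar (i.e. 36 µs vs 42 µs).
The revision to the current method cut the time roughly in half (i.e. from 36 µs to 19 µs). With more concepts (the question has > 1000 concepts), this improvement should be even more significant.
However, on this small sample the original method is faster, at 2.55 µs per loop.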
Current method:
%timeit { x:countList(mylist, x) for x in myconcepts}
#10000 loops, best of 3: 36.6 µs per loop
Revised current method:
%%timeit
cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
# 10000 loops, best of 3: 19.4 µs per loop
Method 2 (from Barmar's post)
%%timeit
r = collections.Counter(flatten(mylist))
{i:r.get(i, 0) for i in myconcepts}
# 10000 loops, best of 3: 42.7 µs per loop
Method 3 (original method)
%%timeit
result = {}
for concept in myconcepts:
    mycounting = 0
    for item in mylist:
        if concept in item[1]:
            mycounting = mycounting + 1
    result[concept] = mycounting
# 100000 loops, best of 3: 2.55 µs per loop
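The timings above were taken on the four-record sample, where the nested loop wins because it has almost no setup cost. They may not carry over to millions of records and thousands of concepts, since the nested loop does work proportional to (number of concepts × number of records). Below is a rough sketch for checking this on synthetic data. The vocabulary, the sizes, and the function names are all made up for illustration, and actual timings will depend on your machine and data.

```python
import random
import timeit
from collections import Counter
from itertools import chain

random.seed(0)
vocab = [f"w{i}" for i in range(5000)]   # hypothetical vocabulary
records = [[i, random.sample(vocab, 8)] for i in range(10000)]
concepts = vocab[:100]                   # hypothetical concept list

def nested_loops(records, concepts):
    # Original approach: one scan over every record per concept
    result = {}
    for concept in concepts:
        n = 0
        for item in records:
            if concept in item[1]:
                n += 1
        result[concept] = n
    return result

def single_pass(records, concepts):
    # Count once over all records, then look up each concept
    cnts = Counter(chain.from_iterable(set(item[1]) for item in records))
    return {c: cnts[c] for c in concepts}

# Both approaches must agree before their timings are worth comparing
assert nested_loops(records, concepts) == single_pass(records, concepts)

for fn in (nested_loops, single_pass):
    seconds = timeit.timeit(lambda: fn(records, concepts), number=1)
    print(f"{fn.__name__}: {seconds:.3f} s")
```

Increasing the number of records and concepts in this sketch is a quick way to see how each approach scales before running it on the full dataset.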
Change the concept lists to sets so that lookups are O(1). You can then use intersection to count the number of matches in each set.
mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}],
    [5274920, {"implantology", "dentistry"}],
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}],
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"}]
]
myconcepts = {"method", "standing"}
mycounting = 0
for item in mylist:
    # item[1] is a set, so the intersection keeps only the concepts present in this record
    mycounting += len(set.intersection(myconcepts, item[1]))
print(mycounting)  # 4 (total matches across all records)
If you want the count for each concept separately, you need to loop over `myconcepts` and use the `in` operator. You can put the results in a dictionary.
mycounting = {concept: sum(1 for l in mylist if concept in l[1]) for concept in myconcepts}
print(mycounting)  # {'standing': 2, 'method': 2}
This is still more efficient than using lists, because `concept in l[1]` is O(1) on average when `l[1]` is a set.
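If you also need to know which records contain each concept (the question lists the record IDs in its example), another option is to build an inverted index once and then answer each concept with a dictionary lookup. This is only a sketch: the names index and matches are hypothetical, it assumes record IDs are unique, and for 5 million records the index may need a lot of memory.

```python
from collections import defaultdict

mylist = [
    [5274919, ["report", "porcelain", "firing", "technic"]],
    [5274920, ["implantology", "dentistry"]],
    [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
    [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]],
]
myconcepts = ["method", "standing"]

# Build word -> set of record IDs in one pass over the data.
# Using a set means a repeated word in one record is stored only once.
index = defaultdict(set)
for record_id, words in mylist:
    for word in words:
        index[word].add(record_id)

# .get avoids creating empty entries in the index for unknown concepts
matches = {c: index.get(c, set()) for c in myconcepts}
counts = {c: len(ids) for c, ids in matches.items()}

print(counts)                     # {'method': 2, 'standing': 2}
print(sorted(matches["method"]))  # [52749, 5274923]
```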
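As far as I know, programs written in Python run quite slowly because the compiler is very lazy. So I don't think there is an easy way to solve this, other than switching to a different programming language.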