How to efficiently check if an element is in a list of lists in python
I have a list of lists as follows.
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
I also have a list of concepts as follows.
myconcepts = ["method", "standing"]
I want to see how many times each concept in myconcepts appears in the mylist records, i.e.:
"method" = 2 times in records (i.e. in `52749` and `5274923`)
"standing" = 2 times in records
My current code is as follows.
mycounting = 0
for concept in myconcepts:
    for item in mylist:
        if concept in item[1]:
            mycounting = mycounting + 1
print(mycounting)
However, my actual mylist is very large, with about 5 million records, and the myconcepts list has about 10,000 concepts.
With my current code it takes nearly 1 minute per concept to get the count, which is very slow.
I would like to know the most efficient way of doing this in Python.
For testing purposes I have attached a small portion of my dataset at: https://drive.google.com/file/d/1z6FsBtLyDZClod9hK8nK4syivZToa7ps/view?usp=sharing
I am happy to provide more details if needed.
You can flatten the input and then use collections.Counter:
import collections

myconcepts = ["method", "standing"]
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [5274921, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"]]]

def flatten(d):
    for i in d:
        yield from [i] if not isinstance(i, list) else flatten(i)

r = collections.Counter(flatten(mylist))
result = {i: r.get(i, 0) for i in myconcepts}
Output:
{'method': 2, 'standing': 2}
Edit: record lookup:
result = {i:sum(i in b for _, b in mylist) for i in myconcepts}
Output:
{'method': 2, 'standing': 2}
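Note that the two snippets answer subtly different questions: Counter over the flattened list counts every token occurrence, while the record lookup counts records. With the question's original data, where "method" appears twice inside record 5274923, the results diverge; a quick check:

```python
import collections

# The question's original data: "method" occurs twice in record 5274923.
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]],
          [5274920, ["implantology", "dentistry"]],
          [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
          [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
myconcepts = ["method", "standing"]

def flatten(d):
    for i in d:
        yield from [i] if not isinstance(i, list) else flatten(i)

# Token counting: duplicates inside one record are counted each time.
token_counts = collections.Counter(flatten(mylist))
print({i: token_counts.get(i, 0) for i in myconcepts})  # {'method': 3, 'standing': 2}

# Record counting: each record contributes at most 1 per concept.
record_counts = {i: sum(i in b for _, b in mylist) for i in myconcepts}
print(record_counts)  # {'method': 2, 'standing': 2}
```

Use the record-lookup variant if, as in the question, the goal is the number of records containing each concept.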
Adapting approach 3 from https://www.geeksforgeeks.org/python-count-the-sublists-containing-given-element-in-a-list/
from itertools import chain
from collections import Counter

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
myconcepts = ["method", "standing"]

def countList(lst, x):
    """Counts the number of sublists in which item x appears."""
    return Counter(chain.from_iterable(set(i[1]) for i in lst))[x]

# Use a dictionary comprehension to apply countList to the concept list
result = {x: countList(mylist, x) for x in myconcepts}
print(result)  # {'method': 2, 'standing': 2}
*Revised current method (compute counts only once)*
def count_occurences(lst):
    """Number of sublists in which each item appears."""
    return Counter(chain.from_iterable(set(i[1]) for i in lst))

cnts = count_occurences(mylist)
result = {x: cnts[x] for x in myconcepts}
print(result)  # {'method': 2, 'standing': 2}
Performance (comparing posted methods using Jupyter Notebook)
Results show that this method and the method Barmar posted are close (i.e. 36 vs 42 µs).
The revision to the current method roughly halves the time (from 36 µs to 19 µs). This improvement should be even more substantial for a larger number of concepts (i.e. problems with > 1000 concepts).
However, the original method is faster at 2.55 µs/loop.
Current method:
%timeit {x: countList(mylist, x) for x in myconcepts}
# 10000 loops, best of 3: 36.6 µs per loop
Revised current method:
%%timeit
cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
# 10000 loops, best of 3: 19.4 µs per loop
Method 2 (from Barmar's post):
%%timeit
r = collections.Counter(flatten(mylist))
{i:r.get(i, 0) for i in myconcepts}
# 10000 loops, best of 3: 42.7 µs per loop
Method 3 (original method):
%%timeit
result = {}
for concept in myconcepts:
    mycounting = 0
    for item in mylist:
        if concept in item[1]:
            mycounting = mycounting + 1
    result[concept] = mycounting
# 100000 loops, best of 3: 2.55 µs per loop
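Outside of a Jupyter Notebook, the same comparison can be reproduced with the standard-library timeit module. A minimal sketch timing the revised method on the small sample data (absolute numbers will of course depend on the machine):

```python
import timeit
from collections import Counter
from itertools import chain

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]],
          [5274920, ["implantology", "dentistry"]],
          [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
          [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
myconcepts = ["method", "standing"]

def revised():
    # Compute the per-record membership counts once, then look up each concept.
    cnts = Counter(chain.from_iterable(set(i[1]) for i in mylist))
    return {x: cnts[x] for x in myconcepts}

# Average time per call over 10000 runs.
per_call = timeit.timeit(revised, number=10000) / 10000
print(revised())  # {'method': 2, 'standing': 2}
print(f"{per_call * 1e6:.1f} µs per loop")
```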
Change the concept lists to sets, so that searching will be O(1). You can then use intersection to count the number of matches in each set.
# No import is needed: set is a built-in type.
mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}],
    [5274920, {"implantology", "dentistry"}],
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}],
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"}]
]
myconcepts = {"method", "standing"}

mycounting = 0
for item in mylist:
    mycounting += len(set.intersection(myconcepts, item[1]))
print(mycounting)
If you want to get the counts for each concept separately, you'll need to loop over myconcepts, then use the in operator. You can put the results in a dictionary.
mycounting = {concept: sum(1 for l in mylist if concept in l[1]) for concept in myconcepts}
print(mycounting)  # {'standing': 2, 'method': 2}
This will still be more efficient than using a list, because concept in l[1] is O(1).
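The two answers above can also be combined into a single pass that yields per-concept record counts directly: keep the per-record word sets, intersect each with the concept set, and feed the intersections into a Counter. A minimal sketch (this combination is my own, not from either answer):

```python
from collections import Counter

# Sample data in the same shape as above: per-record word sets.
mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}],
    [5274920, {"implantology", "dentistry"}],
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}],
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"}]
]
myconcepts = {"method", "standing"}

# One pass over the records: each intersection is a set, so a record
# contributes at most 1 to each concept's count.
counts = Counter()
for _, words in mylist:
    counts.update(myconcepts & words)

print(sorted(counts.items()))  # [('method', 2), ('standing', 2)]
```

This visits each record once regardless of how many concepts there are, which should matter for the 5-million-record, 10,000-concept case described in the question.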
As far as I know, programming in Python is very slow because the interpreter does very little optimization. So I don't think there is a simple way to fix that, other than switching to another language.