How to efficiently check if an element is in a list of lists in python
I have a list of lists as follows.
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
I also have a list of concepts as follows.
myconcepts = ["method", "standing"]
I want to see how many times each concept in myconcepts appears in the mylist records, i.e.:
"method" = 2 times in records (i.e. in `52749` and `5274923`)
"standing" = 2 times in records
My current code is as follows.
mycounting = 0
for concept in myconcepts:
    for item in mylist:
        if concept in item[1]:
            mycounting = mycounting + 1
print(mycounting)
However, my actual mylist is very large, with about 5 million records, and the myconcepts list has about 10,000 concepts.
With my current code it takes nearly 1 minute per concept to get the count, which is very slow.
I would like to know the most efficient way of doing this in Python.
For testing purposes I have attached a small portion of my dataset at: https://drive.google.com/file/d/1z6FsBtLyDZClod9hK8nK4syivZToa7ps/view?usp=sharing
I am happy to provide more details if needed.
You can flatten the input and then use collections.Counter:
import collections

myconcepts = ["method", "standing"]
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [5274921, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"]]]

def flatten(d):
    for i in d:
        yield from [i] if not isinstance(i, list) else flatten(i)

r = collections.Counter(flatten(mylist))
result = {i: r.get(i, 0) for i in myconcepts}
Output:
{'method': 2, 'standing': 2}
Edit: record lookup:
result = {i:sum(i in b for _, b in mylist) for i in myconcepts}
Output:
{'method': 2, 'standing': 2}
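Note that the two snippets answer subtly different questions: Counter over the flattened list counts every token occurrence, while the record lookup counts records. With the question's original data, where "method" appears twice inside record 5274923, the results diverge; a quick check:

```python
import collections

# The question's original data: "method" occurs twice in record 5274923.
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]],
          [5274920, ["implantology", "dentistry"]],
          [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
          [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
myconcepts = ["method", "standing"]

def flatten(d):
    for i in d:
        yield from [i] if not isinstance(i, list) else flatten(i)

# Token counting: duplicates inside one record are counted each time.
token_counts = collections.Counter(flatten(mylist))
print({i: token_counts.get(i, 0) for i in myconcepts})  # {'method': 3, 'standing': 2}

# Record counting: each record contributes at most 1 per concept.
record_counts = {i: sum(i in b for _, b in mylist) for i in myconcepts}
print(record_counts)  # {'method': 2, 'standing': 2}
```

Use the record-lookup variant if, as in the question, the goal is the number of records containing each concept.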
Adapting approach 3 from https://www.geeksforgeeks.org/python-count-the-sublists-containing-given-element-in-a-list/
from itertools import chain
from collections import Counter

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
myconcepts = ["method", "standing"]

def countList(lst, x):
    """Counts the number of sublists in which item x appears."""
    return Counter(chain.from_iterable(set(i[1]) for i in lst))[x]

# Use a dictionary comprehension to apply countList to the concept list
result = {x: countList(mylist, x) for x in myconcepts}
print(result)  # {'method': 2, 'standing': 2}
*Revised current method (compute counts only once)*
def count_occurences(lst):
    """Number of sublists in which each item appears."""
    return Counter(chain.from_iterable(set(i[1]) for i in lst))

cnts = count_occurences(mylist)
result = {x: cnts[x] for x in myconcepts}
print(result)  # {'method': 2, 'standing': 2}
Performance (comparing posted methods using Jupyter Notebook)
Results show that this method and the method Barmar posted are close (i.e. 36 vs 42 µs).
The revision to the current method roughly halves the time (from 36 µs to 19 µs). This improvement should be even more substantial for a larger number of concepts (i.e. problems with > 1000 concepts).
However, the original method is faster at 2.55 µs/loop.
Current method:
%timeit {x: countList(mylist, x) for x in myconcepts}
# 10000 loops, best of 3: 36.6 µs per loop
Revised current method:
%%timeit
cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
# 10000 loops, best of 3: 19.4 µs per loop
Method 2 (from Barmar's post):
%%timeit
r = collections.Counter(flatten(mylist))
{i:r.get(i, 0) for i in myconcepts}
# 10000 loops, best of 3: 42.7 µs per loop
Method 3 (original method):
%%timeit
result = {}
for concept in myconcepts:
    mycounting = 0
    for item in mylist:
        if concept in item[1]:
            mycounting = mycounting + 1
    result[concept] = mycounting
# 100000 loops, best of 3: 2.55 µs per loop
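Outside of a Jupyter Notebook, the same comparison can be reproduced with the standard-library timeit module. A minimal sketch timing the revised method on the small sample data (absolute numbers will of course depend on the machine):

```python
import timeit
from collections import Counter
from itertools import chain

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]],
          [5274920, ["implantology", "dentistry"]],
          [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
          [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]
myconcepts = ["method", "standing"]

def revised():
    # Compute the per-record membership counts once, then look up each concept.
    cnts = Counter(chain.from_iterable(set(i[1]) for i in mylist))
    return {x: cnts[x] for x in myconcepts}

# Average time per call over 10000 runs.
per_call = timeit.timeit(revised, number=10000) / 10000
print(revised())  # {'method': 2, 'standing': 2}
print(f"{per_call * 1e6:.1f} µs per loop")
```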
Change the concept lists to sets, so that searching will be O(1). You can then use intersection to count the number of matches in each set.
# No import is needed: set is a built-in type.
mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}],
    [5274920, {"implantology", "dentistry"}],
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}],
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"}]
]
myconcepts = {"method", "standing"}

mycounting = 0
for item in mylist:
    mycounting += len(set.intersection(myconcepts, item[1]))
print(mycounting)
If you want to get the counts for each concept separately, you'll need to loop over myconcepts, then use the in operator. You can put the results in a dictionary.
mycounting = {concept: sum(1 for l in mylist if concept in l[1]) for concept in myconcepts}
print(mycounting)  # {'standing': 2, 'method': 2}
This will still be more efficient than using a list, because concept in l[1] is O(1).
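The two answers above can also be combined into a single pass that yields per-concept record counts directly: keep the per-record word sets, intersect each with the concept set, and feed the intersections into a Counter. A minimal sketch (this combination is my own, not from either answer):

```python
from collections import Counter

# Sample data in the same shape as above: per-record word sets.
mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}],
    [5274920, {"implantology", "dentistry"}],
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}],
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"}]
]
myconcepts = {"method", "standing"}

# One pass over the records: each intersection is a set, so a record
# contributes at most 1 to each concept's count.
counts = Counter()
for _, words in mylist:
    counts.update(myconcepts & words)

print(sorted(counts.items()))  # [('method', 2), ('standing', 2)]
```

This visits each record once regardless of how many concepts there are, which should matter for the 5-million-record, 10,000-concept case described in the question.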
As far as I know, programming in Python is very slow because the interpreter does very little optimization. So I don't think there is a simple way to fix that, other than switching to another language.