How to efficiently check if an element is in a list of lists in python

I have a list of lists as follows.

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]

I also have a list of concepts as follows.

myconcepts = ["method", "standing"]

I want to count how many records in mylist contain each concept in myconcepts, i.e.:

"method" = 2 times in records (i.e. in `52749` and `5274923`)
"standing" = 2 times in records

My current code is as follows.

mycounting = 0
for concept in myconcepts:
  for item in mylist:
     if concept in item[1]:
       mycounting = mycounting + 1
print(mycounting)

However, my actual mylist is very large, with about 5 million records, and myconcepts has about 10000 concepts.

With my current code it takes nearly 1 minute to get the count for a single concept, which is very slow.

I would like to know the most efficient way of doing this in Python.

For testing purposes I have attached a small portion of my dataset at: https://drive.google.com/file/d/1z6FsBtLyDZClod9hK8nK4syivZToa7ps/view?usp=sharing

I am happy to provide more details if needed.

You can flatten the input and then use collections.Counter:

import collections
myconcepts = ["method", "standing"]
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [5274921, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"]]]
def flatten(d):
  for i in d:
    yield from [i] if not isinstance(i, list) else flatten(i)

r = collections.Counter(flatten(mylist))
result = {i:r.get(i, 0) for i in myconcepts}

Output:

{'method': 2, 'standing': 2}

Edit: record lookup (count the records that contain each concept, rather than total occurrences):

result = {i:sum(i in b for _, b in mylist) for i in myconcepts}

Output:

{'method': 2, 'standing': 2}
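The two snippets above agree on this answer's copy of the data, but the question's sample lists "method" twice in record 5274923, and there the flattened count and the per-record count differ. A minimal sketch of that difference, using the question's data (the names `total_counts` and `record_counts` are illustrative, not from the original posts):

```python
from collections import Counter

# Sample data copied from the question; record 5274923 lists "method" twice.
mylist = [
    [5274919, ["report", "porcelain", "firing", "technic"]],
    [5274920, ["implantology", "dentistry"]],
    [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
    [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]],
]
myconcepts = ["method", "standing"]

# Total occurrences: every repeated word is counted.
occurrences = Counter(word for _, words in mylist for word in words)
total_counts = {c: occurrences[c] for c in myconcepts}

# Record counts: set(words) removes repeats within a record first.
per_record = Counter(word for _, words in mylist for word in set(words))
record_counts = {c: per_record[c] for c in myconcepts}

print(total_counts)   # {'method': 3, 'standing': 2}
print(record_counts)  # {'method': 2, 'standing': 2}
```

Since the question asks for the number of records, the deduplicated variant is probably the one that matches the expected output.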

Adapting approach 3 from https://www.geeksforgeeks.org/python-count-the-sublists-containing-given-element-in-a-list/

from itertools import chain 
from collections import Counter 

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]

myconcepts = ["method", "standing"]

def countList(lst, x):
    " Counts number of times item x appears in sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))[x] 

# Use dictionary comprehension to apply countList to concept list
result = {x:countList(mylist, x) for x in myconcepts}
print(result) # {'method':2, 'standing':2}

Revised current method (compute counts only once)

def count_occurences(lst):
    " Number of counts of each item in all sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))

cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
print(result) # {'method':2, 'standing':2}

Performance (comparing posted methods using Jupyter Notebook)

Results show that this method and the method Barmar posted are close (i.e. 36 vs 42 µs).

The improvement to the current method cut the time approximately in half (i.e. from 36 µs to 19 µs). This improvement should be even more substantial for a larger number of concepts (i.e. when the problem has > 1000 concepts).

However, on this small sample the original method is the fastest at 2.55 µs per loop.

Current method

%timeit { x:countList(mylist, x) for x in myconcepts}
#10000 loops, best of 3: 36.6 µs per loop

Revised current method:

%%timeit
cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
# 10000 loops, best of 3: 19.4 µs per loop

Method 2 (from Barmar's post)

%%timeit
r = collections.Counter(flatten(mylist))
{i:r.get(i, 0) for i in myconcepts}
# 10000 loops, best of 3: 42.7 µs per loop

Method 3 (Original Method)

%%timeit

result = {}
for concept in myconcepts:
  mycounting = 0
  for item in mylist:
     if concept in item[1]:
       mycounting = mycounting + 1
  result[concept] = mycounting
  # 100000 loops, best of 3: 2.55 µs per loop
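The timings above use the four-record sample with two concepts, where the fixed cost of building a Counter dominates. The nested loops do work proportional to concepts × records, while the single pass is proportional to the total number of words, so the ranking may change as the data grows. A rough sketch for checking this on larger input follows; the synthetic data, the sizes, and the function names `nested_loops` and `single_pass` are assumptions for illustration, not the asker's dataset.

```python
import random
import timeit
from collections import Counter
from itertools import chain

# Synthetic data (an assumption, not the asker's dataset).
random.seed(0)
vocab = [f"w{i}" for i in range(5000)]
records = [[i, random.sample(vocab, 8)] for i in range(2000)]
concepts = vocab[:200]

def nested_loops(records, concepts):
    """Original approach: scan every record once per concept."""
    result = {}
    for concept in concepts:
        count = 0
        for _, words in records:
            if concept in words:
                count += 1
        result[concept] = count
    return result

def single_pass(records, concepts):
    """Revised approach: one Counter over deduplicated records."""
    counts = Counter(chain.from_iterable(set(words) for _, words in records))
    return {c: counts[c] for c in concepts}

# Both approaches must agree before their timings are worth comparing.
assert nested_loops(records, concepts) == single_pass(records, concepts)

for fn in (nested_loops, single_pass):
    seconds = timeit.timeit(lambda: fn(records, concepts), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per run")
```

Raising the number of records and concepts toward the sizes in the question should make the difference between the two approaches easier to see.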

Change the concept lists to sets, so that searching will be O(1). You can then use intersection to count the number of matches in each set.

mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}], 
    [5274920, {"implantology", "dentistry"}], 
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}], 
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"}]
]
myconcepts = {"method", "standing"}
mycounting = 0
for item in mylist:
    mycounting += len(set.intersection(myconcepts, item[1]))
print(mycounting)

If you want to get the counts for each concept separately, you'll need to loop over myconcepts, then use the in operator. You can put the results in a dictionary.

mycounting = {concept: sum(1 for l in mylist if concept in l[1]) for concept in myconcepts}
print(mycounting)  # {'standing': 2, 'method': 2}

This will still be more efficient than using a list, because concept in l[1] is O(1).
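The dictionary comprehension still visits every record once per concept. One possible variation, sketched here as an assumption rather than as part of the original answer, is to loop over the records once and intersect each record's words with the concept set, so the work is roughly proportional to the total number of words. The name `counts` is illustrative.

```python
# Sample data from the question; the word lists are left as plain lists.
mylist = [
    [5274919, ["report", "porcelain", "firing", "technic"]],
    [5274920, ["implantology", "dentistry"]],
    [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
    [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]],
]
myconcepts = {"method", "standing"}

# Start every concept at zero so concepts that never appear are still reported.
counts = dict.fromkeys(myconcepts, 0)

for _, words in mylist:
    # intersection() accepts any iterable and returns a set, so a concept
    # repeated inside one record is counted only once for that record.
    for concept in myconcepts.intersection(words):
        counts[concept] += 1

print(sorted(counts.items()))  # [('method', 2), ('standing', 2)]
```

Because myconcepts is a set, each membership test inside the intersection is O(1) on average, and the records do not need to be converted to sets beforehand.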

As far as I know, pure-Python loops like this are very slow, so I don't think there is a simple way to fix that other than switching to another programming language.

