Fastest way to check whether a value exists more often than X in a list

I have a long list (300 000 elements) and I want to check whether each element in that list occurs more than 5 times. The simplest code is:

[x for x in x_list if x_list.count(x) > 5]

However, I do not need to know exactly how often x appears in the list; I can stop counting after reaching at least 5 occurrences. I also do not need to run the check for every element in x_list, since I may already have checked the value x earlier while going through the list. Any idea how to get an optimal version of this code? My output should be a list, in the same order if possible...

Here is a Counter-based solution:

from collections import Counter

items = [2,3,4,1,2,3,4,1,2,1,3,4,4,1,2,4,3,1,4,3,4,1,2,1]
counts = Counter(items)
print(all(c >= 5 for c in counts.values())) #prints True

If I use

import random
items = [random.randint(1,1000) for i in range(300000)]

the Counter-based solution still runs in a fraction of a second.
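The check above answers whether every value occurs at least five times, whereas the question asks for a list of the elements that occur more than five times, in their original order. A minimal sketch of that, reusing the same Counter idea; the names `keep_frequent` and `frequent_values` and the `threshold` parameter are illustrative and not part of the original answer:

```python
from collections import Counter

def keep_frequent(items, threshold=5):
    """Return the elements that occur more than `threshold` times, in original order."""
    counts = Counter(items)                              # one pass to count
    return [x for x in items if counts[x] > threshold]   # one pass to filter

def frequent_values(items, threshold=5):
    """Like keep_frequent, but each qualifying value is listed once, in first-seen order."""
    counts = Counter(items)
    # dict.fromkeys keeps insertion order (Python 3.7+), so it de-duplicates in order.
    return [x for x in dict.fromkeys(items) if counts[x] > threshold]

items = [2,3,4,1,2,3,4,1,2,1,3,4,4,1,2,4,3,1,4,3,4,1,2,1]
print(frequent_values(items))     # [4, 1]
print(len(keep_frequent(items)))  # 14
```

Both functions visit the list a fixed number of times, so the cost grows linearly with the list length rather than quadratically.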

Believe it or not, just doing a regular loop is much more efficient:

Data is generated via:

import random
N = 300000
arr = [random.random() for i in range(N)]
#and random ints are generated: arr = [random.randint(1,1000) for i in range(N)]

A regular loop runs in 0.22 seconds on random floats and in 0.12 seconds on ints (very comparable to collections), measured on a 2.4 GHz processor.

di = {}
for item in arr:
    if item in di:
        di[item] += 1
    else:
        di[item] = 1
print (min(di.values()) > 5)

Your version takes more than 30 seconds, with or without integers.

[x for x in arr if arr.count(x) > 5]

Using collections takes about 0.33 seconds, or 0.11 seconds if I use integers.

from collections import Counter

counts = Counter(arr)
print(all(c >= 5 for c in counts.values()))

Finally, this takes more than 30 seconds, with or without integers:

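def atLeastFiveOfEach(x_list):
    # count[i] holds the number of occurrences of i; needs non-negative ints
    count = [0]*(max(x_list)+1)
    for x in x_list:
        count[x] += 1
    return [index for index, value in enumerate(count) if value >= 5]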

If you are looking for a more optimized way, you can use the numpy.unique() method, which is by far faster than pure-Python methods for large arrays like the one you're dealing with:

import numpy as np
(np.unique(arr, return_counts=True)[1] > 5).all()

Also, as a Pythonic way, you can use collections.defaultdict() as follows:

In [56]: from collections import defaultdict

In [57]: def check_defaultdict(arr):                                   
             di = defaultdict(int)
             for item in arr:
                 di[item] += 1
             return (min(di.values()) > 5)
   ....: 

Here is a benchmark against the other methods:

In [39]: %timeit (np.unique(arr, return_counts=True)[1] > 5).all()
100 loops, best of 3: 18.8 ms per loop

In [58]: %timeit check_defaultdict(arr)
10 loops, best of 3: 46.1 ms per loop
"""
In [42]: def check(arr):
             di = {}
             for item in arr:
                 if item in di:
                    di[item] += 1
                 else:
                    di[item] = 1
             return (min(di.values()) > 5)
   ....:          
"""
In [43]: %timeit check(arr)
10 loops, best of 3: 56.6 ms per loop

In [38]: %timeit all(c >= 5 for c in Counter(arr).values())
10 loops, best of 3: 89.5 ms per loop
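The timings above come from IPython's `%timeit`. A rough equivalent using only the standard-library `timeit` module might look like the sketch below. The function names, the fixed seed, and the repeat counts are assumptions made for illustration, and all three functions use the same `> 5` test so that their results are comparable; absolute numbers will differ by machine.

```python
import random
import timeit
from collections import Counter, defaultdict

def check_dict(arr):
    di = {}
    for item in arr:
        if item in di:
            di[item] += 1
        else:
            di[item] = 1
    return min(di.values()) > 5

def check_defaultdict(arr):
    di = defaultdict(int)
    for item in arr:
        di[item] += 1
    return min(di.values()) > 5

def check_counter(arr):
    return min(Counter(arr).values()) > 5

random.seed(0)  # fixed seed so repeated runs use the same data
arr = [random.randint(1, 1000) for _ in range(300000)]

for fn in (check_dict, check_defaultdict, check_counter):
    # Best of 3 repeats, 3 calls per repeat; report the time per call.
    best = min(timeit.repeat(lambda: fn(arr), number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {best * 1000:.1f} ms per call")
```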

To count all elements you could do something like this:

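def atLeastFiveOfEach(x_list):
    # Assumes x_list contains only non-negative integers.
    count = [0]*(max(x_list)+1)
    for x in x_list:
        count[x] += 1
    # Skip values that never occur (count 0); they are not in x_list.
    return all(c >= 5 for c in count if c > 0)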

This gives you a list, count, where count[i] is the number of occurrences of i in x_list.

If you want a list of all those elements, you can do it like this:

def atLeastFiveOfEach(x_list):
    count = [0]*(max(x_list)+1)
    for x in x_list:
        count[x] += 1
    return [index for index, value in enumerate(count) if value >= 5]

To explain a little why this is so much faster:

In your method, you pick the first element and go through the whole list to count how many elements are equal to it. Then you take the second element and traverse the whole list again. You go through the whole list once for each element.

This method, on the other hand, goes through the list only once. That's why it is much faster.
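To make the difference concrete, the sketch below counts how many element visits each approach performs on a small list. The function names and the explicit visit counters are illustrative only. The first function mimics what `list.count` does rather than calling it, and the second uses a dict for the counts instead of the index-based list above, so that it works for any hashable values.

```python
def visits_quadratic(x_list, threshold=5):
    """Mimic [x for x in x_list if x_list.count(x) > threshold], counting visits."""
    visits = 0
    result = []
    for x in x_list:
        occurrences = 0
        for y in x_list:          # a full scan per element, like list.count
            visits += 1
            if y == x:
                occurrences += 1
        if occurrences > threshold:
            result.append(x)
    return result, visits

def visits_single_pass(x_list, threshold=5):
    """Count with a dict first, then filter, counting visits."""
    visits = 0
    counts = {}
    for x in x_list:              # one counting pass
        visits += 1
        counts[x] = counts.get(x, 0) + 1
    result = []
    for x in x_list:              # one filtering pass
        visits += 1
        if counts[x] > threshold:
            result.append(x)
    return result, visits

data = [1] * 6 + [2] * 3 + [3] * 7    # 16 elements
slow_result, slow_visits = visits_quadratic(data)
fast_result, fast_visits = visits_single_pass(data)
print(slow_result == fast_result)     # True
print(slow_visits, fast_visits)       # 256 32
```

For 16 elements that is 16 × 16 = 256 visits against 2 × 16 = 32. For 300 000 elements the same arithmetic gives roughly 9 × 10^10 visits against 6 × 10^5.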

Use itertools.islice. It returns only selected items from an iterable.

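from itertools import islice

def has_at_least_n(iterable, item, n=5):
    matches = (i for i in iterable if i == item)
    # Skip the first n-1 matches; an n-th match exists only if next() finds one.
    missing = object()  # sentinel, so a falsy item such as 0 still counts as found
    return next(islice(matches, n-1, None), missing) is not missing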
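One way to apply an early-exit check like this to the whole list, while also skipping values that were already checked, is to cache the verdict per value. This is a sketch only: `occurs_more_than` and `filter_frequent` are hypothetical names, and the check is re-implemented here (testing for more than n occurrences, as the question asks) so that the block stands alone.

```python
from itertools import islice

def occurs_more_than(iterable, item, n=5):
    """True if item occurs more than n times; stops scanning at the (n+1)-th match."""
    matches = (True for i in iterable if i == item)
    # Skip n matches; if one more exists next() yields True, otherwise the default False.
    return next(islice(matches, n, None), False)

def filter_frequent(x_list, n=5):
    """Keep the elements that occur more than n times, in original order."""
    verdict = {}                  # value -> bool, so each distinct value is scanned once
    result = []
    for x in x_list:
        if x not in verdict:
            verdict[x] = occurs_more_than(x_list, x, n)
        if verdict[x]:
            result.append(x)
    return result

print(filter_frequent([0] * 6 + [7] * 4 + [0, 7]))   # [0, 0, 0, 0, 0, 0, 0]
```

The early exit only helps for values that reach the threshold quickly. A value that falls short still costs a full scan, so with many distinct values a single counting pass is likely to be faster.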

From the Python documentation, here is what it says about itertools.islice:

Make an iterator that returns selected elements from the iterable. If start is non-zero, then elements from the iterable are skipped until start is reached. Afterward, elements are returned consecutively unless step is set higher than one which results in items being skipped. If stop is None, then iteration continues until the iterator is exhausted, if at all; otherwise, it stops at the specified position. Unlike regular slicing, islice() does not support negative values for start, stop, or step. Can be used to extract related fields from data where the internal structure has been flattened (for example, a multi-line report may list a name field on every third line).
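A few small calls illustrate the start, stop, and step behaviour described in that passage. The `report` list is made-up data standing in for the flattened multi-line report mentioned in the quote.

```python
from itertools import islice

letters = "ABCDEFG"

print(list(islice(letters, 2)))           # stop only:        ['A', 'B']
print(list(islice(letters, 2, 4)))        # start and stop:   ['C', 'D']
print(list(islice(letters, 2, None)))     # start to the end: ['C', 'D', 'E', 'F', 'G']
print(list(islice(letters, 0, None, 2)))  # every second one: ['A', 'C', 'E', 'G']

# A flattened report: name, age, city, name, age, city, ...
report = ["Ann", "31", "Oslo", "Bob", "45", "Lima"]
names = list(islice(report, 0, None, 3))  # the name field is on every third line
print(names)                              # ['Ann', 'Bob']
```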

From Moses Koledoye's answer here:
