
Fastest way to check whether a value exists more often than X in a list

I have a long list (300 000 elements) and I want to check that each element in that list occurs more than 5 times. The simplest code is:

[x for x in x_list if x_list.count(x) > 5]

However, I do not need the exact count of x: I can stop counting once I have seen it more than 5 times. I also do not need to re-check every element of x_list, since I may already have checked the value x earlier in the list. Any idea how to get an optimal version of this code? My output should be a list, in the same order if possible...

Here is the Counter-based solution:

from collections import Counter

items = [2,3,4,1,2,3,4,1,2,1,3,4,4,1,2,4,3,1,4,3,4,1,2,1]
counts = Counter(items)
print(all(c >= 5 for c in counts.values())) #prints True

If I use

import random
items = [random.randint(1,1000) for i in range(300000)]

With this input, the Counter-based solution still runs in a fraction of a second.
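The check above only answers True or False. If the goal is the filtered list from the question, in its original order, the same counts can be reused in a second pass. The following is a minimal sketch; the function name keep_frequent and the threshold parameter are illustrative assumptions, not part of the original answer.

```python
from collections import Counter

def keep_frequent(x_list, threshold=5):
    """Return the elements occurring more than `threshold` times, in original order."""
    counts = Counter(x_list)  # one pass to count every value
    # Second pass: each lookup is a dict access, so the filter stays linear overall.
    return [x for x in x_list if counts[x] > threshold]

sample = [1, 2, 1, 1, 3, 1, 1, 2, 1]
print(keep_frequent(sample))     # [1, 1, 1, 1, 1, 1]
print(keep_frequent(sample, 1))  # [1, 2, 1, 1, 1, 1, 2, 1]
```

This produces the same list as the x_list.count(x) comprehension, but scans the list twice in total rather than once per element.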

Believe it or not, a regular loop is about as fast as Counter and far more efficient than the list.count version:

Data is generated via:

import random
N = 300000
arr = [random.random() for i in range(N)]
#and random ints are generated: arr = [random.randint(1,1000) for i in range(N)]

A regular loop runs in 0.22 seconds, or 0.12 seconds if I use ints (very comparable to collections.Counter), on a 2.4 GHz processor.

di = {}
for item in arr:
    if item in di:
        di[item] += 1
    else:
        di[item] = 1
print (min(di.values()) > 5)

Your version takes more than 30 seconds, with or without integers.

[x for x in arr if arr.count(x) > 5]

Using collections.Counter takes about 0.33 seconds, or 0.11 seconds if I use integers.

from collections import Counter

counts = Counter(arr)
print(all(c >= 5 for c in counts.values()))

Finally, this takes more than 30 seconds with or without integers:

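def atLeastFiveOfEach(x_list):
    # Values are used as list indices, so they must be non-negative integers.
    count = [0]*(max(x_list)+1)
    for x in x_list:
        count[x] += 1
    return [index for index, value in enumerate(count) if value >= 5]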

If you are looking for a more optimized way, you can use the numpy.unique() function, which is much faster than pure-Python methods for large arrays like the one you're dealing with:

import numpy as np
(np.unique(arr, return_counts=True)[1] > 5).all()

As a more Pythonic alternative, you can also use collections.defaultdict() as follows:

In [56]: from collections import defaultdict

In [57]: def check_defaultdict(arr):                                   
             di = defaultdict(int)
             for item in arr:
                 di[item] += 1
             return (min(di.values()) > 5)
   ....: 

Here is a benchmark with other methods:

In [39]: %timeit (np.unique(arr, return_counts=True)[1] > 5).all()
100 loops, best of 3: 18.8 ms per loop

In [58]: %timeit check_defaultdict(arr)
10 loops, best of 3: 46.1 ms per loop

In [42]: def check(arr):
             di = {}
             for item in arr:
                 if item in di:
                     di[item] += 1
                 else:
                     di[item] = 1
             return (min(di.values()) > 5)
   ....:

In [43]: %timeit check(arr)
10 loops, best of 3: 56.6 ms per loop

In [38]: %timeit all(c >= 5 for c in Counter(arr).values())
10 loops, best of 3: 89.5 ms per loop
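The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. In this sketch the names check_counter and check_dict, the fixed seed, and the use of the same "> 5" threshold in every function are assumptions made for illustration; absolute timings will differ from machine to machine.

```python
import random
import timeit
from collections import Counter, defaultdict

def check_counter(arr):
    return all(c > 5 for c in Counter(arr).values())

def check_dict(arr):
    di = {}
    for item in arr:
        if item in di:
            di[item] += 1
        else:
            di[item] = 1
    return min(di.values()) > 5

def check_defaultdict(arr):
    di = defaultdict(int)  # missing keys start at 0
    for item in arr:
        di[item] += 1
    return min(di.values()) > 5

random.seed(0)  # arbitrary seed, only to make runs repeatable
arr = [random.randint(1, 1000) for _ in range(300000)]

for fn in (check_counter, check_dict, check_defaultdict):
    # Best of 3 single runs, reported in milliseconds.
    best = min(timeit.repeat(lambda: fn(arr), number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1000:.1f} ms")
```

Taking the minimum of several repeats mirrors the "best of 3" figure that %timeit reports.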

To count all elements you could do something like this:

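def atLeastFiveOfEach(x_list):
    # Assumes x_list holds non-negative integers, since values are used as indices.
    count = [0]*(max(x_list)+1)
    for x in x_list:
        count[x] += 1
    # Skip zero counts: those indices never occur in x_list at all.
    return all(c >= 5 for c in count if c)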

This builds a list, count, where count[i] is the number of occurrences of i in x_list.

If you want a list of all those elements, you can do like this:

def atLeastFiveOfEach(x_list):
    count = [0]*(max(x_list)+1)
    for x in x_list:
        count[x] += 1
    # Each qualifying value is returned once, in ascending order.
    return [index for index, value in enumerate(count) if value >= 5]

To explain a little bit why this is so much faster:

In your method, you pick the first element and go through the whole list to count how many elements equal it. Then you take the second element and traverse the whole list again. You go through the whole list once FOR EACH element.

This method, on the other hand, goes through the list only once. That's why it is much faster.

Use itertools.islice. It returns only selected items from an iterable.

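from itertools import islice

def has_at_least_n(iterable, item, n=5):
    missing = object()  # sentinel, so a falsy item such as 0 is not mistaken for "not found"
    # Lazily yield only the elements equal to item.
    matches = (i for i in iterable if i == item)
    # Skip the first n-1 matches; if an n-th one exists, item occurs at least n times.
    return next(islice(matches, n-1, None), missing) is not missing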

Here is what the Python documentation says about itertools.islice:

Make an iterator that returns selected elements from the iterable. If start is non-zero, then elements from the iterable are skipped until start is reached. Afterward, elements are returned consecutively unless step is set higher than one which results in items being skipped. If stop is None, then iteration continues until the iterator is exhausted, if at all; otherwise, it stops at the specified position. Unlike regular slicing, islice() does not support negative values for start, stop, or step. Can be used to extract related fields from data where the internal structure has been flattened (for example, a multi-line report may list a name field on every third line)
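To illustrate the early exit that islice makes possible, here is a small self-contained sketch. The helper name occurs_at_least and the _MISSING sentinel are hypothetical names chosen for this example. Passing an iterator rather than a list makes it visible that scanning stops as soon as the n-th match is found.

```python
from itertools import islice

_MISSING = object()  # sentinel distinguishing "no n-th match" from a falsy item

def occurs_at_least(iterable, item, n=5):
    """Return True if `item` occurs at least `n` times, stopping at the n-th match."""
    matches = (x for x in iterable if x == item)
    # islice skips n-1 matches, then next() asks for exactly one more.
    return next(islice(matches, n - 1, None), _MISSING) is not _MISSING

data = iter([1, 7, 1, 1, 2, 1, 1, 9, 1])
print(occurs_at_least(data, 1, n=5))  # True
print(list(data))                     # [9, 1] were never examined
```

When the item occurs fewer than n times, the whole iterable is still consumed, so the saving only applies to values that do reach the threshold.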

From Moses Koledoye's answer here:

