Fastest way to check if exactly n items in a list match a condition in python

If I have m items in a list, what is the fastest way to check if exactly n of those items in the list meet a certain condition? For example:

l = [1,2,3,4,5]

How would I check if any two items in the list match the condition x%2 == 0 ?

The naive approach would be to use nested for loops:

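def any_two_match(l):
    for a in range(len(l)):
        for b in range(a + 1, len(l)):  # never pair an item with itself
            if not l[a] % 2 and not l[b] % 2:
                return True
    return False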

But that is an incredibly inefficient way of checking, and would become especially ugly if I wanted to check for any 50,000 items in a list of 2-10 million items.

[Edited to reflect exact matching, which we can still accomplish with short-circuiting!]

I think you'd want this to short-circuit (stop when determined, not only at the end):

matched = 0
for i in l:
    if i%2 == 0:
        matched += 1
        if matched > 2: # we now have too many matches, stop checking
            break
if matched == 2:
    print("congratulations")
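The same loop can be wrapped in a reusable function. The following is only a minimal sketch: the names `exactly_n` and `is_even` and the parameter order are illustrative assumptions, not part of the original answer.

```python
def exactly_n(iterable, predicate, n):
    """Return True if exactly n items of iterable satisfy predicate.

    Stops scanning as soon as an (n + 1)-th match is seen.
    """
    matched = 0
    for item in iterable:
        if predicate(item):
            matched += 1
            if matched > n:  # too many matches, no need to look further
                return False
    return matched == n


def is_even(x):
    return x % 2 == 0


print(exactly_n([1, 2, 3, 4, 5], is_even, 2))   # True
print(exactly_n([2, 4, 6, 8, 10], is_even, 2))  # False
```

Returning from inside the loop plays the role of the `break` above, and the final comparison covers the "too few" case.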

If you wanted to do the query much faster on the same input data several times, you should use NumPy instead (with no short-circuiting):

import numpy as np

l = np.array([1,2,3,4,5])

if np.count_nonzero(l%2 == 0) == 2:
    print("congratulations")

This doesn't short-circuit, but it will be super-fast once the input array is constructed, so if you have a large input list and lots of queries to do on it, and the queries can't short-circuit very early, this will likely be faster. Potentially by an order of magnitude.

A sum solution adding up True values is correct, probably more efficient than an explicit loop, and definitely the most concise:

if sum(i % 2 == 0 for i in lst) == n:

However, it relies on understanding that in an integer context like addition, True counts as 1 and False as 0. You may not want to count on that, in which case you can rewrite it (squiguy's answer):

if sum(1 for i in lst if i % 2 == 0) == n:
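For reference, here is a short sketch of why both spellings give the same count: `bool` is a subclass of `int` in Python, so `True` adds as 1 and `False` as 0. The sample list `lst` is an assumption made for illustration.

```python
lst = [1, 2, 3, 4, 5, 6]  # assumed sample data

# bool is a subclass of int, so True and False behave as 1 and 0 in arithmetic.
assert issubclass(bool, int)
assert True + True == 2

count_bools = sum(i % 2 == 0 for i in lst)      # adds up True/False values
count_ones = sum(1 for i in lst if i % 2 == 0)  # adds a 1 for each match

print(count_bools, count_ones)  # 3 3
```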

But you might want to factor this out into a function:

def count_matches(predicate, iterable):
    return sum(predicate(i) for i in iterable)

And at that point, it might arguably be more readable to filter the list and count the length of the resulting filtered iterable instead:

def ilen(iterable):
    return sum(1 for _ in iterable)

def count_matches(predicate, iterable):
    return ilen(filter(predicate, iterable))

However, the downside of all of these variations, as with any use of map or filter, is that your predicate has to be a function, not just an expression. That's fine when you just want to check that some_function(x) returns True, but when you want to check x % 2 == 0, you have to take the extra step of wrapping it in a function, like this:

if count_matches(lambda x: x % 2 == 0, lst) == n:

… at which point I think you lose more readability than you gain.


Since you asked for the fastest—even though that's probably misguided, since I'm sure any of these solutions are more than fast enough for almost any app, and this is unlikely to be a hotspot anyway—here are some tests with 64-bit CPython 3.3.2 on my computer with a length of 250:

32.9 µs: sum(not x % 2 for x in lst)
33.1 µs: i=0\nfor x in lst: if not x % 2: i += 1\n
34.1 µs: sum(1 for x in lst if not x % 2)
34.7 µs: i=0\nfor x in lst: if x % 2 == 0: i += 1\n
35.3 µs: sum(x % 2 == 0 for x in lst)
37.3 µs: sum(1 for x in lst if x % 2 == 0)
52.5 µs: ilen(filter(lambda x: not x % 2, lst))
56.7 µs: ilen(filter(lambda x: x % 2 == 0, lst))

So, as it turns out, at least in 64-bit CPython 3.3.2, whether you use an explicit loop, sum up False and True, or sum up 1s if True makes very little difference; using not instead of == 0 makes a bigger difference in some cases than in others; but even the slowest of these is only about 13% slower than the fastest (37.3 µs vs. 32.9 µs). Only the filter-based versions fall noticeably behind.

So I would use whichever one you find most readable. And, if the slowest one isn't fast enough, the fastest one probably isn't either, which means you will probably need to rearrange your app to use NumPy, run your app in PyPy instead of CPython, write custom Cython or C code, or do something else a lot more drastic than just reorganizing this trivial algorithm.
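Timings of this kind can be collected with the standard `timeit` module. The sketch below is not the harness used for the numbers above: the test data, the labels, and the repeat counts are all assumptions, and absolute results will differ by machine and Python version.

```python
import timeit

lst = list(range(250))  # assumed test data; the original input is not shown

candidates = {
    "sum of bools": "sum(x % 2 == 0 for x in lst)",
    "sum of ones": "sum(1 for x in lst if x % 2 == 0)",
    "filter + count": "sum(1 for _ in filter(lambda x: x % 2 == 0, lst))",
}

number = 200  # calls per measurement
results = {}
for label, stmt in candidates.items():
    # Keep the best of several repeats to reduce noise from other processes.
    best = min(timeit.repeat(stmt, globals={"lst": lst}, number=number, repeat=5))
    results[label] = best / number * 1e6  # microseconds per call
    print(f"{results[label]:8.1f} us: {stmt}")
```

Taking the minimum of the repeats follows the usual advice for `timeit`: slower runs mostly reflect interference from the rest of the system rather than the code under test.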

For comparison, here are some NumPy implementations (assuming lst is a np.ndarray rather than a list):

 6.4 µs: len(lst) - np.count_nonzero(lst % 2)
 8.5 µs: np.count_nonzero(lst % 2 == 0)
17.5 µs: np.sum(lst % 2 == 0)

Even the most obvious translation to NumPy is almost twice as fast; with a bit of work you can get it 3x faster still.

And here's the result of running the exact same code in PyPy (3.2.3/2.1b1) instead of CPython:

14.6 µs: sum(not x % 2 for x in lst)

More than twice as fast with no change in the code at all.

You might want to look into NumPy.

For example:

In [16]: import numpy as np 
In [17]: a = np.arange(5)

In [18]: a
Out[18]: array([0, 1, 2, 3, 4])

In [19]: np.sum(a % 2 == 0)
Out[19]: 3

Timings:

In [14]: %timeit np.sum(np.arange(100000) % 2 == 0)
100 loops, best of 3: 3.03 ms per loop

In [15]: %timeit sum(ele % 2 == 0 for ele in range(100000))
10 loops, best of 3: 17.8 ms per loop

However, if you account for the conversion from list to numpy.array, NumPy is not faster:

In [20]: %timeit np.sum(np.array(range(100000)) % 2 == 0)
10 loops, best of 3: 23.5 ms per loop

Edit:

@abarnert's solution is the fastest:

In [36]: %timeit(len(np.arange(100000)) - np.count_nonzero(a % 2))
10000 loops, best of 3: 80.4 us per loop

I would use a while loop:

l=[1,2,3,4,5]

mods, tgt=0,2
while mods<tgt and l:
    if l.pop(0)%2==0:
        mods+=1

print(l,mods)  

If you are concerned about 'fastest', replace the list with a deque, since list.pop(0) has to shift every remaining element while deque.popleft() does not:

from collections import deque

l=[1,2,3,4,5]
d=deque(l)
mods, tgt=0,2
while mods<tgt and d:
    if d.popleft()%2==0: mods+=1

print(d,mods)     

In either case, it is easy to read and will short circuit when the condition is met.

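As written, this short-circuits, but it stops as soon as tgt matches are found, so it confirms that at least tgt items match. An exact check would also have to verify that none of the remaining items match: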

from collections import deque

l=[1,2,3,4,5,6,7,8,9]
d=deque(l)
mods, tgt=0,2
while mods<tgt and d:
    if d.popleft()%2==0: mods+=1

print(d,mods,mods==tgt)
# deque([5, 6, 7, 8, 9]) 2 True
# answer found after 4 loops


from collections import deque

l=[1,2,3,4,5,6,7,8,9]
d=deque(l)
mods, tgt=0,2
while mods<tgt and d:
    if d.popleft()%9==0: mods+=1

print(d,mods,mods==tgt)
# deque([]) 1 False
# deque exhausted and less than 2 matches found...

You can also use an iterator over your list:

l=[1,2,3,4,5,6,7,8,9]
it=iter(l)
mods, tgt=0,2
while mods<tgt:
    try:
        if next(it)%2==0: mods+=1
    except StopIteration:
        break

print(mods==tgt)   
# True
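The loops above stop as soon as tgt matches have been found, so on their own they establish "at least tgt". One possible way to turn the iterator version into an exact check is to keep scanning the remainder for a single extra match. This is a sketch only; the helper name `exactly` and its signature are hypothetical.

```python
def exactly(iterable, predicate, tgt):
    """Return True if exactly tgt items satisfy predicate (hypothetical helper)."""
    it = iter(iterable)
    mods = 0
    # Phase 1: advance until tgt matches are found or the input runs out.
    while mods < tgt:
        try:
            item = next(it)
        except StopIteration:
            return False  # fewer than tgt matches
        if predicate(item):
            mods += 1
    # Phase 2: the count is exact only if no remaining item matches.
    # any() stops at the first extra match, so this still short-circuits.
    return not any(predicate(x) for x in it)


l = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(exactly(l, lambda x: x % 2 == 0, 2))  # False: four even numbers
print(exactly(l, lambda x: x % 2 == 0, 4))  # True
print(exactly(l, lambda x: x % 9 == 0, 2))  # False: only one multiple of 9
```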

You could use the sum built in with your condition and check that it equals your n value.

l = [1, 2, 3, 4, 5]
n = 2
if n == sum(1 for i in l if i % 2 == 0):
    print(True)

Why don't you just use filter() ?

Ex.: Checking number of even integers in a list:

>>> a_list = [1, 2, 3, 4, 5]
>>> matches = list(filter(lambda x: x%2 == 0, a_list))
>>> matches
[2, 4]

then if you want the number of matches:

>>> len(matches)
2

And finally your answer:

>>> if len(matches) == 2:
        do_something()

Build a generator that returns 1 for each item that matches the criteria, limit that generator to at most n + 1 items, and check that the sum of the ones is equal to the number you're after, e.g.:

from itertools import islice

data = [1,2,3,4,5]
N = 2
items = islice((1 for el in data if el % 2 == 0), N + 1)
has_N = sum(items) == N
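The same idea can be packaged as a function taking an arbitrary predicate. This is a sketch under assumptions: the names `has_exactly` and `is_even` are made up for illustration. The infinite `itertools.count()` input is there only to show that the scan really does stop after the (n + 1)-th match.

```python
from itertools import count, islice


def has_exactly(data, predicate, n):
    # Yield a 1 per match, but never pull more than n + 1 of them:
    # seeing an (n + 1)-th match is already enough to answer False.
    ones = (1 for el in data if predicate(el))
    return sum(islice(ones, n + 1)) == n


def is_even(x):
    return x % 2 == 0


print(has_exactly([1, 2, 3, 4, 5], is_even, 2))  # True
print(has_exactly(range(10), is_even, 2))        # False
print(has_exactly(count(), is_even, 2))          # False, and it terminates
```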

This works:

>>> l = [1,2,3,4,5]
>>> n = 2
>>> a = 0  # Number of items that meet the condition
>>> for x in l:
...     if x % 2 == 0:
...         a += 1
...         if a > n:
...             break
...
>>> a == n
True
>>>

It has the advantage of running through the list only once, and it stops early once more than n matches have been seen.

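Lazy, itertools-style filtering is a useful shortcut for list trolling tasks (in Python 3 the built-in filter() is lazy; in Python 2 use itertools.ifilter() instead):

# where expr is a lambda, such as 'lambda a: a % 2 == 0'
def exact_match_count(expr, limit, *values):
    passes = filter(expr, values)  # lazy iterator over the matching values
    counter = 0
    while counter <= limit:  # one match past the limit is enough to decide
        try:
            next(passes)
            counter += 1
        except StopIteration:  # no more matches
            break
    return counter == limit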

If you're concerned about memory, tweak the signature so that values is passed as a generator rather than collected into a tuple by *values.
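One way to read that suggestion, sketched here assuming Python 3: accept any iterable as a regular parameter, so the caller can pass a generator and nothing is copied into a tuple. The name `exact_match_count_iter` is invented for this illustration.

```python
from itertools import islice


def exact_match_count_iter(expr, limit, values):
    """Like exact_match_count, but values is any iterable (e.g. a generator)."""
    passes = filter(expr, values)  # lazy in Python 3
    # Pull at most limit + 1 matches; one extra is enough to rule out equality.
    counter = sum(1 for _ in islice(passes, limit + 1))
    return counter == limit


squares = (n * n for n in range(10**6))  # generator, consumed only as far as needed
print(exact_match_count_iter(lambda a: a % 2 == 0, 3, squares))  # False
```

Because both `filter` and `islice` are lazy, only the first few squares are ever computed in this call.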

Any candidate for "the fastest solution" needs to have a single pass over the input and an early-out.

Here is a good base-line starting point for a solution:

>>> s = [1, 2, 3, 4, 5]
>>> matched = 0
>>> for x in s:
...     if x % 2 == 0:
...         matched += 1
...         if matched > 2:
...             print('More than two matched')
...             break  # early-out; also skips the else clause below
... else:
...     if matched == 2:
...         print('Exactly two matched')
...     else:
...         print('Fewer than two matched')
...
Exactly two matched

Here are some ideas for improving on the algorithmically correct baseline solution:

  1. Optimize the computation of the condition. For example, replace x % 2 == 0 with not x & 1. This is called reduction in strength.

  2. Localize the variables. Since global lookups and assignments are more expensive than local variable assignments, the exact match test will run faster if it is inside a function.

    For example:

     def two_evens(iterable):
         'Return true if exactly two values are even'
         matched = 0
         for x in iterable:
             if x % 2 == 0:
                 matched += 1
                 if matched > 2:
                     return False
         return matched == 2
  3. Remove the interpreter overhead by using itertools to drive the looping logic.

    For example, the built-in filter() (itertools.ifilter() in Python 2) can isolate the matches at C speed:

     >>> list(filter(None, [False, True, True, False, True]))
     [True, True, True]

    Likewise, itertools.islice() can implement the early-out logic at C speed:

     >>> from itertools import islice
     >>> list(islice(range(10), 0, 3))
     [0, 1, 2]

    The built-in sum() function can tally the matches at C speed:

     >>> sum([True, True, True])
     3

    Put these together to check for an exact number of matches:

     >>> s = [False, True, False, True, False, False, False]
     >>> sum(islice(filter(None, s), 0, 3)) == 2
     True
  4. These optimizations are only worth doing if it is an actual bottleneck in a real program. That would typically only occur if you're going to make many such exact-match-count tests. If so, then there may be additional savings by caching some of the intermediate results on the first pass and then reusing them on subsequent tests.

    For example, if there is a complex condition, the sub-condition results can potentially be cached and reused.

    Instead of:

     check_exact(lambda x: x%2==0 and x<10 and f(x)==3, dataset, matches=2)
     check_exact(lambda x: x<10 and f(x)==3, dataset, matches=4)
     check_exact(lambda x: x%2==0 and f(x)==3, dataset, matches=6)

    Pre-compute all the conditions (only once per data value):

     evens = list(map(lambda x: x%2==0, dataset))
     under_tens = list(map(lambda x: x<10, dataset))
     f_threes = list(map(lambda x: f(x)==3, dataset))
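A hedged sketch of how such precomputed results might then be reused: each sub-condition is evaluated once per data value and stored as a list of booleans, and each compound test only combines the stored flags. The function `f`, the `dataset`, and the helper `count_all` are placeholders assumed for illustration; the original does not define them.

```python
def f(x):
    # Placeholder for an expensive function (assumption for this sketch).
    return x % 5


dataset = list(range(20))  # assumed sample data

# Evaluate every sub-condition exactly once per data value.
evens = [x % 2 == 0 for x in dataset]
under_tens = [x < 10 for x in dataset]
f_threes = [f(x) == 3 for x in dataset]


def count_all(*flag_lists):
    """Count the positions where every given flag list is True."""
    return sum(all(flags) for flags in zip(*flag_lists))


print(count_all(evens, under_tens, f_threes) == 1)  # True: only x = 8
print(count_all(under_tens, f_threes) == 2)         # True: x = 3 and x = 8
print(count_all(evens, f_threes) == 2)              # True: x = 8 and x = 18
```

This version trades the early-out for reuse: every flag list covers the whole dataset, which only pays off when the same sub-conditions appear in many tests.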

A simple way to do it:

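def length_is(iterable, size):
    it = iter(iterable)  # avoid shadowing the built-in iter()

    # Consume the first `size` items; running out early means too few.
    for _ in range(size):
        try:
            next(it)
        except StopIteration:
            return False  # too few

    # The length is exactly `size` only if nothing is left.
    try:
        next(it)
        return False  # too many
    except StopIteration:
        return True

length_is((i for i in data if i % 2 == 0), 2)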

Here's a slightly sillier way to write it:

class count(object):
    def __init__(self, iter):
        self.iter = iter

    __eq__ = lambda self, n: length_is(self.iter, n)

Giving:

count(i for i in data if i % 2 == 0) == 2
