
Python: fastest way to extract sublist from a list of objects given an attribute

Say I have this simple class:

class Foo(object):
    def __init__(self, number, name):
        self.number = number
        self.name = name

and a list of Foo instances:

l = [Foo(10, 'a'), Foo(9, 'a'), Foo(8, 'a'), Foo(7, 'a'), Foo(5, 'b'), Foo(4, 'b'), Foo(3, 'b')]

Say that the 'name' attribute can only be either 'a' or 'b'.

What is the fastest way to extract the sublist of all the objects whose 'name' is 'a' (or 'b')? Notice that this operation might be called several million times and this is why I want to optimize it as much as I can.

Note that the list is built so that all the elements with the same name are 'grouped together' in the first or second part of the list. The list is ordered by decreasing 'number'. EDIT: The list is not necessarily symmetric; there may be different numbers of 'a' and 'b' objects.


How I do it:

In the beginning I was just doing a for loop:

sublist = []
for o in l:
    if o.name == 'a':
        sublist.append(o)

Then I tried with a list comprehension:

sublist = [o for o in l if o.name=='a']

But this seems to be approximately the same if not a bit slower.

Either way, neither of those exploits the fact that objects with the same name are already 'grouped together' in the original (sorted) list. Both keep looping even when it is no longer necessary. Speed is very important, so I need this to be as fast as possible.

Just break out of the loop once you hit a non-match after a run of matches:

sublist = []
for o in l:
    if o.name == 'a':
        sublist.append(o)
    elif sublist:
        break

If you wanted to use generators, you could use the itertools functions

from itertools import takewhile, dropwhile

sublist = list(takewhile(lambda o: o.name == 'a', dropwhile(lambda o: o.name != 'a', l)))

These both exploit the fact that the list is sorted and stop processing the list after the items stop matching.
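As a quick sanity check, here is a minimal sketch that wraps both approaches in functions and confirms they agree with the plain list comprehension. The function names `sublist_break` and `sublist_itertools` are made up for this illustration; the `Foo` class and the list are the ones from the question.

```python
from itertools import dropwhile, takewhile


class Foo(object):
    def __init__(self, number, name):
        self.number = number
        self.name = name


def sublist_break(items, name):
    """Loop version: stop at the first non-match after a run of matches."""
    result = []
    for o in items:
        if o.name == name:
            result.append(o)
        elif result:
            break
    return result


def sublist_itertools(items, name):
    """itertools version: skip leading non-matches, then take the matching run."""
    matching = takewhile(lambda o: o.name == name,
                         dropwhile(lambda o: o.name != name, items))
    return list(matching)


l = [Foo(10, 'a'), Foo(9, 'a'), Foo(8, 'a'), Foo(7, 'a'),
     Foo(5, 'b'), Foo(4, 'b'), Foo(3, 'b')]

# Both early-exit versions should match the full-scan list comprehension.
for wanted in ('a', 'b'):
    expected = [o for o in l if o.name == wanted]
    assert sublist_break(l, wanted) == expected
    assert sublist_itertools(l, wanted) == expected

print([o.number for o in sublist_break(l, 'a')])      # [10, 9, 8, 7]
print([o.number for o in sublist_itertools(l, 'b')])  # [5, 4, 3]
```

Note that the early exit only saves work after the matching run ends. When asking for 'b' in this list, both versions still have to walk past every 'a' first.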

Since the name attribute can only be 'a' or 'b' which are ordered and you have the same number of 'a' and 'b', the simplest way would be to find the middle point and slice the list:

mid = len(l) // 2
sublist = l[:mid]

The above will give you all 'a' while l[mid:] gives all 'b'.


Edit: Since the question was changed and the numbers of 'a' and 'b' elements are no longer guaranteed to be equal, the above answer no longer works.

Depending on the length of the list, my guess would be that either binary search (for longer lists) or breaking out of the loop as Brendan suggested (for shorter ones) would be the fastest approach.

Use binary search to find the boundary between the 'a' and 'b' elements in O(log N):

In [19]: class Foo(object):
    ...:     def __init__(self, number, name):
    ...:         self.number = number
    ...:         self.name = name
    ...:         
    ...:     def __repr__(self):
    ...:         return 'Foo(number={self.number}, name={self.name})'.format(self=self)
    ...:     

In [20]: def binary_search(lst, predicate):
    ...:     """
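    ...:     Finds the index of the first element for which predicate(x) is true.
    ...:     Assumes every element that fails the predicate comes before every element that passes it.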
    ...:     """
    ...:     lo, hi = 0, len(lst)
    ...:     while lo < hi:
    ...:         mid = (lo + hi) // 2
    ...:         if predicate(lst[mid]):
    ...:             hi = mid
    ...:         else:
    ...:             lo = mid + 1
    ...:     return lo
    ...: 

In [21]: l = [Foo(10, 'a'), Foo(9, 'a'), Foo(8, 'a'), Foo(7, 'a'),
    ...:      Foo(5, 'b'), Foo(4, 'b'), Foo(3, 'b')]

In [22]: binary_search(l, lambda x: x.name == 'b')
Out[22]: 4

In [23]: l[:binary_search(l, lambda x: x.name == 'b')]
Out[23]: 
[Foo(number=10, name=a),
 Foo(number=9, name=a),
 Foo(number=8, name=a),
 Foo(number=7, name=a)]

In [24]: l[binary_search(l, lambda x: x.name == 'b'):]
Out[24]: [Foo(number=5, name=b), Foo(number=4, name=b), Foo(number=3, name=b)]
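The same boundary can probably also be found with the standard library's `bisect` module rather than a hand-written search. The sketch below is one possible way to do it, not part of the original answer: `NameView` and `split_point` are hypothetical helper names, and it assumes the names are sorted in ascending order ('a' before 'b'), as in the example list.

```python
from bisect import bisect_left


class Foo(object):
    def __init__(self, number, name):
        self.number = number
        self.name = name


class NameView(object):
    """Read-only view exposing each element's name, looked up lazily on access."""

    def __init__(self, items):
        self._items = items

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index].name


def split_point(items, second_name='b'):
    """Index of the first element named second_name (names assumed sorted ascending)."""
    # bisect_left only touches O(log N) positions of the view, so no copy is made.
    return bisect_left(NameView(items), second_name)


l = [Foo(10, 'a'), Foo(9, 'a'), Foo(8, 'a'), Foo(7, 'a'),
     Foo(5, 'b'), Foo(4, 'b'), Foo(3, 'b')]

cut = split_point(l)
print(cut)                            # 4
print([o.number for o in l[:cut]])    # [10, 9, 8, 7]
print([o.number for o in l[cut:]])    # [5, 4, 3]
```

On Python 3.10 and later, `bisect_left(l, 'b', key=lambda o: o.name)` should give the same index without the wrapper class, since `bisect` gained a `key` parameter in that release.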

However, note that:

  1. The naive approach with O(N) complexity should take less than 1 second to complete for 10^4 elements.
  2. When making a copy (the slice), you still need to iterate over the array, which results in O(N).
  3. If you are facing performance issues, it is a good idea to use a profiler to find the bottlenecks in your program. Iterating over 10^4 elements is usually not a bottleneck (unless you are iterating 10^4 times over 10^4 elements, which results in 10^8 operations). However, querying 10^4 items from a database may be a bottleneck, as it also uses the network, may query other items, and so on. When in doubt, use a profiler.
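To see where the crossover between the approaches falls on your own machine, a rough measurement with `timeit` could look like the sketch below. The helpers `make_list`, `by_comprehension` and `by_binary_search` are names invented for this example, and the list sizes are arbitrary; treat the numbers it prints as indicative only and profile the real program before optimizing.

```python
import timeit


class Foo(object):
    def __init__(self, number, name):
        self.number = number
        self.name = name


def binary_search(lst, predicate):
    """Index of the first element for which predicate(x) is true."""
    lo, hi = 0, len(lst)
    while lo < hi:
        mid = (lo + hi) // 2
        if predicate(lst[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def make_list(n_a, n_b):
    """Build a list shaped like the question's: all 'a' items, then all 'b' items."""
    total = n_a + n_b
    return ([Foo(total - i, 'a') for i in range(n_a)] +
            [Foo(n_b - i, 'b') for i in range(n_b)])


def by_comprehension(items):
    return [o for o in items if o.name == 'a']


def by_binary_search(items):
    return items[:binary_search(items, lambda o: o.name == 'b')]


for size in (10, 1000, 10000):
    items = make_list(size // 2, size - size // 2)
    # Make sure both approaches return the same sublist before timing them.
    assert by_comprehension(items) == by_binary_search(items)
    for func in (by_comprehension, by_binary_search):
        seconds = timeit.timeit(lambda: func(items), number=100)
        print('{:>6} items  {:<18} {:.6f} s per 100 calls'.format(
            size, func.__name__, seconds))
```

For a view of the whole program rather than one function, the standard `cProfile` module (`python -m cProfile -s cumulative your_script.py`) is the usual starting point.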

