
Filter list of tuples fast

How do I filter a list of tuples efficiently with Python based on whether the first item is the same as the third?

Suppose I have old_data and I want new_data:

old_data = [(2,3,2), (3,4,4), (7,6,7), (2,1,2), (5,7,2)]

new_data = [(3,4,4), (5,7,2)]

My current solution (list comprehension) is too slow:

new_data_too_slow = [x for x in old_data if x[0] != x[2]]

This data is many millions of rows, and I do need to return a list of tuples in the same format.

I'm not sure how you're using your data (this is important!) but changing to a generator may give you a performance boost.

All you have to do is change the square brackets [ ] to parentheses ( ).

new_data_gen = (x for x in old_data if x[0] != x[2])

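Again, it depends on how you're using it. A generator is lazy: creating it costs almost nothing, and each row is filtered only when you ask for it, so if the results are being written out or sent elsewhere, the filtering will rarely be the bottleneck compared with the I/O. Also, because it's a generator, you can iterate over it only once, but you will use significantly less memory.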
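Here is a minimal sketch of that behaviour, using the sample data from the question. The names lazy, materialized and second_pass are only illustrative. Note that calling list() on the generator does the same work as the original list comprehension, so this helps only if you can consume the rows one at a time.

```python
old_data = [(2, 3, 2), (3, 4, 4), (7, 6, 7), (2, 1, 2), (5, 7, 2)]

# Generator expression: nothing is filtered until you iterate over it.
lazy = (x for x in old_data if x[0] != x[2])

# Consuming the generator yields the same tuples as the list comprehension.
materialized = list(lazy)

# The generator is now exhausted, so a second pass produces nothing.
second_pass = list(lazy)
```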

This could be done efficiently using numpy since its operations are coded in C.

In [16]: import numpy as np

In [17]: old_data = [(2,3,2), (3,4,4), (7,6,7), (2,1,2), (5,7,2)]

In [18]: np_data = np.asarray(old_data)

In [19]: new_data = np_data[np_data[:, 0] != np_data[:, 2]]

In [20]: new_data
Out[20]: array([[3, 4, 4],
                [5, 7, 2]])

The comparison of the first and third items will be much faster this way. The conversion to NumPy is not free, though: np.asarray (as opposed to just np.array) avoids a copy only when its input is already an array of a compatible type. A Python list of tuples has to be copied element by element into a new array, so include that conversion when you time this approach.
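If you want to check when np.asarray actually copies, a small sketch like the following may help. The variable names same_object, copied and shares are mine, not part of the original answer.

```python
import numpy as np

old_data = [(2, 3, 2), (3, 4, 4), (7, 6, 7), (2, 1, 2), (5, 7, 2)]

# From a Python list, NumPy has to allocate a new array and copy every value.
np_data = np.asarray(old_data)

# From an existing ndarray, asarray hands back the very same object (no copy).
same_object = np.asarray(np_data) is np_data

# np.array copies by default, even when it is given an ndarray.
copied = np.array(np_data)
shares = np.shares_memory(copied, np_data)
```

Here same_object should be True and shares should be False.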

At this point new_data is a NumPy array. If that suffices, you can iterate over its rows much as you would over a list of tuples (each row is itself a small array), but you can easily turn it into a list of lists...

In [22]: new_data.tolist()
Out[22]: [[3, 4, 4], [5, 7, 2]]

...and then into a list of tuples with a list comprehension if it is really necessary for your purposes.
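Put together, the round trip back to a list of tuples might look like this. It is only a sketch of the steps described above; mask is a name introduced here for readability.

```python
import numpy as np

old_data = [(2, 3, 2), (3, 4, 4), (7, 6, 7), (2, 1, 2), (5, 7, 2)]

np_data = np.asarray(old_data)

# Boolean mask: True where the first and third columns differ.
mask = np_data[:, 0] != np_data[:, 2]

# tolist() yields plain Python ints; tuple() restores the original row format.
new_data = [tuple(row) for row in np_data[mask].tolist()]
```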

Here are some timings of the comparison part on some generated data with a million rows with all elements either 0 or 1.

In [58]: test_data = np.random.randint( 0, 2, size=(1000000,3) )

In [59]: test_data
Out[59]: 
array([[1, 1, 0],
       [1, 0, 0],
       [0, 1, 0],
       ..., 
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 1]])

In [60]: %%timeit                                              
new_data = test_data[test_data[:,0] != test_data[:,2]]
   ....: 
10 loops, best of 3: 26.2 ms per loop

In [61]: test_data = test_data.tolist()                   

In [62]: %%timeit                                               
new_data = [ x for x in test_data if x[0] != x[2] ]
   ....: 
1 loop, best of 3: 345 ms per loop
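The timings above cover only the comparison. Since the question needs a list of tuples in and a list of tuples out, it may be worth timing the whole round trip on your own data. The sketch below uses the standard timeit module. The function names with_comprehension and with_numpy and the 100,000-row size are assumptions chosen for illustration, and no particular result is claimed.

```python
import timeit

import numpy as np

# Generated 0/1 data as a list of tuples, similar to the test above.
rng = np.random.default_rng(0)
rows = [tuple(r) for r in rng.integers(0, 2, size=(100_000, 3)).tolist()]


def with_comprehension(data):
    # Pure-Python filter, as in the question.
    return [x for x in data if x[0] != x[2]]


def with_numpy(data):
    # Includes the conversion to an array and back to a list of tuples.
    arr = np.asarray(data)
    kept = arr[arr[:, 0] != arr[:, 2]]
    return [tuple(r) for r in kept.tolist()]


runs = 5
t_list = timeit.timeit(lambda: with_comprehension(rows), number=runs)
t_numpy = timeit.timeit(lambda: with_numpy(rows), number=runs)
print(f"list comprehension: {t_list / runs:.4f} s per run")
print(f"numpy round trip:   {t_numpy / runs:.4f} s per run")
```

If your data can stay in a NumPy array from the start, only the masking step needs to be timed.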
