How do I efficiently filter a list of tuples in Python, keeping only those whose first item differs from the third?
Suppose I have old_data and I want new_data:
old_data = [(2,3,2), (3,4,4), (7,6,7), (2,1,2), (5,7,2)]
new_data = [(3,4,4), (5,7,2)]
My current solution (list comprehension) is too slow:
new_data_too_slow = [x for x in old_data if x[0] != x[2]]
The data has many millions of rows, and I do need to return a list of tuples in the same format.
I'm not sure how you're using your data (and this matters!), but switching to a generator may give you a performance boost.
All you have to do is change your [ and ] to ( and ).
new_data_too_slow = (x for x in old_data if x[0] != x[2])
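Again, it depends on how you're using it. A generator does no filtering up front; each row is tested only when the consumer asks for it, so if you process rows one at a time (for example, writing them out), the per-row cost is usually small next to the IO. Also, because it's a generator, you can iterate over it only once, but you will use significantly less memory since no second list is built.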
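To make the one-pass behaviour concrete, here is a minimal sketch using the sample data from the question. The names filtered, first_row, remaining and leftover are only illustrative.

```python
from itertools import islice

old_data = [(2, 3, 2), (3, 4, 4), (7, 6, 7), (2, 1, 2), (5, 7, 2)]

# Nothing is filtered yet; the generator only records how to do it.
filtered = (x for x in old_data if x[0] != x[2])

# Rows are produced one at a time, as the consumer asks for them.
first_row = next(filtered)               # (3, 4, 4)
remaining = list(islice(filtered, 10))   # [(5, 7, 2)]

# The generator is now exhausted; a second pass yields nothing.
leftover = list(filtered)                # []
```

If you end up calling list() on the whole generator anyway, you are back to roughly the cost of the original list comprehension, so this only helps when the rows can be consumed as a stream.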
This could be done efficiently using numpy, since its operations are implemented in C.
In [16]: import numpy as np
In [17]: old_data = [(2,3,2), (3,4,4), (7,6,7), (2,1,2), (5,7,2)]
In [18]: np_data = np.asarray(old_data)
In [19]: new_data = np_data[ np_data[:,0] != np_data[:,2 ] ]
In [20]: new_data
Out[20]: array([[3, 4, 4],
[5, 7, 2]])
The comparison of the first and third items will be much faster this way. The conversion to numpy is not free, though: np.asarray (as opposed to plain np.array) avoids a copy only when its input is already an array. A list of tuples has to be traversed and copied into a new array once, so include that step when you measure.
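A quick way to see when np.asarray copies, assuming NumPy is installed. The variable names below are illustrative only.

```python
import numpy as np

old_data = [(2, 3, 2), (3, 4, 4), (7, 6, 7), (2, 1, 2), (5, 7, 2)]

# From a Python list: a new array buffer has to be built (one full pass).
np_data = np.asarray(old_data)

# From an existing ndarray: asarray hands back the very same object.
same = np.asarray(np_data)

# np.array copies by default, even when given an ndarray.
copied = np.array(np_data)

is_same_object = same is np_data
is_copy = copied is not np_data and not np.shares_memory(copied, np_data)
```

So the copy-free path applies only if the data can be kept as an array from the start, rather than being built as a list of tuples first.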
At this point new_data is a numpy array, which you can iterate over just as if it were a list of tuples if that suffices, but you can easily turn it into a list of lists...
In [22]: new_data.tolist()
Out[22]: [[3, 4, 4], [5, 7, 2]]
...and then into a list of tuples with a list comprehension if it is really necessary for your purposes.
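Put together, the round trip back to a list of tuples might look like the following sketch. It assumes NumPy is installed, and mask is just an illustrative name.

```python
import numpy as np

old_data = [(2, 3, 2), (3, 4, 4), (7, 6, 7), (2, 1, 2), (5, 7, 2)]
np_data = np.asarray(old_data)

# Boolean mask: True where the first and third columns differ.
mask = np_data[:, 0] != np_data[:, 2]

# tolist() gives nested lists of plain Python ints;
# the list comprehension turns each inner list back into a tuple.
new_data = [tuple(row) for row in np_data[mask].tolist()]
# new_data == [(3, 4, 4), (5, 7, 2)]
```

Going through tolist() first means the tuples hold ordinary Python ints rather than numpy scalar types, which matches the original format more closely.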
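Here are some timings of the comparison step alone on generated data: a million rows, with every element either 0 or 1. The data is created directly as an array here, so converting from a list is not part of the measurement.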
In [58]: test_data = np.random.randint( 0, 2, size=(1000000,3) )
In [59]: test_data
Out[59]:
array([[1, 1, 0],
[1, 0, 0],
[0, 1, 0],
...,
[0, 0, 1],
[0, 1, 0],
[0, 1, 1]])
In [60]: %%timeit
   ....: new_data = test_data[test_data[:,0] != test_data[:,2]]
   ....:
10 loops, best of 3: 26.2 ms per loop
In [61]: test_data = test_data.tolist()
In [62]: %%timeit
   ....: new_data = [ x for x in test_data if x[0] != x[2] ]
   ....:
1 loop, best of 3: 345 ms per loop