
Fastest way to check if a value or list of values is a subset of a list in python

I have a very large list of lists called main_list, holding about 13 million lists of 6 numbers each. I'm looking for a way to filter out any list that doesn't contain certain values. For example, to create a new list of lists that includes only the lists containing both 4 and 5, my code works as follows:

and_include = []
temp_list=[4,5]
for sett in main_list:
    if set(temp_list).issubset(sett):
        and_include.append(sett)

This takes about 5 seconds to run, which can be quite annoying for frequent use, so I was wondering if there's a faster way to do this using numpy or Cython.

I'm not very familiar with Cython, but I tried implementing it this way. I compiled it and all, but I got an error.

def andinclude(list main_list,list temp_list):
    and_include=[]
    for sett in main_list:
        if set(temp_list).issubset(sett):
            and_include.append(sett)
    return and_include

Hopefully there's a faster way?

Here is a numpy solution:

import numpy as np

# Randomly generate 2d array of integers
np.random.seed(1)
a = np.random.randint(low=0, high=9, size=(13000000, 6))

# Use numpy indexing to filter rows
results = a[(a == 4).any(axis=1) & (a == 5).any(axis=1)]

Results:

In [35]: print(results.shape)
(3053198, 6)

In [36]: print(results[:5])
[[5 5 4 5 5 1]
 [5 5 4 3 8 6]
 [2 5 8 1 1 4]
 [0 5 4 1 1 5]
 [3 2 5 2 4 6]]

Timing:

In [37]: %timeit results = a[(a == 4).any(axis=1) & (a == 5).any(axis=1)]
923 ms ± 38.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you need your results converted back to a list of lists rather than a 2d numpy array, you can use:

l = results.tolist()

The conversion added about 50% to the run time on my machine, but the total should still be faster than any solution that loops over Python lists.
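The same indexing idea can be extended to an arbitrary list of required values, starting from an existing list of lists rather than a random array. The following is only a sketch: the helper name rows_containing_all and the small sample main_list are made up for illustration, and it assumes every inner list has the same length so that np.asarray produces a regular 2d array.

```python
import numpy as np

def rows_containing_all(a, required):
    """Return the rows of the 2d array `a` that contain every value in `required`.

    Hypothetical helper for illustration, not part of numpy.
    """
    # Start with every row selected, then narrow the mask one value at a time.
    mask = np.ones(a.shape[0], dtype=bool)
    for value in required:
        # True for rows in which `value` appears in at least one column.
        mask &= (a == value).any(axis=1)
    return a[mask]

# Small made-up stand-in for the 13-million-row main_list.
main_list = [[5, 5, 4, 5, 5, 1],
             [1, 2, 3, 4, 6, 7],
             [2, 5, 8, 1, 1, 4]]

a = np.asarray(main_list)          # one-time conversion to a 2d array
filtered = rows_containing_all(a, [4, 5])
and_include = filtered.tolist()    # back to a list of lists, if needed
```

Converting main_list with np.asarray has its own cost, so this is most likely to pay off when the array is built once and filtered many times.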

You can use a list comprehension instead of appending in the loop. Also, you might want to store the result of set(temp_list) in a local variable so you're not calling set 13 million times for the same result.
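A minimal sketch of that suggestion, staying in pure Python. The function name and the tiny sample main_list are illustrative only, and no timing is claimed here:

```python
def and_include_rows(main_list, required_values):
    """Keep only the inner lists that contain every value in `required_values`."""
    # Build the set once, outside the loop, instead of once per inner list.
    required = set(required_values)
    # set.issubset accepts any iterable, so each inner list can be passed directly.
    return [row for row in main_list if required.issubset(row)]

# Small made-up stand-in for the real main_list.
main_list = [[1, 2, 3, 4, 5, 6],
             [4, 7, 8, 9, 10, 11],
             [5, 4, 20, 21, 22, 23]]

and_include = and_include_rows(main_list, [4, 5])
```

The result is the same as with the original loop; the difference is that the set is constructed once and the appends are handled by the comprehension.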


 