I have a DataFrame with 1 million rows and 10 columns. Each column holds a list of elements (possibly empty, with up to 5 elements). I also have another list with 100,000 elements, and I want to keep only the rows of the DataFrame for which a given column (say columnA) contains any element from that big list. This is my current code:
df = df[df["columnA"].apply(lambda x: any(value in valuesList for value in x))]
but it takes an enormous amount of time to run. How can I speed it up?
Your algorithm effectively runs three nested loops, so its complexity is O(n^3) (more precisely O(n·m·k), since the three loop sizes differ). The first loop iterates over all rows. The second iterates over the values in a cell. The third is hidden inside the `in` check: testing whether a Python list contains a value is a linear scan over its 100,000 elements. As @Marat suggests: use a set. Membership testing on a set is constant time, O(1) on average, which eliminates the innermost scan and reduces the complexity to O(n^2) (O(n·m)).
s = set(valuesList)  # build the set once, before filtering
df = df[df["columnA"].apply(lambda x: any(value in s for value in x))]
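Here is a small self-contained sketch of the set-based filter, using a hypothetical toy DataFrame and value list as stand-ins for the real 1-million-row data. It also shows an alternative (not from the answer above, but built on the same set idea): `set.isdisjoint`, which checks for a shared element in C rather than in a Python-level `any()` loop.

```python
import pandas as pd

# Hypothetical stand-in for the real data: each cell of columnA
# holds a list of 0-5 elements.
df = pd.DataFrame({
    "columnA": [[1, 2], [], [7, 8, 9], [3], [10]],
    "columnB": ["a", "b", "c", "d", "e"],
})
valuesList = [2, 9, 42]  # stand-in for the 100,000-element list

s = set(valuesList)  # built once, O(1) average-case membership tests

# Option 1: the any() version from the answer.
mask_any = df["columnA"].apply(lambda x: any(value in s for value in x))

# Option 2: set.isdisjoint runs the inner loop in C and returns
# as soon as a shared element is found; note the negation.
mask_disjoint = df["columnA"].apply(lambda x: not s.isdisjoint(x))

assert mask_any.equals(mask_disjoint)  # both masks select the same rows
filtered = df[mask_disjoint]
print(filtered["columnB"].tolist())  # -> ['a', 'c']
```

Both masks keep only rows whose columnA shares at least one element with the set; on large data the `isdisjoint` version may shave off additional Python-interpreter overhead, though the dominant win is replacing the list with a set either way.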