Check if any value in one list is present in another list (fastest solution)

I have a DataFrame with 1 million rows and 10 columns. Each cell holds a list of elements (it may be an empty list or a list with up to 5 elements). Let's say that I have another list with 100,000 elements, and I want to keep only those rows of the DataFrame for which a given column (say columnA) contains any element from my big list of 100,000 elements. This is my current code:

df = df[df["columnA"].apply(lambda x: any(value in valuesList for value in x))]

but it takes an enormous amount of time to calculate it. How can I speed up the code?

Your algorithm has three nested loops, so its cost is O(n * k * m), which is cubic in the loose sense. The first factor, n, comes from iterating through all rows. The second, k, comes from iterating through all values in a cell (at most 5 here). The third, m, comes from scanning the list against which you compare each cell value, because checking whether a list contains a particular value is a linear search. As @Marat suggests: use a set. Checking whether a set contains a particular value takes constant time, O(1), on average. This removes the third factor and reduces the complexity to O(n * k), plus a one-time O(m) cost to build the set.

s = set(valuesList)
df = df[df["columnA"].apply(lambda x: any(value in s for value in x))]
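For anyone who wants to try this on a small scale first, here is a minimal, self-contained sketch. The toy data is made up for illustration and only stands in for the real DataFrame and valuesList. It also shows set.isdisjoint as a possible variation on the any(...) generator; whether it is faster on your data is something you would need to measure.

```python
import pandas as pd

# Toy data standing in for the real 1,000,000-row frame (hypothetical values).
df = pd.DataFrame({"columnA": [[1, 2], [], [7], [3, 9, 4], [8]]})
valuesList = [2, 4, 100]

# Build the set once, outside the lambda, so it is not rebuilt per row.
s = set(valuesList)

# Same approach as above: average O(1) membership test per cell value.
mask_any = df["columnA"].apply(lambda x: any(value in s for value in x))

# Variation: isdisjoint accepts any iterable and returns True when
# nothing is shared, so negating it gives the same mask.
mask_disjoint = df["columnA"].apply(lambda x: not s.isdisjoint(x))

filtered = df[mask_any]
print(filtered)
```

Both masks keep only the rows whose list shares at least one element with the set; rows with an empty list are always dropped.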
