I have a DataFrame with 1 million rows and 10 columns. Each column holds a list of elements (possibly empty, with up to 5 elements). I also have another list with 100,000 elements, and I want to keep only the rows of the DataFrame for which a given column (say columnA) contains any element from that big list. This is my current code:
df = df[df["columnA"].apply(lambda x: any(value in valuesList for value in x))]
but it takes an enormous amount of time to run. How can I speed it up?
Your algorithm effectively runs three nested loops, so its complexity is O(n^3) (more precisely O(n·m·k), since the three loop sizes differ). The first loop iterates over all rows. The second iterates over the values in a cell. The third is hidden inside the `in` check: testing whether a Python list contains a value is a linear scan over its 100,000 elements. As @Marat suggests: use a set. Membership testing on a set is constant time, O(1) on average, which eliminates the innermost scan and reduces the complexity to O(n^2) (O(n·m)).
s = set(valuesList)  # build the set once, before filtering
df = df[df["columnA"].apply(lambda x: any(value in s for value in x))]
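Here is a small self-contained sketch of the set-based filter, using a hypothetical toy DataFrame and value list as stand-ins for the real 1-million-row data. It also shows an alternative (not from the answer above, but built on the same set idea): `set.isdisjoint`, which checks for a shared element in C rather than in a Python-level `any()` loop.

```python
import pandas as pd

# Hypothetical stand-in for the real data: each cell of columnA
# holds a list of 0-5 elements.
df = pd.DataFrame({
    "columnA": [[1, 2], [], [7, 8, 9], [3], [10]],
    "columnB": ["a", "b", "c", "d", "e"],
})
valuesList = [2, 9, 42]  # stand-in for the 100,000-element list

s = set(valuesList)  # built once, O(1) average-case membership tests

# Option 1: the any() version from the answer.
mask_any = df["columnA"].apply(lambda x: any(value in s for value in x))

# Option 2: set.isdisjoint runs the inner loop in C and returns
# as soon as a shared element is found; note the negation.
mask_disjoint = df["columnA"].apply(lambda x: not s.isdisjoint(x))

assert mask_any.equals(mask_disjoint)  # both masks select the same rows
filtered = df[mask_disjoint]
print(filtered["columnB"].tolist())  # -> ['a', 'c']
```

Both masks keep only rows whose columnA shares at least one element with the set; on large data the `isdisjoint` version may shave off additional Python-interpreter overhead, though the dominant win is replacing the list with a set either way.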