Check whether any value in one list exists in another list (fastest solution)

Question

I have a DataFrame with 1 million rows and 10 columns. Each column holds lists of elements (a cell may be an empty list or a list with up to 5 elements). Let's say I have another list with 100000 elements, and I want to keep only those rows of the DataFrame for which a given column (say columnA) contains any element from my big list of 100000 elements. This is my current code:

df = df[df["columnA"].apply(lambda x: any(value in valuesList for value in x))]

but it takes an enormous amount of time to run. How can I speed up the code?

Answer 1

The complexity of your algorithm is O(n^3), where each n stands for a different quantity. The first n is for iterating through all rows. The second n is for iterating through all values in a cell. The third n is for iterating through the list items against which you compare the cell values (checking `value in valuesList` scans the list element by element). With at most 5 values per cell, the cost is dominated by rows times list length, so up to about 1 million × 5 × 100000 comparisons. As @Marat suggests: use a set. Checking whether a set contains a particular value takes constant time, O(1), on average. This reduces the complexity to O(n^2), since the list-length factor disappears.

s = set(valuesList)
df = df[df["columnA"].apply(lambda x: any(value in s for value in x))]
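
To make the set-based filter easy to try, here is a minimal sketch on a tiny, made-up DataFrame. The sample data, the `other` column, and the names `mask_any`, `mask_disjoint`, and `mask_isin` are assumptions for illustration only. The two variants are alternatives that were not part of the original answer, and whether they are faster on your data is something to measure rather than assume.

```python
import pandas as pd

# Hypothetical sample data standing in for the real 1,000,000-row frame.
df = pd.DataFrame({
    "columnA": [[1, 2], [], [7], [3, 9, 4], [8]],
    "other": list("abcde"),
})
valuesList = [2, 3, 5]  # stands in for the 100000-element list

# Build the set once, outside the lambda, so it is not rebuilt per row.
s = set(valuesList)

# Approach from the answer: one average O(1) set lookup per cell value.
mask_any = df["columnA"].apply(lambda cell: any(value in s for value in cell))

# Variant: set.isdisjoint performs the same "any common element" test.
mask_disjoint = df["columnA"].apply(lambda cell: not s.isdisjoint(cell))

# Variant: flatten the lists with explode (empty lists become NaN),
# test membership with isin, then collapse back to one boolean per
# original row by grouping on the index. Assumes the index is unique.
exploded = df["columnA"].explode()
mask_isin = exploded.isin(s).groupby(level=0).any()

filtered = df[mask_any]
```

All three masks select the same rows here. The `explode` variant relies on a unique index, so call `reset_index(drop=True)` first if yours contains duplicates.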

Solution 1
0 2022-06-30 15:24:18
