Filtering a DataFrame Using Lambda in Python

Question

I have two dataframes in Python, df and list:

data1 = [[0, ("a","b")], [1, ("d","e")], [2, ("a","e")],[3,("f", "g")],[4,("c","h")]]
df = pd.DataFrame(data1, columns = ['Row', 'Letters'])
data2 = [[0,"a"],[1,"b"],[2,"c"]]
list = pd.DataFrame(data2, columns = ['Row', 'Letters'])

I now want to filter df down to only the rows where any item in df['Letters'] is found in list['Letters'].

The any function works fine for individual rows:

any(item in df["Letters"][1] for item in list['Letters'])
any(item in df["Letters"][2] for item in list['Letters'])

These correctly return False and True, respectively.

Now how do I filter the entire dataframe?

I tried the following code:

new_df = df[df.apply(lambda x : any(item in x["Letters"] for item in list), axis=1)]

which returns an empty dataframe, when I want it to return only rows 0, 2 and 4.

Any help would be appreciated.

Answer 1

You can use the DataFrame constructor with stack, then compare using Series.isin and reduce with any over level=0:

df[pd.DataFrame(df['Letters'].tolist()).stack().isin(list_['Letters']).any(level=0)]

  Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)
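
On newer pandas the level argument of any is no longer available (it was deprecated in 1.3 and removed in 2.0), so the one-liner above may fail there. A minimal sketch of an equivalent, assuming the same df and list_ as above and that df has the default 0..n-1 index, groups on the first index level instead:

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
list_ = pd.DataFrame([[0, "a"], [1, "b"], [2, "c"]], columns=["Row", "Letters"])

# One column per tuple position, then stack into a Series indexed by (row, position).
stacked = pd.DataFrame(df["Letters"].tolist()).stack()

# True for each row that has at least one letter present in list_["Letters"].
mask = stacked.isin(list_["Letters"]).groupby(level=0).any()

result = df[mask]
print(result)
```

The groupby(level=0).any() step plays the role that any(level=0) plays in the original expression.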

Note: I have renamed the variable list to list_, since a variable should not have the same name as a builtin function.
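
As for why the attempt in the question matched nothing: iterating over a DataFrame directly yields its column labels rather than its values, so the lambda was most likely comparing 'Row' and 'Letters' against each tuple. A small sketch of that behaviour and of the corrected lambda, assuming the renamed list_ frame (the names labels and fixed are only for illustration):

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
list_ = pd.DataFrame([[0, "a"], [1, "b"], [2, "c"]], columns=["Row", "Letters"])

# Iterating over the frame itself gives column labels, not cell values.
labels = [item for item in list_]
print(labels)

# Iterating over the 'Letters' column gives the values we actually want to test.
fixed = df[df.apply(lambda row: any(item in row["Letters"] for item in list_["Letters"]), axis=1)]
print(fixed)
```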

Benchmarking on a larger dataframe:

m = pd.concat([df]*10000,ignore_index=True)
%%timeit
m[pd.DataFrame(m['Letters'].tolist()).stack().isin(list_['Letters']).any(level=0)]
#25.3 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
m.loc[~m['Letters'].apply(lambda x: set(x).isdisjoint(set(list_['Letters'])))]
#644 ms ± 8.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
m[m.Letters.apply(lambda x : any(item in list_.Letters.to_numpy().tolist() for item in x))]
#665 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
m.loc[m['Letters'].apply(lambda x: len(set(x).intersection(set(list_['Letters']))) > 0)]
#707 ms ± 56.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 2

Here's a way using set intersection: we convert each tuple into a set and check whether the length of its intersection with the lookup letters is greater than 0 (lst here is the question's list dataframe under a different name):

df.loc[df['Letters'].apply(lambda x: len(set(x).intersection(set(lst['Letters']))) > 0)]

   Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)

You can also use the isdisjoint method to get the same result:

df.loc[~df['Letters'].apply(lambda x: set(x).isdisjoint(set(lst['Letters'])))]
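
As a variant of the isdisjoint approach, the lookup set can be built once before apply so that it is not rebuilt for every row. This is only a sketch; the name allowed is an assumption introduced here, and lst stands for the lookup dataframe as above:

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
lst = pd.DataFrame([[0, "a"], [1, "b"], [2, "c"]], columns=["Row", "Letters"])

# Build the lookup set a single time (hypothetical helper name).
allowed = set(lst["Letters"])

# set.isdisjoint accepts any iterable, so each tuple can be passed directly.
mask = df["Letters"].apply(lambda letters: not allowed.isdisjoint(letters))

result = df.loc[mask]
print(result)
```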

Answer 3

You can do it this way:

import pandas as pd

data1 = [[0, ("a","b")], [1, ("d","e")], [2, ("a","e")],[3,("f", "g")],[4,("c","h")]]
df = pd.DataFrame(data1, columns = ['Row', 'Letters'])
data2 = [[0,"a"],[1,"b"],[2,"c"]]
list1 = pd.DataFrame(data2, columns = ['Row', 'Letters'])

new_df = df[df.Letters.apply(lambda x : any(item in list1.Letters.to_numpy().tolist() for item in x))]
print(new_df)

Output

   Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)


Answer 1
1, accepted, 2020-01-31 15:48:26

Answer 2
1, 2020-01-31 15:48:51

Answer 3
1, 2020-01-31 15:52:01
