
Filtering a dataframe using Lambda in Python

I have two data frames in Python, df and list:

import pandas as pd

data1 = [[0, ("a","b")], [1, ("d","e")], [2, ("a","e")],[3,("f", "g")],[4,("c","h")]]
df = pd.DataFrame(data1, columns = ['Row', 'Letters'])
data2 = [[0,"a"],[1,"b"],[2,"c"]]
list = pd.DataFrame(data2, columns = ['Row', 'Letters'])

I now want to filter df down to only the rows where any item in df['Letters'] is found in list['Letters'].

The built-in any function works fine for individual rows:

any(item in df["Letters"][1] for item in list['Letters'])
any(item in df["Letters"][2] for item in list['Letters'])

correctly returns False and True, respectively.

Now how do I filter the entire dataframe?

I tried the following code:

new_df = df[df.apply(lambda x : any(item in x["Letters"] for item in list), axis=1)]

which returns an empty dataframe, whereas I want only rows 0, 2 and 4.

Any help would be appreciated.

You can use the DataFrame constructor with stack, then compare using Series.isin and reduce with any on level=0:

df[pd.DataFrame(df['Letters'].tolist()).stack().isin(list_['Letters']).any(level=0)]

   Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)
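
One caveat: as far as I know, the level argument of Series.any was deprecated in pandas 1.3 and removed in pandas 2.0, so the expression above may raise a TypeError on current versions. A minimal sketch of an equivalent, assuming pandas 2.x and the same df and list_ as above, groups by the outer index level instead:

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
data2 = [[0, "a"], [1, "b"], [2, "c"]]
list_ = pd.DataFrame(data2, columns=["Row", "Letters"])

# Spread each tuple across columns, then stack into one long Series
# indexed by (original row label, position inside the tuple).
stacked = pd.DataFrame(df["Letters"].tolist(), index=df.index).stack()

# A row is kept if any of its letters appears in list_["Letters"].
# groupby(level=0).any() stands in for the older any(level=0).
mask = stacked.isin(list_["Letters"]).groupby(level=0).any()

new_df = df[mask]
print(new_df)
```

Depending on the pandas version, stack may emit a FutureWarning about its new implementation; the filtering itself should be unaffected.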

Note: I have renamed the variable list to list_, since a variable should not share its name with a built-in.
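
As for why the attempt in the question came back empty: iterating over a DataFrame yields its column labels, not its values, so the lambda was checking whether 'Row' or 'Letters' occurs in each tuple. A small sketch of the difference, assuming the data from the question with the second frame renamed to list_:

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
data2 = [[0, "a"], [1, "b"], [2, "c"]]
list_ = pd.DataFrame(data2, columns=["Row", "Letters"])

# Iterating over a DataFrame gives the column labels.
labels = [item for item in list_]
print(labels)  # ['Row', 'Letters']

# The original lambda therefore never matches any letter.
broken = df[df.apply(lambda x: any(item in x["Letters"] for item in list_), axis=1)]
print(broken.empty)  # True

# Iterating over the column itself compares the actual letters.
fixed = df[df.apply(lambda x: any(item in x["Letters"] for item in list_["Letters"]), axis=1)]
print(fixed["Row"].tolist())  # [0, 2, 4]
```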

Benchmarking on a larger dataframe:

m = pd.concat([df]*10000,ignore_index=True)
%%timeit
m[pd.DataFrame(m['Letters'].tolist()).stack().isin(list_['Letters']).any(level=0)]
#25.3 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
m.loc[~m['Letters'].apply(lambda x: set(x).isdisjoint(set(list_['Letters'])))]
#644 ms ± 8.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
m[m.Letters.apply(lambda x : any(item in list_.Letters.to_numpy().tolist() for item in x))]
#665 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
m.loc[m['Letters'].apply(lambda x: len(set(x).intersection(set(list_['Letters']))) > 0)]
#707 ms ± 56.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
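
The %%timeit cells above need IPython or Jupyter. Outside a notebook, a rough sketch with the standard-library timeit module could look like the following. The helper names stacked_isin and apply_isdisjoint are made up for illustration, the stacked variant uses groupby(level=0).any() instead of any(level=0) so that it also runs on newer pandas, and the timings it prints will differ from the numbers quoted above.

```python
import timeit

import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
data2 = [[0, "a"], [1, "b"], [2, "c"]]
list_ = pd.DataFrame(data2, columns=["Row", "Letters"])

# Same enlarged frame as in the benchmark: 50,000 rows.
m = pd.concat([df] * 10000, ignore_index=True)


def stacked_isin(frame):
    # Vectorised variant: stack the tuples, test membership, reduce per row.
    stacked = pd.DataFrame(frame["Letters"].tolist(), index=frame.index).stack()
    return frame[stacked.isin(list_["Letters"]).groupby(level=0).any()]


def apply_isdisjoint(frame):
    # Row-wise variant: one Python-level set comparison per row.
    return frame.loc[~frame["Letters"].apply(lambda x: set(x).isdisjoint(set(list_["Letters"])))]


for fn in (stacked_isin, apply_isdisjoint):
    # Average over a few runs; number=3 is an arbitrary choice to keep it quick.
    seconds = timeit.timeit(lambda: fn(m), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per loop")
```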

Here's a way using set intersection: convert each tuple to a set and check whether the length of its intersection with the lookup letters is greater than 0 (lst here is the second data frame, called list in the question):

df.loc[df['Letters'].apply(lambda x: len(set(x).intersection(set(lst['Letters']))) > 0)]

   Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)

You can also use the isdisjoint method to get the same result:

df.loc[~df['Letters'].apply(lambda x: set(x).isdisjoint(set(lst['Letters'])))]
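
Both set-based variants rebuild set(lst['Letters']) for every row. A minimal sketch that builds it once instead, assuming lst is the second data frame from the question; the name wanted is introduced here for illustration:

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
data2 = [[0, "a"], [1, "b"], [2, "c"]]
lst = pd.DataFrame(data2, columns=["Row", "Letters"])

# Build the lookup set once, outside the per-row lambda.
wanted = set(lst["Letters"])

# set.isdisjoint accepts any iterable, so each tuple can be passed directly.
new_df = df.loc[~df["Letters"].apply(lambda x: wanted.isdisjoint(x))]
print(new_df)
```

On a frame this small the difference is negligible; it should matter more as the number of rows grows.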

You can do it this way:

import pandas as pd

data1 = [[0, ("a","b")], [1, ("d","e")], [2, ("a","e")],[3,("f", "g")],[4,("c","h")]]
df = pd.DataFrame(data1, columns = ['Row', 'Letters'])
data2 = [[0,"a"],[1,"b"],[2,"c"]]
list1 = pd.DataFrame(data2, columns = ['Row', 'Letters'])

new_df = df[df.Letters.apply(lambda x : any(item in list1.Letters.to_numpy().tolist() for item in x))]
print(new_df)

Output:

   Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)
