Filtering a dataframe using Lambda in Python
I have two data frames in Python: df and list:
data1 = [[0, ("a","b")], [1, ("d","e")], [2, ("a","e")],[3,("f", "g")],[4,("c","h")]]
df = pd.DataFrame(data1, columns = ['Row', 'Letters'])
data2 = [[0,"a"],[1,"b"],[2,"c"]]
list = pd.DataFrame(data2, columns = ['Row', 'Letters'])
I now want to filter df down to only those rows where any item in df['Letters'] is found in list['Letters'].
The built-in any function works fine for individual rows:
any(item in df["Letters"][1] for item in list['Letters'])
any(item in df["Letters"][2] for item in list['Letters'])
correctly returns False and True, respectively.
Now how do I filter down the entire dataframe?
I tried the following code:
new_df = df[df.apply(lambda x : any(item in x["Letters"] for item in list), axis=1)]
which returns an empty dataframe, when I want it to return only rows 0, 2 and 4.
Any help would be appreciated.
You can use a dataframe constructor with stack, then compare using series.isin with any for level=0:
df[pd.DataFrame(df['Letters'].tolist()).stack().isin(list_['Letters']).any(level=0)]
   Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)
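On recent pandas versions the expression above may fail, because (as far as I know) the level argument of any was deprecated in pandas 1.3 and removed in 2.0. A minimal sketch of the same idea with groupby(level=0).any() in its place follows; the explode variant is a swapped-in alternative to the constructor plus stack, not part of the original answer.

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
data2 = [[0, "a"], [1, "b"], [2, "c"]]
list_ = pd.DataFrame(data2, columns=["Row", "Letters"])

# Same steps as the answer: one value per (row, tuple position), test membership,
# then collapse back to one boolean per original row.
stacked = pd.DataFrame(df["Letters"].tolist()).stack()
mask = stacked.isin(list_["Letters"]).groupby(level=0).any()
result = df[mask]

# Alternative (an assumption, not the answerer's code): explode the tuples directly.
mask_explode = df["Letters"].explode().isin(list_["Letters"]).groupby(level=0).any()
result_explode = df[mask_explode]

print(result)
```

Both masks are indexed by the original row labels, so they can be passed straight to df[...].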
Note: I have renamed the variable that holds the second dataframe from list to list_, since you should not use a variable name that is the same as a builtin function.
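To illustrate why the rename matters, here is a small sketch of what shadowing does. The function name shadowing_demo is hypothetical and only there to keep the shadowed name local.

```python
import pandas as pd


def shadowing_demo():
    # Inside this function, `list` refers to the DataFrame, not the built-in type.
    list = pd.DataFrame([[0, "a"], [1, "b"], [2, "c"]], columns=["Row", "Letters"])
    try:
        list("abc")  # with the built-in, this would build ['a', 'b', 'c']
    except TypeError as exc:
        return str(exc)
    return "no error"


message = shadowing_demo()
print(message)
```

Once the name is rebound, every later call to list(...) in that scope hits the DataFrame instead of the built-in, which is why list_ is used in the rest of this answer.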
Benchmarking for a larger dataframe:
m = pd.concat([df]*10000,ignore_index=True)
%%timeit
m[pd.DataFrame(m['Letters'].tolist()).stack().isin(list_['Letters']).any(level=0)]
#25.3 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
m.loc[~m['Letters'].apply(lambda x: set(x).isdisjoint(set(list_['Letters'])))]
#644 ms ± 8.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m[m.Letters.apply(lambda x : any(item in list_.Letters.to_numpy().tolist() for item in x))]
#665 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m.loc[m['Letters'].apply(lambda x: len(set(x).intersection(set(list_['Letters']))) > 0)]
#707 ms ± 56.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
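The %%timeit cells above only run in IPython or Jupyter. As a rough sketch, the comparison can be repeated in a plain script with the standard timeit module. The function names vectorised and with_apply are hypothetical, groupby(level=0).any() stands in for any(level=0), and no timings are quoted because they depend on the machine and pandas version.

```python
import timeit

import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
data2 = [[0, "a"], [1, "b"], [2, "c"]]
list_ = pd.DataFrame(data2, columns=["Row", "Letters"])

m = pd.concat([df] * 10000, ignore_index=True)
allowed = set(list_["Letters"])


def vectorised():
    # Explode the tuples, test membership once, regroup per original row.
    mask = m["Letters"].explode().isin(list_["Letters"]).groupby(level=0).any()
    return m[mask]


def with_apply():
    # Row-by-row Python call, as in the apply-based variants above.
    return m[~m["Letters"].apply(allowed.isdisjoint)]


for fn in (vectorised, with_apply):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per run")
```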
Here's a way using set intersection: we convert each tuple into a set and check whether the len of its intersection with the set of letters is > 0 (here lst is the second dataframe, called list in the question):
df.loc[df['Letters'].apply(lambda x: len(set(x).intersection(set(lst['Letters']))) > 0)]
   Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)
You can also use the isdisjoint method to get the result:
df.loc[~df['Letters'].apply(lambda x: set(x).isdisjoint(set(lst['Letters'])))]
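The one-liner above rebuilds set(lst['Letters']) for every row. A minimal sketch of a variant that builds the set once and passes its bound isdisjoint method to apply is shown here. The name allowed is an assumption, and list_ is used for the second dataframe.

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
data2 = [[0, "a"], [1, "b"], [2, "c"]]
list_ = pd.DataFrame(data2, columns=["Row", "Letters"])

# Build the lookup set a single time instead of once per row.
allowed = set(list_["Letters"])

# set.isdisjoint accepts any iterable, so each tuple can be passed as-is.
no_overlap = df["Letters"].apply(allowed.isdisjoint)
result = df.loc[~no_overlap]
print(result)
```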
You can do it this way:
import pandas as pd
data1 = [[0, ("a","b")], [1, ("d","e")], [2, ("a","e")],[3,("f", "g")],[4,("c","h")]]
df = pd.DataFrame(data1, columns = ['Row', 'Letters'])
data2 = [[0,"a"],[1,"b"],[2,"c"]]
list1 = pd.DataFrame(data2, columns = ['Row', 'Letters'])
new_df = df[df.Letters.apply(lambda x : any(item in list1.Letters.to_numpy().tolist() for item in x))]
print(new_df)
Output:
   Row Letters
0    0  (a, b)
2    2  (a, e)
4    4  (c, h)
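For reference, the attempt in the question most likely returns an empty frame because iterating over a DataFrame yields its column labels ('Row' and 'Letters'), not the values in the Letters column. The sketch below shows the difference, using list_ for the second dataframe; the names mask_wrong and mask_fixed are only for illustration.

```python
import pandas as pd

data1 = [[0, ("a", "b")], [1, ("d", "e")], [2, ("a", "e")], [3, ("f", "g")], [4, ("c", "h")]]
df = pd.DataFrame(data1, columns=["Row", "Letters"])
data2 = [[0, "a"], [1, "b"], [2, "c"]]
list_ = pd.DataFrame(data2, columns=["Row", "Letters"])

# Iterating over the DataFrame itself gives the column labels.
labels = [item for item in list_]
print(labels)

# Original logic: tests whether 'Row' or 'Letters' is inside each tuple.
mask_wrong = df.apply(lambda x: any(item in x["Letters"] for item in list_), axis=1)

# Iterating over the Letters column gives the actual values to test.
mask_fixed = df.apply(lambda x: any(item in x["Letters"] for item in list_["Letters"]), axis=1)

new_df = df[mask_fixed]
print(new_df)
```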