

Subset pandas dataframe on multiple columns based on values from another dataframe

I have two dataframes:

import pandas as pd
points = pd.DataFrame({'player':['a','b','c','d','e'],'points':[2,5,3,6,1]})
matches = pd.DataFrame({'p1':['a','c','e'], 'p2':['c', 'b', 'd']})

I want to retain only those rows of matches where both p1 and p2 have more than 2 points. Right now I first merge matches and points on p1 and player, then merge the resulting dataframe with points on p2 and player. After this, I apply a filter on both points columns of the resulting dataframe.

new_df = pd.merge(matches, points, how = 'left', left_on = 'p1', right_on = 'player')
new_df = pd.merge(new_df, points, how = 'left', left_on = 'p2', right_on = 'player')
new_df = new_df[(new_df.points_x >2) & (new_df.points_y >2)]

This gives me what I need, but I was wondering whether there is a better, more efficient way to do this.

I would avoid the joins in this case and write it like this:

scorers = points.query('points > 2').player
matches.query('p1 in @scorers and p2 in @scorers')

I think it's more readable.

It feels a little silly to benchmark on such a small example, but on my machine this method runs in 2.99 ms on average, while your original method takes 4.45 ms. It would be interesting to find out whether it also scales better.

I don't know whether there are other micro-optimizations you could make to this code, such as converting scorers to a set.
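As a rough illustration of the set idea, here is a minimal sketch. It uses Series.isin rather than query, and the names scorer_set and res_set are made up for this example. Whether it is actually faster than passing a Series is not something the answer measured, so treat it as something to try rather than a guaranteed win.

```python
import pandas as pd

points = pd.DataFrame({'player': ['a', 'b', 'c', 'd', 'e'],
                       'points': [2, 5, 3, 6, 1]})
matches = pd.DataFrame({'p1': ['a', 'c', 'e'], 'p2': ['c', 'b', 'd']})

# Collect the qualifying players into a plain Python set (hypothetical name).
scorer_set = set(points.loc[points.points > 2, 'player'])

# Series.isin accepts a set, so the filter itself is unchanged.
res_set = matches[matches.p1.isin(scorer_set) & matches.p2.isin(scorer_set)]
print(res_set)
```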

If you don't like the query syntax:

scorers = points[points.points > 2].player
matches[matches.p1.isin(scorers) & matches.p2.isin(scorers)]

This performs better as well, taking about 1.36 ms.
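The timings quoted here come from a five-row example, so they say little about scaling. One way to check is to time the three approaches on larger synthetic data. The sketch below is only an assumption about how such a test could look: the sizes, the random data, and the function names are invented for illustration, and the numbers it prints will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_players, n_matches = 10_000, 100_000  # arbitrary sizes, chosen for illustration

players = [f"p{i}" for i in range(n_players)]
points = pd.DataFrame({"player": players,
                       "points": rng.integers(0, 10, n_players)})
matches = pd.DataFrame({"p1": rng.choice(players, n_matches),
                        "p2": rng.choice(players, n_matches)})

def with_merges():
    # The original approach from the question: two left merges, then a filter.
    df = pd.merge(matches, points, how="left", left_on="p1", right_on="player")
    df = pd.merge(df, points, how="left", left_on="p2", right_on="player")
    return df.loc[(df.points_x > 2) & (df.points_y > 2), ["p1", "p2"]]

def with_query():
    scorers = points.query("points > 2").player
    return matches.query("p1 in @scorers and p2 in @scorers")

def with_isin():
    scorers = points[points.points > 2].player
    return matches[matches.p1.isin(scorers) & matches.p2.isin(scorers)]

# Sanity check: all three approaches should keep the same rows.
expected = with_merges().to_numpy().tolist()
assert with_query().to_numpy().tolist() == expected
assert with_isin().to_numpy().tolist() == expected

for fn in (with_merges, with_query, with_isin):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per run")
```

Note that the merge version relies on player being unique in points; with duplicate players the left merges would produce extra rows.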

As an alternative, you can construct a series mapping players to points, then use pd.Series.map on each column of matches:

s = points.set_index('player')['points']
res = matches.loc[matches.apply(lambda x: x.map(s)).gt(2).all(axis=1)]


  p1 p2
1  c  b
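To see why only row 1 survives, it can help to split the one-liner into its intermediate steps. The variable names mapped and keep below are introduced only for this illustration.

```python
import pandas as pd

points = pd.DataFrame({'player': ['a', 'b', 'c', 'd', 'e'],
                       'points': [2, 5, 3, 6, 1]})
matches = pd.DataFrame({'p1': ['a', 'c', 'e'], 'p2': ['c', 'b', 'd']})

# Lookup series: index is the player name, values are the points.
s = points.set_index('player')['points']

# Replace every player name in matches with that player's points.
mapped = matches.apply(lambda col: col.map(s))
print(mapped)
#    p1  p2
# 0   2   3
# 1   3   5
# 2   1   6

# A row is kept only if both of its columns exceed 2.
keep = mapped.gt(2).all(axis=1)
print(keep.tolist())  # [False, True, False]

res = matches.loc[keep]
print(res)
```

Rows 0 and 2 are dropped because p1 has only 2 and 1 points respectively, even though p2 qualifies in both.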

