Subset a pandas DataFrame on multiple columns based on values from another DataFrame

Question

I have two DataFrames:

import pandas as pd
points = pd.DataFrame({'player':['a','b','c','d','e'],'points':[2,5,3,6,1]})
matches = pd.DataFrame({'p1':['a','c','e'], 'p2':['c', 'b', 'd']})

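I want to keep only those rows of matches where both p1 and p2 have more than 2 points. Right now I first merge matches with points on p1 and player, then merge the resulting DataFrame with points again on p2 and player. After that, I apply the filter to the two points columns of the result.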

new_df = pd.merge(matches, points, how = 'left', left_on = 'p1', right_on = 'player')
new_df = pd.merge(new_df, points, how = 'left', left_on = 'p2', right_on = 'player')
new_df = new_df[(new_df.points_x >2) & (new_df.points_y >2)]

This gives me what I need, but is there a better, more efficient way to do it?

Answer 1

In this case I would avoid the joins and write it like this:

scorers = points.query('points > 2').player
matches.query('p1 in @scorers and p2 in @scorers')

I think it's more readable.

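Benchmarking on such a small example feels a bit silly, but on my machine this approach runs in 2.99 ms on average, while the original approach takes 4.45 ms. It would be interesting to know whether it scales better or worse.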
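One way to explore the scaling question is to time both approaches on larger synthetic data. The sketch below is only an illustration: the table sizes, the random seed, and the names big_points, big_matches, with_merge and with_query are assumptions introduced here, and timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; the sizes and seed below are arbitrary assumptions.
rng = np.random.default_rng(0)
n_players, n_matches = 10_000, 100_000
players = np.arange(n_players)
big_points = pd.DataFrame({'player': players,
                           'points': rng.integers(0, 10, size=n_players)})
big_matches = pd.DataFrame({'p1': rng.choice(players, size=n_matches),
                            'p2': rng.choice(players, size=n_matches)})


def with_merge():
    # The double-merge approach from the question.
    df = pd.merge(big_matches, big_points, how='left',
                  left_on='p1', right_on='player')
    df = pd.merge(df, big_points, how='left',
                  left_on='p2', right_on='player')
    return df[(df.points_x > 2) & (df.points_y > 2)][['p1', 'p2']]


def with_query():
    # The query-based approach from this answer.
    scorers = big_points.query('points > 2').player
    return big_matches.query('p1 in @scorers and p2 in @scorers')


for fn in (with_merge, with_query):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f'{fn.__name__}: {seconds * 1000:.2f} ms per run')
```

Both functions should select the same rows, so any difference in the printed timings reflects the method rather than the result.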

I don't know whether there are other micro-optimizations you could apply to this code, such as converting scorers to a set.

If you don't like the query syntax:

scorers = points[points.points > 2].player
matches[matches.p1.isin(scorers) & matches.p2.isin(scorers)]

This also performs better, taking about 1.36 ms.
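The set conversion floated earlier as a possible micro-optimization might look like the sketch below, written with isin. The name scorer_set is introduced here for illustration. Since isin does its own lookup work internally, whether the set actually helps is something to measure rather than assume.

```python
import pandas as pd

points = pd.DataFrame({'player': ['a', 'b', 'c', 'd', 'e'],
                       'points': [2, 5, 3, 6, 1]})
matches = pd.DataFrame({'p1': ['a', 'c', 'e'], 'p2': ['c', 'b', 'd']})

# Build the set of qualifying players once (hypothetical name).
scorer_set = set(points.loc[points.points > 2, 'player'])

# Series.isin accepts any iterable of values, including a set.
res_set = matches[matches.p1.isin(scorer_set) & matches.p2.isin(scorer_set)]
print(res_set)
```

On the sample data this keeps the single match between c and b, the same row as the other variants.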

Answer 2

As an alternative, you can build a Series that maps players to points, then use pd.Series.map on each column of matches:

s = points.set_index('player')['points']
res = matches.loc[matches.apply(lambda x: x.map(s)).gt(2).all(axis=1)]

print(res)

  p1 p2
1  c  b
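To see why only that row survives, it can help to split the one-liner into its intermediate steps. The following is a minimal sketch on the same sample data; the names mapped and mask are introduced here for illustration.

```python
import pandas as pd

points = pd.DataFrame({'player': ['a', 'b', 'c', 'd', 'e'],
                       'points': [2, 5, 3, 6, 1]})
matches = pd.DataFrame({'p1': ['a', 'c', 'e'], 'p2': ['c', 'b', 'd']})

# Lookup Series: index is the player, values are the points.
s = points.set_index('player')['points']

# Replace every player in matches with that player's points.
# p1 -> [2, 3, 1], p2 -> [3, 5, 6]
mapped = matches.apply(lambda col: col.map(s))
print(mapped)

# A row qualifies only if both mapped values exceed 2.
mask = mapped.gt(2).all(axis=1)
print(matches.loc[mask])
```

A player who does not appear in points would map to NaN, and NaN > 2 is False, so matches involving unknown players are dropped by this approach.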


Answer 1
2 accepted 2018-11-30 08:33:22

Answer 2
1 2018-11-30 10:03:11
