Inner join in Python pandas using a condition

Question

I'm trying to replicate in Python / pandas something that is fairly straightforward in SQL, but I'm stuck.

I want to take a dataframe with three columns:

dataframe1

    Org Des Score
0   A   B   10
1   A   B   11
2   A   B   15
3   A   C   4
4   A   C   4.5
5   A   C   6
6   A   D   100
7   A   D   110
8   A   D   130

and, for each Org-Des combination, filter out all Score values that are greater than the minimum * 1.2.

So the output table would be:

output_dataframe

    Org Des Score
0   A   B   10
1   A   B   11
3   A   C   4
4   A   C   4.5
6   A   D   100
7   A   D   110

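For the first Org-Des combination, A-B, the minimum score is 10 and (1.2 * min) = 12. Rows 0 and 1 are therefore kept because the scores 10 and 11 are < 12. Row 2 (score 15) is removed because it is > 12.

For A-C, the minimum score is 4 and (1.2 * min) = 4.8. Rows 3 and 4 are therefore kept because they are < 4.8. And so on...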

My approach

I thought I would use the following approach:

Use the groupby function to create a dataframe with the minimums by Org-Des pair:

 dataframe2 = pd.DataFrame(dataframe1.groupby(['Org','Des'])['Score'].min())

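Then do an inner join (or merge?) between dataframe1 and dataframe2, with the condition that Score < 1.2 * min for each Org-Des pair.

But I haven't been able to get this to work, for two reasons: 1) dataframe2 ends up in a funky shape, so I would need to figure out how to join or merge it with dataframe1, or transform it first and then join/merge, and 2) I don't know how to set a condition as part of a join/merge.

Is this the right approach, or is there a more Pythonic way to achieve the same goal?

Edit to reflect @Psidom's answer:

I tried the code you suggested, but it gave me an error. Here is the full code and output: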

In: import pandas as pd 
    import numpy as np 

In: df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
                        'Des': ['B','B','B','C','C','C','D','D','D'],
                        'Score': ['10','11','15','4','4.5','6','100','110','130'], })

Out:    Org Des Score
    0   A   B   10
    1   A   B   11
    2   A   B   15
    3   A   C   4
    4   A   C   4.5
    5   A   C   6
    6   A   D   100
    7   A   D   110
    8   A   D   130

In: df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
    df2

Out:        Score
    Org Des 
    A   B   10
        C   4
        D   100

In: df1 = pd.merge(df1, df2.groupby(['Org', 'Des']).min()*1.2, left_on = ['Org', 'Des'], right_index=True)
    df.loc[df1.Score_x < df1.Score_y, :]

Out: KeyError: 'Org' #It's a big error but this seems to be the relevant part.  Let me know if it would be useful to paste the whole error.

I suspect I may have mixed up df1, df2 and df? I changed the names from the original answer to match my code.

Answer 1

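You can set up the join criteria like this. For the original data frame, set the join columns to ['Org', 'Des']. For the aggregated data frame, the grouped columns become the index, so you need to set right_index to True. Then it should work as expected: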

import pandas as pd
df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
                    'Des': ['B','B','B','C','C','C','D','D','D'],
                    'Score': [10,11,15,4,4.5,6,100,110,130]})
df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())

df3 = pd.merge(df1, df2, left_on = ['Org', 'Des'], right_index=True)
df1.loc[df3.Score_x < df3.Score_y * 1.2, ]

#  Org  Des Score
#0  A   B   10.0
#1  A   B   11.0
#3  A   C   4.0
#4  A   C   4.5
#6  A   D   100.0
#7  A   D   110.0
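
As a variation on the merge above, here is a minimal sketch that avoids the MultiIndex on the aggregated frame and merges on ordinary columns instead. It assumes the scores may arrive as strings, as in the question's edit, and converts them first. The names mins, merged, result and the '_min' suffix are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Same data as in the question's edit, with Score stored as strings.
df1 = pd.DataFrame({'Org': ['A'] * 9,
                    'Des': ['B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
                    'Score': ['10', '11', '15', '4', '4.5', '6', '100', '110', '130']})

# Convert to numbers first; multiplying a string minimum by 1.2 would not work.
df1['Score'] = pd.to_numeric(df1['Score'])

# as_index=False keeps Org and Des as ordinary columns rather than a MultiIndex.
mins = df1.groupby(['Org', 'Des'], as_index=False)['Score'].min()

# Plain inner merge on the key columns; the right-hand Score becomes Score_min.
merged = df1.merge(mins, on=['Org', 'Des'], suffixes=('', '_min'))

# pandas has no conditional join, so the condition is applied as a filter afterwards.
result = merged.loc[merged['Score'] < merged['Score_min'] * 1.2, ['Org', 'Des', 'Score']]
print(result)
```

The condition is not part of the merge itself: the merge only attaches each group's minimum to its rows, and the boolean filter does the rest.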

Answer 2

I did it like this:

df[df.groupby(['Org', 'Des']).Score.apply(lambda x: x < x.min() * 1.2)]
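
In some pandas versions, groupby(...).apply can return a result whose index also carries the group keys, in which case the mask above may not line up with df. A hedged alternative sketch uses transform('min'), which returns a Series aligned with the original index. The names group_min and filtered are illustrative, and df is assumed to hold the numeric data from the question.

```python
import pandas as pd

df = pd.DataFrame({'Org': ['A'] * 9,
                   'Des': ['B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
                   'Score': [10, 11, 15, 4, 4.5, 6, 100, 110, 130]})

# transform('min') gives each row the minimum Score of its Org-Des group,
# indexed exactly like df, so it can be compared row by row.
group_min = df.groupby(['Org', 'Des'])['Score'].transform('min')

# Keep only the rows whose Score is below 1.2 times the group minimum.
filtered = df[df['Score'] < group_min * 1.2]
print(filtered)
```

This needs no merge at all, and the original row labels are preserved in the result.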
