Do an inner join in python pandas using criteria

I'm trying to replicate in python/pandas what would be fairly straightforward in SQL, but am stuck.

I want to take a data frame with three columns:

dataframe1

    Org Des Score
0   A   B   10
1   A   B   11
2   A   B   15
3   A   C   4
4   A   C   4.5
5   A   C   6
6   A   D   100
7   A   D   110
8   A   D   130

And filter out all Score values that are greater than 1.2 * the minimum Score for each Org-Des combination.

So the output table would be:

output_dataframe

    Org Des Score
0   A   B   10
1   A   B   11
3   A   C   4
4   A   C   4.5
6   A   D   100
7   A   D   110

For the first Org-Des combo, AB, the min Score is 10 and (1.2 * min) = 12. So rows 0 and 1 would be preserved because Scores 10 and 11 are < 12. Row 2 would be eliminated because its Score, 15, is > 12.

For AC, the min Score is 4 and (1.2 * min) = 4.8. So rows 3 and 4 are preserved because Scores 4 and 4.5 are < 4.8. And so on...
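The per-group cutoffs described above can be checked with a short sketch. It assumes the Score column holds numbers rather than strings; `cutoffs` is just an illustrative name, not something from the original post.

```python
import pandas as pd

dataframe1 = pd.DataFrame({
    'Org': ['A'] * 9,
    'Des': ['B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
    'Score': [10, 11, 15, 4, 4.5, 6, 100, 110, 130],  # assumed numeric
})

# Cutoff for each Org-Des pair: 1.2 times the smallest Score in that pair
cutoffs = dataframe1.groupby(['Org', 'Des'])['Score'].min() * 1.2
print(cutoffs)  # one cutoff per (Org, Des) pair
```

Rows whose Score falls below the cutoff for their own Org-Des pair are the ones to keep.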

My approach

I thought I'd use the following approach:

  1. Use a groupby function to create a dataframe with the mins by Org-Des pair:

     dataframe2 = pd.DataFrame(dataframe1.groupby(['Org','Des'])['Score'].min()) 
  2. Then do an inner join (or a merge?) between dataframe1 and dataframe2 with the criteria that the Score < 1.2 * min for each Org-Des pair type.

But I haven't been able to get this to work, for two reasons: 1) dataframe2 ends up being a funky shape, so I would need to figure out how to join or merge it with dataframe1, or transform it and then join/merge; and 2) I don't know how to set criteria as part of a join/merge.

Is this the right approach or is there a more pythonic way to achieve the same goal?

Edit to reflect @Psidom's answer:

I tried the code you suggested and it gave me an error. Here's the full code and output:

In: import pandas as pd 
    import numpy as np 

In: df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
                        'Des': ['B','B','B','C','C','C','D','D','D'],
                        'Score': ['10','11','15','4','4.5','6','100','110','130'], })

Out:    Org Des Score
    0   A   B   10
    1   A   B   11
    2   A   B   15
    3   A   C   4
    4   A   C   4.5
    5   A   C   6
    6   A   D   100
    7   A   D   110
    8   A   D   130

In: df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
    df2

Out:        Score
    Org Des 
    A   B   10
        C   4
        D   100

In: df1 = pd.merge(df1, df2.groupby(['Org', 'Des']).min()*1.2, left_on = ['Org', 'Des'], right_index=True)
    df.loc[df1.Score_x < df1.Score_y, :]

Out: KeyError: 'Org' #It's a big error but this seems to be the relevant part.  Let me know if it would be useful to paste the whole error.  

I suspect I may have df1, df2 and df mixed up? I changed the names from the original answer post to match my code.

You can set up the join criteria like this. For the original data frame, set the join columns to ['Org', 'Des']; for the aggregated data frame, the grouped columns become the index, so you will need to set right_index to True. Then it should work as expected:

import pandas as pd
df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
                    'Des': ['B','B','B','C','C','C','D','D','D'],
                    'Score': [10,11,15,4,4.5,6,100,110,130]})
df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())

df3 = pd.merge(df1, df2, left_on = ['Org', 'Des'], right_index=True)
df1.loc[df3.Score_x < df3.Score_y * 1.2, ]

#  Org  Des Score
#0  A   B   10.0
#1  A   B   11.0
#3  A   C   4.0
#4  A   C   4.5
#6  A   D   100.0
#7  A   D   110.0
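Note that the question's version of df1 stores Score as strings, whereas the snippet above uses numbers. String scores compare lexicographically and cannot be multiplied by 1.2, so they need converting first. The following is a minimal sketch of a variation on the same merge, assuming the string data from the question; it keeps Org and Des as ordinary columns so that `right_index` is not needed. The column name `MinScore` is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Org': ['A'] * 9,
    'Des': ['B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
    # Strings, as in the question's version of the data
    'Score': ['10', '11', '15', '4', '4.5', '6', '100', '110', '130'],
})

# Convert to numbers before taking minimums or comparing
df1['Score'] = pd.to_numeric(df1['Score'])

# as_index=False leaves Org and Des as regular columns in the aggregate
mins = (df1.groupby(['Org', 'Des'], as_index=False)['Score'].min()
           .rename(columns={'Score': 'MinScore'}))  # hypothetical column name

# Plain inner join on the two key columns, then apply the criterion as a filter
merged = df1.merge(mins, on=['Org', 'Des'], how='inner')
result = merged.loc[merged['Score'] < merged['MinScore'] * 1.2,
                    ['Org', 'Des', 'Score']]
print(result)
```

Unlike a SQL join, `merge` only matches on equality, so the inequality is applied as a separate filtering step after the join. The rows in `result` carry the merged frame's index rather than df1's.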

I did it like this:

df[df.groupby(['Org', 'Des']).Score.apply(lambda x: x < x.min() * 1.2)]
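Depending on the pandas version, the result of `apply` on a grouped column may carry the group keys in its index, in which case it cannot be used directly as a boolean mask on `df`. A sketch of the same idea that sidesteps this, assuming numeric scores, uses `transform('min')`, which returns a Series aligned with the original index. The name `group_min` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Org': ['A'] * 9,
    'Des': ['B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
    'Score': [10, 11, 15, 4, 4.5, 6, 100, 110, 130],  # assumed numeric
})

# For every row, the minimum Score of that row's Org-Des group,
# indexed exactly like df itself
group_min = df.groupby(['Org', 'Des'])['Score'].transform('min')

# Keep rows whose Score is below 1.2 times their group's minimum
result = df[df['Score'] < group_min * 1.2]
print(result)
```

This avoids the merge entirely and keeps the original row labels of `df`.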
