
Self-Join in Pandas producing unwanted Duplicates

I have data in a Pandas dataframe in the format:

CompanyA, CompanyB, Currency, Item, Amount

Typical rows might be:

Microsoft,Oracle,USD,Item_X,252.23
Microsoft,Oracle,USD,Item_Y,234.23
Microsoft,Oracle,EUR,Item_X,23352.00
Microsoft,Oracle,EUR,Item_Y,23344.80
Microsoft,IBM,GBP,Item_X,123.12
Microsoft,IBM,GBP,Item_Y,432.12
Oracle,IBM,EUR,Item_X,999.23
Oracle,IBM,EUR,Item_Y,234.23

Amount is a float; the others are strings.

I want to expand the Item column out so each Item entry gets its own column, which now contains the Amount. Basically, make the data wider rather than longer.

CompanyA, CompanyB, Currency, Item_X, Item_Y

Microsoft,Oracle,USD,252.23,234.23
Microsoft,Oracle,EUR,23352.00,23344.80
... and so on.

It feels like it should be a self-join, and I've tried things like:

df = pd.merge(df, df, on=['CompanyA', 'CompanyB', 'Currency'])

This produces almost the right output, but it yields four joined rows for each CompanyA/CompanyB/Currency combination:

Item_X -> Item_X
Item_X -> Item_Y
Item_Y -> Item_X
Item_Y -> Item_Y

Obviously I'm only interested in Item_X -> Item_Y.

In SQL you would further constrain the query, and this is where I'm getting stuck: how do I do this in Pandas? Or is there an easier approach?

Cheers!

Phil.
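
For reference, the SQL-style constraint described in the question can be sketched in pandas by filtering the merged frame. This is only an illustration under assumptions: the suffixes `_left`/`_right` and the names `keys`, `merged`, `mask` and `wide` are chosen here and do not come from the original post, and the approach hard-codes the two Item values.

```python
import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame(
    [
        ("Microsoft", "Oracle", "USD", "Item_X", 252.23),
        ("Microsoft", "Oracle", "USD", "Item_Y", 234.23),
        ("Microsoft", "Oracle", "EUR", "Item_X", 23352.00),
        ("Microsoft", "Oracle", "EUR", "Item_Y", 23344.80),
        ("Microsoft", "IBM", "GBP", "Item_X", 123.12),
        ("Microsoft", "IBM", "GBP", "Item_Y", 432.12),
        ("Oracle", "IBM", "EUR", "Item_X", 999.23),
        ("Oracle", "IBM", "EUR", "Item_Y", 234.23),
    ],
    columns=["CompanyA", "CompanyB", "Currency", "Item", "Amount"],
)

keys = ["CompanyA", "CompanyB", "Currency"]

# Unconstrained self-join: every row pairs with every row sharing the keys.
# Suffixes are an illustrative choice to tell the two sides apart.
merged = pd.merge(df, df, on=keys, suffixes=("_left", "_right"))

# Equivalent of the extra SQL predicate: keep only Item_X -> Item_Y pairs.
mask = (merged["Item_left"] == "Item_X") & (merged["Item_right"] == "Item_Y")
wide = (
    merged.loc[mask, keys + ["Amount_left", "Amount_right"]]
    .rename(columns={"Amount_left": "Item_X", "Amount_right": "Item_Y"})
    .reset_index(drop=True)
)
print(wide)
```

This works for exactly two known items, but it does not generalise to an arbitrary number of Item values, which is why a reshape is usually the simpler route.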

I think you need set_index with unstack:

# Parentheses are required so the chained calls can span several lines.
df = (df.set_index(['CompanyA', 'CompanyB', 'Currency', 'Item'])['Amount']
        .unstack()
        .reset_index())

print (df)
Item   CompanyA CompanyB Currency    Item_X    Item_Y
0     Microsoft      IBM      GBP    123.12    432.12
1     Microsoft   Oracle      EUR  23352.00  23344.80
2     Microsoft   Oracle      USD    252.23    234.23
3        Oracle      IBM      EUR    999.23    234.23
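
A self-contained sketch of the same reshape, using a subset of the question's rows, may help when trying it out. It also illustrates why duplicates matter: `unstack` raises a `ValueError` when the index contains duplicate entries. The names `keys`, `wide`, `dup` and `unstack_failed` are illustrative, and the duplicated row is added here purely to trigger the error.

```python
import pandas as pd

# A subset of the question's rows, enough to show the reshape.
df = pd.DataFrame(
    {
        "CompanyA": ["Microsoft", "Microsoft", "Oracle", "Oracle"],
        "CompanyB": ["IBM", "IBM", "IBM", "IBM"],
        "Currency": ["GBP", "GBP", "EUR", "EUR"],
        "Item": ["Item_X", "Item_Y", "Item_X", "Item_Y"],
        "Amount": [123.12, 432.12, 999.23, 234.23],
    }
)

keys = ["CompanyA", "CompanyB", "Currency"]

# Move the key columns plus Item into the index, then pivot Item into columns.
wide = df.set_index(keys + ["Item"])["Amount"].unstack().reset_index()
print(wide)

# Illustrative only: repeat the last row so one key/Item pair appears twice.
dup = pd.concat([df, df.iloc[[3]]], ignore_index=True)
try:
    dup.set_index(keys + ["Item"])["Amount"].unstack()
    unstack_failed = False
except ValueError:
    # unstack cannot reshape an index that contains duplicate entries.
    unstack_failed = True
print(unstack_failed)
```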

Or, if there are duplicates, you need pivot_table with an aggregate function:

print (df)
    CompanyA CompanyB Currency    Item    Amount
0  Microsoft   Oracle      USD  Item_X    252.23
1  Microsoft   Oracle      USD  Item_Y    234.23
2  Microsoft   Oracle      EUR  Item_X  23352.00
3  Microsoft   Oracle      EUR  Item_Y  23344.80
4  Microsoft      IBM      GBP  Item_X    123.12
5  Microsoft      IBM      GBP  Item_Y    432.12
6     Oracle      IBM      EUR  Item_X    999.23
7     Oracle      IBM      EUR  Item_Y     10.00 <-same values, only Amount different
8     Oracle      IBM      EUR  Item_Y     20.00 <-same values, only Amount different


df = df.pivot_table(index=['CompanyA','CompanyB','Currency'],
                    columns='Item', 
                    values='Amount', 
                    aggfunc='mean').reset_index()
print (df)
Item   CompanyA CompanyB Currency    Item_X    Item_Y
0     Microsoft      IBM      GBP    123.12    432.12
1     Microsoft   Oracle      EUR  23352.00  23344.80
2     Microsoft   Oracle      USD    252.23    234.23
3        Oracle      IBM      EUR    999.23     15.00
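
The `pivot_table` variant can be tried with the runnable sketch below, built on a subset of the rows above that includes the duplicated Item_Y entry. The final `rename_axis(None, axis=1)` call is an optional extra that is not part of the original answer; it drops the `Item` label that otherwise appears above the row index in the printed output. The name `wide` is illustrative.

```python
import pandas as pd

# Subset of the rows above; Oracle/IBM/EUR has two Item_Y amounts.
df = pd.DataFrame(
    {
        "CompanyA": ["Microsoft", "Microsoft", "Oracle", "Oracle", "Oracle"],
        "CompanyB": ["IBM", "IBM", "IBM", "IBM", "IBM"],
        "Currency": ["GBP", "GBP", "EUR", "EUR", "EUR"],
        "Item": ["Item_X", "Item_Y", "Item_X", "Item_Y", "Item_Y"],
        "Amount": [123.12, 432.12, 999.23, 10.00, 20.00],
    }
)

wide = (
    df.pivot_table(
        index=["CompanyA", "CompanyB", "Currency"],
        columns="Item",
        values="Amount",
        aggfunc="mean",  # duplicates are averaged; 'sum', 'first', etc. also work
    )
    .reset_index()
    .rename_axis(None, axis=1)  # optional: remove the 'Item' columns label
)
print(wide)
```

Which aggregate to use depends on what the duplicates mean in the data: `mean` averages them as in the answer, while `sum` would total them.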
