Self-Join in Pandas producing unwanted Duplicates
I have data in a Pandas dataframe in the format:
CompanyA, CompanyB, Currency, Item, Amount
Typical rows might be:
Microsoft,Oracle,USD,Item_X,252.23
Microsoft,Oracle,USD,Item_Y,234.23
Microsoft,Oracle,EUR,Item_X,23352.00
Microsoft,Oracle,EUR,Item_Y,23344.80
Microsoft,IBM,GBP,Item_X,123.12
Microsoft,IBM,GBP,Item_Y,432.12
Oracle,IBM,EUR,Item_X,999.23
Oracle,IBM,EUR,Item_Y,234.23
Amount is a float; the others are strings.
I want to expand the Item column out so that each Item entry gets its own column, which then contains the Amount. Basically, make the data wider rather than longer.
CompanyA, CompanyB, Currency, Item_X, Item_Y
Microsoft,Oracle,USD,252.23,234.23
Microsoft,Oracle,EUR,23352.00,23344.80
... and so on.
It feels like it should be a self-join, and I've tried things like:
df = pd.merge(df, df, on=['CompanyA', 'CompanyB', 'Currency'])
This produces almost the right output, but within each CompanyA/CompanyB/Currency group it joins the rows in all 4 combinations:
Item_X -> Item_X
Item_X -> Item_Y
Item_Y -> Item_X
Item_Y -> Item_Y
Obviously I'm only interested in Item_X -> Item_Y.
In SQL you would further constrain the query, and this is where I'm getting stuck: how do I do this in Pandas? Or is there an easier approach?
Cheers!
Phil.
I think you need set_index with unstack:
df = (df.set_index(['CompanyA','CompanyB','Currency','Item'])['Amount']
        .unstack()
        .reset_index())
print(df)
Item   CompanyA  CompanyB  Currency    Item_X    Item_Y
0     Microsoft       IBM       GBP    123.12    432.12
1     Microsoft    Oracle       EUR  23352.00  23344.80
2     Microsoft    Oracle       USD    252.23    234.23
3        Oracle       IBM       EUR    999.23    234.23
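If you want to try this end to end, here is a minimal self-contained sketch that rebuilds the frame from the sample rows in the question and applies the same set_index / unstack reshaping. The names rows, keys, wide, extra and dup are mine, and the extra duplicate row is made up purely to show what happens when a key/Item pair repeats: unstack cannot reshape and raises a ValueError.

```python
import pandas as pd

# Sample rows copied from the question.
rows = [
    ("Microsoft", "Oracle", "USD", "Item_X", 252.23),
    ("Microsoft", "Oracle", "USD", "Item_Y", 234.23),
    ("Microsoft", "Oracle", "EUR", "Item_X", 23352.00),
    ("Microsoft", "Oracle", "EUR", "Item_Y", 23344.80),
    ("Microsoft", "IBM", "GBP", "Item_X", 123.12),
    ("Microsoft", "IBM", "GBP", "Item_Y", 432.12),
    ("Oracle", "IBM", "EUR", "Item_X", 999.23),
    ("Oracle", "IBM", "EUR", "Item_Y", 234.23),
]
df = pd.DataFrame(rows, columns=["CompanyA", "CompanyB", "Currency", "Item", "Amount"])

keys = ["CompanyA", "CompanyB", "Currency"]

# Move the key columns plus Item into the index, then pivot Item out to columns.
wide = df.set_index(keys + ["Item"])["Amount"].unstack().reset_index()
wide.columns.name = None  # optional: drop the leftover "Item" label on the columns
print(wide)

# Hypothetical extra row (not in the question) that repeats an existing key/Item pair.
extra = pd.DataFrame([("Oracle", "IBM", "EUR", "Item_Y", 20.00)], columns=df.columns)
dup = pd.concat([df, extra], ignore_index=True)

try:
    dup.set_index(keys + ["Item"])["Amount"].unstack()
    unstack_failed = False
except ValueError as err:
    # unstack needs each (keys, Item) combination to be unique.
    unstack_failed = True
    print("unstack failed:", err)
```

So this route is fine as long as every CompanyA/CompanyB/Currency/Item combination appears at most once.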
Or, if there are duplicates, you need pivot_table with an aggregate function:
print(df)
    CompanyA  CompanyB  Currency    Item    Amount
0  Microsoft    Oracle       USD  Item_X    252.23
1  Microsoft    Oracle       USD  Item_Y    234.23
2  Microsoft    Oracle       EUR  Item_X  23352.00
3  Microsoft    Oracle       EUR  Item_Y  23344.80
4  Microsoft       IBM       GBP  Item_X    123.12
5  Microsoft       IBM       GBP  Item_Y    432.12
6     Oracle       IBM       EUR  Item_X    999.23
7     Oracle       IBM       EUR  Item_Y     10.00  <- same values, only Amount different
8     Oracle       IBM       EUR  Item_Y     20.00  <- same values, only Amount different
df = df.pivot_table(index=['CompanyA','CompanyB','Currency'],
                    columns='Item',
                    values='Amount',
                    aggfunc='mean').reset_index()
print(df)
Item   CompanyA  CompanyB  Currency    Item_X    Item_Y
0     Microsoft       IBM       GBP    123.12    432.12
1     Microsoft    Oracle       EUR  23352.00  23344.80
2     Microsoft    Oracle       USD    252.23    234.23
3        Oracle       IBM       EUR    999.23     15.00
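To answer the self-join part of the question directly: the extra SQL condition can be mimicked by filtering the merged frame on the two Item columns. This is only a sketch on the sample rows from the question; the names rows, keys, pairs, mask and the _left/_right suffixes are my own choices for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
rows = [
    ("Microsoft", "Oracle", "USD", "Item_X", 252.23),
    ("Microsoft", "Oracle", "USD", "Item_Y", 234.23),
    ("Microsoft", "Oracle", "EUR", "Item_X", 23352.00),
    ("Microsoft", "Oracle", "EUR", "Item_Y", 23344.80),
    ("Microsoft", "IBM", "GBP", "Item_X", 123.12),
    ("Microsoft", "IBM", "GBP", "Item_Y", 432.12),
    ("Oracle", "IBM", "EUR", "Item_X", 999.23),
    ("Oracle", "IBM", "EUR", "Item_Y", 234.23),
]
df = pd.DataFrame(rows, columns=["CompanyA", "CompanyB", "Currency", "Item", "Amount"])

keys = ["CompanyA", "CompanyB", "Currency"]

# Unconstrained self-join: within each key group, every row pairs with every row.
pairs = pd.merge(df, df, on=keys, suffixes=("_left", "_right"))

# The pandas counterpart of the extra SQL condition: keep only Item_X -> Item_Y.
mask = (pairs["Item_left"] == "Item_X") & (pairs["Item_right"] == "Item_Y")
wide = (
    pairs.loc[mask, keys + ["Amount_left", "Amount_right"]]
         .rename(columns={"Amount_left": "Item_X", "Amount_right": "Item_Y"})
         .reset_index(drop=True)
)
print(wide)
```

You could equally filter df down to the Item_X rows and the Item_Y rows before merging, which avoids building the unwanted pairs in the first place. Either way, the reshaping approaches above are shorter and do not need one join per item, so I would only use the merge if you really want to stay close to the SQL.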