
Pandas merge with duplicated key - removing duplicated rows or preventing their creation

I have two dataframes that I want to merge, but my key column contains duplicates. The dataframes look like this:

Name,amount,id
John,500.25,GH10
Helen,1250.00,GH11
Adam,432.54,GH11
Sarah,567.12,GH12

Category,amount,id
Food,500.25,GH10
Travel,1250.00,GH11
Food,432.54,GH11

And I'm merging them with an outer join so that everything is included in the merged table:

merged_table = pd.merge(df1,df2,on="id",how='outer')

And my output is:

Name,amount_x,id,category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Helen,1250.00,GH11,Food,432.54
Adam,432.54,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12

However, my desired output is:

Name,amount_x,id,category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12

So what's happening here is that each record with a duplicated key is matched with every record in the other table, so the output has 4 rows instead of 2, and the two in the middle (rows 2 and 3) are unwanted.

So the solutions that come to my mind:

  1. Somehow preventing the creation of duplicated rows. I can't use drop_duplicates() before the merge, because then I would exclude some of the rows with a doubled key. The other column, amount, should have the same 2 values in both tables, but there is a very small possibility that they may differ.

  2. Using merge the same way I'm doing it now, but then dropping rows 2 and 3 and keeping rows 1 and 4 whenever the ID is duplicated. Since the matching works so that the first row in df1 is connected with the first row in df2 and then with the second row in df2, and then the second row from df1 is connected with the first and second rows in df2, rows 1 and 4 are the correct ones.
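Option 1 can be sketched by merging on both id and amount, so rows pair up only when both keys agree and no cross-product appears. This is a sketch, not from the original post, and it rebuilds the frames from the CSV above; note it silently fails to pair rows whose amounts differ even slightly, which is exactly the small risk mentioned.

```python
import pandas as pd

# Frames reconstructed from the question's CSV data
df1 = pd.DataFrame({
    "Name": ["John", "Helen", "Adam", "Sarah"],
    "amount": [500.25, 1250.00, 432.54, 567.12],
    "id": ["GH10", "GH11", "GH11", "GH12"],
})
df2 = pd.DataFrame({
    "Category": ["Food", "Travel", "Food"],
    "amount": [500.25, 1250.00, 432.54],
    "id": ["GH10", "GH11", "GH11"],
})

# Merge on (id, amount): each GH11 row now has exactly one partner,
# and since amount is a join key there are no amount_x/amount_y suffixes.
merged = pd.merge(df1, df2, on=["id", "amount"], how="outer")
print(merged)
```

This gives the four desired rows directly, at the cost of depending on the amounts matching exactly (floating-point values that differ in any digit will not join).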

I'm thinking of using .apply() and writing some lambda function here, but I can't really wrap my head around how it should be written correctly.
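Option 2 can also be sketched without .apply(): within each id, merge emits the cartesian block in left-row-major order, so the "correct" rows (1 and 4 in the question's example) sit on the block's diagonal. This sketch assumes each duplicated id occurs the same number of times in both frames; the frames are rebuilt from the question's data.

```python
import pandas as pd

# Frames reconstructed from the question's CSV data
df1 = pd.DataFrame({
    "Name": ["John", "Helen", "Adam", "Sarah"],
    "amount": [500.25, 1250.00, 432.54, 567.12],
    "id": ["GH10", "GH11", "GH11", "GH12"],
})
df2 = pd.DataFrame({
    "Category": ["Food", "Travel", "Food"],
    "amount": [500.25, 1250.00, 432.54],
    "id": ["GH10", "GH11", "GH11"],
})

merged = pd.merge(df1, df2, on="id", how="outer")

# An id occurring n times on both sides yields an n x n block whose
# diagonal rows sit at positions 0, n+1, 2(n+1), ... within the group.
pos = merged.groupby("id").cumcount()
n = merged.groupby("id")["id"].transform("size").pow(0.5).round().astype(int)
result = merged[pos % (n + 1) == 0].reset_index(drop=True)
print(result)
```

This keeps rows 1 and 4 of each duplicated block, matching the desired output, but the helper-column merge in the answer below is simpler and avoids the cartesian blow-up in the first place.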

I suggest creating a new helper column that numbers the repeats of each id value with cumcount, and then merging on it as well:

df1['g'] = df1.groupby('id').cumcount()
df2['g'] = df2.groupby('id').cumcount()

merged_table = pd.merge(df1,df2,on=["id", 'g'],how='outer')
print (merged_table)
    Name  amount_x    id  g Category  amount_y
0   John    500.25  GH10  0     Food    500.25
1  Helen   1250.00  GH11  0   Travel   1250.00
2   Adam    432.54  GH11  1     Food    432.54
3  Sarah    567.12  GH12  0      NaN       NaN

And last remove the helper column g:

merged_table = pd.merge(df1,df2,on=["id", 'g'],how='outer').drop('g', axis=1)
print (merged_table)
    Name  amount_x    id Category  amount_y
0   John    500.25  GH10     Food    500.25
1  Helen   1250.00  GH11   Travel   1250.00
2   Adam    432.54  GH11     Food    432.54
3  Sarah    567.12  GH12      NaN       NaN 

Detail:

print (df1)
    Name   amount    id  g
0   John   500.25  GH10  0
1  Helen  1250.00  GH11  0
2   Adam   432.54  GH11  1
3  Sarah   567.12  GH12  0

print (df2)
  Category   amount    id  g
0     Food   500.25  GH10  0
1   Travel  1250.00  GH11  0
2     Food   432.54  GH11  1
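For completeness, the approach above as a self-contained script (the frames are rebuilt from the question's data):

```python
import pandas as pd

# Frames reconstructed from the question's CSV data
df1 = pd.DataFrame({
    "Name": ["John", "Helen", "Adam", "Sarah"],
    "amount": [500.25, 1250.00, 432.54, 567.12],
    "id": ["GH10", "GH11", "GH11", "GH12"],
})
df2 = pd.DataFrame({
    "Category": ["Food", "Travel", "Food"],
    "amount": [500.25, 1250.00, 432.54],
    "id": ["GH10", "GH11", "GH11"],
})

# Number each repeat of an id (0, 1, ...) on both sides, then merge on
# (id, g) so the first GH11 pairs with the first GH11, and so on.
df1["g"] = df1.groupby("id").cumcount()
df2["g"] = df2.groupby("id").cumcount()

merged_table = pd.merge(df1, df2, on=["id", "g"], how="outer").drop("g", axis=1)
print(merged_table)
```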

You can apply the below to your output (after the merge). It can also be done in a single shot, but I suggest you figure that out yourself. Giving you the hint...

>>> df.drop_duplicates('Name',keep='first')
    Name  amount_x    id category amount_y
0   John    500.25  GH10     Food   500.25
1  Helen   1250.00  GH11   Travel     1250
3   Adam    432.54  GH11   Travel     1250
5  Sarah    567.12  GH12      NaN      NaN
