I am trying to merge two pandas DataFrames whose merge keys contain duplicate rows (here, the rows where both 'a' and 'b' are 2). As a result, pandas takes the Cartesian product of the duplicate rows, as shown below:
In [8]: df1 = pd.DataFrame({'a' : [1, 2, 2], 'b' : [2, 2, 2], 'c' : [3, 6, 6]})
In [9]: df2 = pd.DataFrame({'a' : [2, 2], 'b' : [2, 2], 'd' : [2, 5]})
In [10]: df1.merge(df2, how='outer', on=['a', 'b'])
Out[10]:
   a  b  c    d
0  1  2  3  NaN
1  2  2  6  2.0
2  2  2  6  5.0
3  2  2  6  2.0
4  2  2  6  5.0
I want each duplicate row to be merged only once, pairing the duplicates in the order in which they appear (in this case, numerically by index). So the output I would like to have is:
In [12]: df_output = pd.DataFrame({'a' : [1, 2, 2], 'b' : [2, 2, 2], 'c' : [3, 6, 6], 'd' : [np.nan, 2, 5]})
In [13]: df_output
Out[13]:
   a  b  c    d
0  1  2  3  NaN
1  2  2  6  2.0
2  2  2  6  5.0
How would I do this?
You need a helper column holding a per-group counter, created with GroupBy.cumcount:
import pandas as pd

df1 = pd.DataFrame({'a' : [1, 2, 2], 'b' : [2, 2, 2], 'c' : [3, 6, 6]})
df2 = pd.DataFrame({'a' : [2, 2], 'b' : [2, 2], 'd' : [2, 5]})
df1['g'] = df1.groupby(['a', 'b']).cumcount()
df2['g'] = df2.groupby(['a', 'b']).cumcount()
df = df1.merge(df2, how='outer', on=['a', 'b', 'g'])
print(df)
   a  b  c  g    d
0  1  2  3  0  NaN
1  2  2  6  0  2.0
2  2  2  6  1  5.0
Finally, remove the helper column g:
df = df1.merge(df2, how='outer', on=['a', 'b', 'g']).drop('g', axis=1)
print(df)
   a  b  c    d
0  1  2  3  NaN
1  2  2  6  2.0
2  2  2  6  5.0
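If you need this in more than one place, the steps can be wrapped in a small helper. This is only a sketch: the function name merge_in_order and the temporary column name _g are made up for illustration and are not part of pandas. It uses assign so that the input frames are left unmodified, and it assumes neither frame already has a column called _g.

```python
import numpy as np
import pandas as pd


def merge_in_order(left, right, keys, how="outer"):
    """Hypothetical helper: pair the n-th duplicate of each key in `left`
    with the n-th duplicate of the same key in `right`."""
    keys = list(keys)
    # cumcount numbers the rows inside each key group 0, 1, 2, ...
    # in order of appearance; assign returns a copy, so inputs stay untouched.
    left_g = left.assign(_g=left.groupby(keys).cumcount())
    right_g = right.assign(_g=right.groupby(keys).cumcount())
    # Merging on the keys plus the counter avoids the Cartesian product.
    merged = left_g.merge(right_g, how=how, on=keys + ["_g"])
    return merged.drop(columns="_g")


df1 = pd.DataFrame({'a': [1, 2, 2], 'b': [2, 2, 2], 'c': [3, 6, 6]})
df2 = pd.DataFrame({'a': [2, 2], 'b': [2, 2], 'd': [2, 5]})

result = merge_in_order(df1, df2, ['a', 'b'])
print(result)
```

On the sample data this should print the same three rows as the output above.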
Doesn't drop_duplicates solve your problem?
df = df1.merge(df2, how='outer', on=['a', 'b'])
df = df.drop_duplicates()
I think this is enough:
df1.merge(df2, how = 'outer').drop_duplicates()
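One thing worth checking before relying on drop_duplicates: it only collapses rows that are identical in every column. The sketch below compares the sample data from the question with a variant in which the duplicated key rows carry different values in 'c'. The value 7 is a made-up assumption for illustration and is not from the question.

```python
import pandas as pd

df2 = pd.DataFrame({'a': [2, 2], 'b': [2, 2], 'd': [2, 5]})

# Sample data from the question: both duplicated rows have c == 6.
df1_same = pd.DataFrame({'a': [1, 2, 2], 'b': [2, 2, 2], 'c': [3, 6, 6]})
dedup_same = df1_same.merge(df2, how='outer', on=['a', 'b']).drop_duplicates()

# Hypothetical variant: the duplicated rows differ in c (6 and 7).
df1_diff = pd.DataFrame({'a': [1, 2, 2], 'b': [2, 2, 2], 'c': [3, 6, 7]})
dedup_diff = df1_diff.merge(df2, how='outer', on=['a', 'b']).drop_duplicates()

print(len(dedup_same))  # 3: the repeated product rows are identical and collapse
print(len(dedup_diff))  # 5: every product row is distinct, so nothing is dropped
```

So on the question's data this gives the wanted three rows, but in general it seems to depend on the duplicated rows being exact copies of each other.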