简体   繁体   中英

How to combine dataframes with redundant rows in pandas

I want to combine two data frames with different and overlapping columns:

df1
    X   a   Y   b     c
A   P   1   Q   21    1.135899
B   P   2   Q   22    1.093204
C   P   3   Q   23    2.035373
D   P   4   Q   24    0.350060
E   P   5   Q   25   -0.939962

df2
    a    b     d
A   1    21    5.5
A   1    21    3.3
A   1    21    2.1
B   2    22    0.8
B   2    22    0.5
C   3    23    1.3
C   3    23    6.5
C   3    23    7.1

I would like to combine both data frames in this way:

df3
    a    b   c          d
A   1    21  1.135899   5.5
A   1    21  1.135899   3.3
A   1    21  1.135899   2.1
B   2    22  1.093204   0.8
B   2    22  1.093204   0.5
C   3    23  2.035373   1.3
C   3    23  2.035373   6.5
C   3    23  2.035373   7.1

How can I achieve this?

Try a left merge. To maintain index, you will need use reset_index before and set_index after the marge.

res = df2.reset_index()\
         .merge(df1, how='left')\
         .set_index('index')\
         .loc[:, ['a', 'b', 'c', 'd']]

print(res)

#        a   b         c    d
# index                      
# A      1  21  1.135899  5.5
# A      1  21  1.135899  3.3
# A      1  21  1.135899  2.1
# B      2  22  1.093204  0.8
# B      2  22  1.093204  0.5
# C      3  23  2.035373  1.3
# C      3  23  2.035373  6.5
# C      3  23  2.035373  7.1

For append all columns from df1 if not exist in df2 filter by difference and join by default left join:

df = df2.join(df1[df1.columns.difference(df2.columns)])
print (df)
   a   b    d  X  Y         c
A  1  21  5.5  P  Q  1.135899
A  1  21  3.3  P  Q  1.135899
A  1  21  2.1  P  Q  1.135899
B  2  22  0.8  P  Q  1.093204
B  2  22  0.5  P  Q  1.093204
C  3  23  1.3  P  Q  2.035373
C  3  23  6.5  P  Q  2.035373
C  3  23  7.1  P  Q  2.035373

And if need only some columns add subset by list :

df = df2.join(df1[df1.columns.difference(df2.columns)])[['a','b','c','d']]
print (df)
   a   b         c    d
A  1  21  1.135899  5.5
A  1  21  1.135899  3.3
A  1  21  1.135899  2.1
B  2  22  1.093204  0.8
B  2  22  1.093204  0.5
C  3  23  2.035373  1.3
C  3  23  2.035373  6.5
C  3  23  2.035373  7.1

Detail:

print (df1[df1.columns.difference(df2.columns)])
   X  Y         c
A  P  Q  1.135899
B  P  Q  1.093204
C  P  Q  2.035373
D  P  Q  0.350060
E  P  Q -0.939962

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM