Duplication when combining two pandas DataFrames

I am trying to understand why the merge function is duplicating values.

>>> c2.head()
Out[42]:                  
Bin        Date/Time       val         
A    10/31/2017 15:53:57   0.77
A    10/31/2017 15:53:57   0.75
A    10/31/2017 15:53:57   0.79
A    10/31/2017 15:53:57   0.67
A    10/31/2017 15:53:57   0.72

>>> c1.head()
Out[44]: 
  Bin   Date/Time          code
  A  10/31/2017 15:53:57   BYM
  A  10/31/2017 15:53:57   CFS
  A  10/31/2017 15:53:57   DFZ
  A  10/31/2017 15:53:57   HKN
  A  10/31/2017 15:53:57   RBF

I need to merge these two on Bin and Date/Time.

>>> c = c1.merge(c2, on=['Bin', 'Date/Time'], how='left')

>>> c.head()
Out[50]: 
  Bin       Date/Time      Code  Val
  A  10/31/2017 15:53:57   BYM   0.77
  A  10/31/2017 15:53:57   BYM   0.77
  A  10/31/2017 15:53:57   BYM   0.77
  A  10/31/2017 15:53:57   BYM   0.77
  A  10/31/2017 15:53:57   BYM   0.77

So c has multiple entries for the same bin/datetime. I thought that maybe the datetime values look the same but are different. But that's not the case.

>>> c1['Date/Time'].iloc[0]
Out[46]: u'10/31/2017 15:53:57'
>>> c2['Date/Time'].iloc[0]
Out[47]: u'10/31/2017 15:53:57'
>>> c1['Date/Time'].iloc[0]==c2['Date/Time'].iloc[0]
Out[48]: True

In addition, even if datetime was different, there should be only 2 lines for each bin/datetime. Any idea what might be happening here?

My intended output is:

  Bin       Date/Time      Code  Val
  A  10/31/2017 15:53:57   BYM   0.77
  A  10/31/2017 15:53:57   CFS   0.75
  A  10/31/2017 15:53:57   DFZ   0.79
  A  10/31/2017 15:53:57   HKN   0.67
  A  10/31/2017 15:53:57   RBF   0.72

The values are duplicated because c2 contains several rows with the same Bin and Date/Time key, each with a different val.

Simplified example:

>>> c1.head(1)
  Bin           Date/Time code
0   A 2017-10-31 15:53:57  BYM

Merge this one row with c2:

>>> c1.head(1).merge(c2, on=['Bin','Date/Time'], how='left')
  Bin           Date/Time code   val
0   A 2017-10-31 15:53:57  BYM  0.77
1   A 2017-10-31 15:53:57  BYM  0.75
2   A 2017-10-31 15:53:57  BYM  0.79
3   A 2017-10-31 15:53:57  BYM  0.67
4   A 2017-10-31 15:53:57  BYM  0.72

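You are merging on the two keys ['Bin','Date/Time'], and those keys are identical in every row shown. A merge pairs each row of c1 with every row of c2 that has the same key, so each code in c1 brings over every val from c2. If both frames hold 5 rows with this key, the result has 5 x 5 = 25 rows.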
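A minimal sketch of that behaviour, assuming the two frames contain exactly the five rows shown in the question (the real c1 and c2 may be larger). It counts the merged rows, checks c2 for repeated keys, and uses the validate argument of merge to make pandas raise an error when the keys are not unique:

```python
import pandas as pd

keys = ["Bin", "Date/Time"]
ts = "10/31/2017 15:53:57"

# Frames rebuilt from the question's head() output (assumed to be the full data).
c1 = pd.DataFrame({"Bin": ["A"] * 5, "Date/Time": [ts] * 5,
                   "code": ["BYM", "CFS", "DFZ", "HKN", "RBF"]})
c2 = pd.DataFrame({"Bin": ["A"] * 5, "Date/Time": [ts] * 5,
                   "val": [0.77, 0.75, 0.79, 0.67, 0.72]})

# Every c1 row matches every c2 row with the same key.
merged = c1.merge(c2, on=keys, how="left")
n_rows = len(merged)

# True if any key combination appears more than once in c2.
c2_has_repeated_keys = c2.duplicated(subset=keys, keep=False).any()

# validate="one_to_one" asks pandas to check key uniqueness on both sides.
try:
    c1.merge(c2, on=keys, how="left", validate="one_to_one")
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True

print(n_rows, c2_has_repeated_keys, validation_failed)
```

Checking for repeated keys before merging is usually the quickest way to tell whether a merge will multiply rows.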

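It doesn't appear that you need a merge at all. If the two dataframes have the same size and index, you can simply assign the column from one to the other. Use bracket notation, because attribute assignment (c1.val = ...) does not create a new column:

c1['val'] = c2['val']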

Sometimes you may wish to copy multiple series from one dataframe to another. Instead of looping over the columns, you can use combine_first:

c1.combine_first(c2)

This gives priority to c1 in case of common indices, but it will not matter if the only difference is one dataframe has an extra column.
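As an illustration, here is a small sketch of combine_first on frames rebuilt from the question's head() output; it assumes both frames share the default 0..4 index. The result is a new dataframe, so it has to be assigned to a name:

```python
import pandas as pd

ts = "10/31/2017 15:53:57"

# Frames rebuilt from the question's head() output; both use the default index.
c1 = pd.DataFrame({"Bin": ["A"] * 5, "Date/Time": [ts] * 5,
                   "code": ["BYM", "CFS", "DFZ", "HKN", "RBF"]})
c2 = pd.DataFrame({"Bin": ["A"] * 5, "Date/Time": [ts] * 5,
                   "val": [0.77, 0.75, 0.79, 0.67, 0.72]})

# Values from c1 win where both frames have data; the column that
# exists only in c2 ("val") is carried over, aligned on the index.
combined = c1.combine_first(c2)

print(combined)
```

Rows are matched by index label, not by the Bin and Date/Time values, so this only gives the intended pairing when the row order of both frames corresponds.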

If the indices differ, you may wish to realign them with .reset_index(drop=True) before using either of the above methods; drop=True discards the old index instead of adding it as a column.
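The sketch below shows why the index matters. The index labels 10..14 on c2 are hypothetical, chosen only to make the two frames disagree. Column assignment aligns on index labels, so without a reset every value comes back as NaN:

```python
import pandas as pd

c1 = pd.DataFrame({"code": ["BYM", "CFS", "DFZ", "HKN", "RBF"]})
# Hypothetical index labels that do not overlap with c1's 0..4.
c2 = pd.DataFrame({"val": [0.77, 0.75, 0.79, 0.67, 0.72]},
                  index=[10, 11, 12, 13, 14])

# Assignment aligns on labels: no label in c2 matches c1, so all NaN.
misaligned = c1.copy()
misaligned["val"] = c2["val"]

# After resetting both indices, rows are paired by position.
aligned = c1.reset_index(drop=True)
aligned["val"] = c2["val"].reset_index(drop=True)

print(misaligned)
print(aligned)
```

This pairs rows purely by position, so it is only appropriate when both frames are the same length and already sorted consistently.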
