
Pandas: remove duplicates deletes data when concatenate data frames with DateTime Index

I have 2 dataframes (some values are duplicated, e.g. 2020-02-13):

>>> print(df1)
                   Val
Date                
2020-02-20         152.50
2020-02-19         152.53
2020-02-18         152.20
2020-02-13         152.28

>>> print(df2)
                   Val
Date                
2018-02-20         141.40
2018-02-21         141.37
2018-02-22         141.17
2018-02-26         141.35
2018-02-27         140.69
...                   ...
2020-02-05         152.37
2020-02-06         152.20
2020-02-10         152.03
2020-02-11         151.19
2020-02-13         152.28
[298 rows x 1 columns]

Both are indexed by Date (df1.set_index('Date')), and the dates in both dataframes were parsed (pd.to_datetime(df1.index)). Now I want to concatenate them both and remove duplicates (if any). I have tried

>>> pd.concat([df1, df2])
                   Val
Date                
2018-02-20         141.40
2018-02-21         141.37
2018-02-22         141.17
2018-02-26         141.35
2018-02-27         140.69
...                   ...
2020-02-13         152.28
2020-02-20         152.50
2020-02-19         152.53
2020-02-18         152.20
2020-02-13         152.28
[302 rows x 1 columns]

and I got a new df with duplicates (2020-02-13). However, when running

>>> pd.concat([df1, df2]).drop_duplicates()
                   Val
Date                
2018-02-20         141.40
2018-02-21         141.37
2018-02-22         141.17
2018-02-26         141.35
2018-02-27         140.69
...                   ...
2020-02-06         152.20
2020-02-10         152.03
2020-02-11         151.19
2020-02-13         152.28
2020-02-20         152.50
[299 rows x 1 columns]

it removes the duplicates, but also some other rows (2020-02-18 and 2020-02-19). Any idea why? And what is the correct way to concatenate 2 dataframes indexed by date?

Sample:

print (df1)
               Val
Date              
2020-02-20  152.50
2020-02-19  152.53
2020-02-18  152.20
2020-02-13  152.28

print (df2)
               Val
Date              
2018-02-20  152.53
2018-02-21  141.37
2020-02-13  152.28

If join together:

print (pd.concat([df1, df2]))
               Val
Date              
2020-02-20  152.50
2020-02-19  152.53
2020-02-18  152.20
2020-02-13  152.28
2018-02-20  152.53
2018-02-21  141.37
2020-02-13  152.28

Your solution removes duplicates by all columns only — here the Val column; the index is not tested:

df3 = pd.concat([df1, df2]).drop_duplicates()
print (df3)
               Val
Date              
2020-02-20  152.50
2020-02-19  152.53 <-dupe
2020-02-18  152.20
2020-02-13  152.28 <-dupe
2018-02-21  141.37
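To make the behaviour visible, here is a minimal sketch (rebuilding the sample frames above) that asks duplicated() which rows drop_duplicates would treat as repeats — it flags 2018-02-20, because its Val (152.53) matches 2020-02-19's, even though the dates differ:

```python
import pandas as pd

# Sample frames from above
df1 = pd.DataFrame({"Val": [152.50, 152.53, 152.20, 152.28]},
                   index=pd.DatetimeIndex(["2020-02-20", "2020-02-19",
                                           "2020-02-18", "2020-02-13"], name="Date"))
df2 = pd.DataFrame({"Val": [152.53, 141.37, 152.28]},
                   index=pd.DatetimeIndex(["2018-02-20", "2018-02-21",
                                           "2020-02-13"], name="Date"))

joined = pd.concat([df1, df2])
# duplicated() (like drop_duplicates) looks at the columns only, not the
# index, so 2018-02-20 (Val 152.53) counts as a repeat of 2020-02-19
print(joined[joined.duplicated()])
```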

If you convert the DatetimeIndex to a column, it removes duplicates by all columns, here Date and Val:

df4 =  pd.concat([df1, df2]).reset_index().drop_duplicates()
print (df4)
        Date     Val
0 2020-02-20  152.50
1 2020-02-19  152.53 <-not dupe, different datetime
2 2020-02-18  152.20
3 2020-02-13  152.28 <-dupe
4 2018-02-20  152.53 <-not dupe, different datetime
5 2018-02-21  141.37

If you need to remove duplicates by DatetimeIndex only, use

df5 = pd.concat([df1, df2])
df5 = df5[~df5.index.duplicated()]
print (df5)
               Val
Date              
2020-02-20  152.50
2020-02-19  152.53
2020-02-18  152.20
2020-02-13  152.28 <-dupe
2018-02-20  152.53
2018-02-21  141.37
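A small sketch of the same idea, rebuilt from the sample frames above: Index.duplicated marks every occurrence after the first by default; passing keep='last' would keep df2's row for 2020-02-13 instead of df1's.

```python
import pandas as pd

# Sample frames from above
df1 = pd.DataFrame({"Val": [152.50, 152.53, 152.20, 152.28]},
                   index=pd.DatetimeIndex(["2020-02-20", "2020-02-19",
                                           "2020-02-18", "2020-02-13"], name="Date"))
df2 = pd.DataFrame({"Val": [152.53, 141.37, 152.28]},
                   index=pd.DatetimeIndex(["2018-02-21", "2018-02-20",
                                           "2020-02-13"], name="Date"))

df5 = pd.concat([df1, df2])
# keep='first' (the default) keeps df1's 2020-02-13 row;
# keep='last' would keep df2's instead
deduped = df5[~df5.index.duplicated(keep="first")]
print(deduped)
```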

Or remove duplicates by the column Date specified in the subset parameter:

df51 = pd.concat([df1, df2]).reset_index().drop_duplicates(subset=['Date'])
print (df51)
        Date     Val
0 2020-02-20  152.50
1 2020-02-19  152.53
2 2020-02-18  152.20
3 2020-02-13  152.28 <-dupe
4 2018-02-20  152.53
5 2018-02-21  141.37
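If the result should still carry the DatetimeIndex after deduplicating on Date, one sketch (assuming the sample frames above) is to reset the index, drop duplicates on the Date column, then set the index back:

```python
import pandas as pd

# Sample frames from above
df1 = pd.DataFrame({"Val": [152.50, 152.53, 152.20, 152.28]},
                   index=pd.DatetimeIndex(["2020-02-20", "2020-02-19",
                                           "2020-02-18", "2020-02-13"], name="Date"))
df2 = pd.DataFrame({"Val": [152.53, 141.37, 152.28]},
                   index=pd.DatetimeIndex(["2018-02-20", "2018-02-21",
                                           "2020-02-13"], name="Date"))

# Deduplicate on the Date column, then restore the DatetimeIndex
df51 = (pd.concat([df1, df2])
          .reset_index()
          .drop_duplicates(subset=["Date"])
          .set_index("Date"))
print(df51)
```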

Would the verify_integrity option of pandas' concat do the trick? In your case, it would look like:

df = pd.concat([df1, df2], verify_integrity=True)
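Note that verify_integrity=True does not remove duplicates — it raises a ValueError when the concatenated index contains any, so it is a validation tool rather than a fix. A minimal sketch (with small hypothetical frames sharing one date):

```python
import pandas as pd

df1 = pd.DataFrame({"Val": [152.50, 152.28]},
                   index=pd.DatetimeIndex(["2020-02-20", "2020-02-13"], name="Date"))
df2 = pd.DataFrame({"Val": [141.37, 152.28]},
                   index=pd.DatetimeIndex(["2018-02-21", "2020-02-13"], name="Date"))

err = None
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as exc:
    # pandas reports the overlapping index values in the message
    err = exc
    print("duplicate index:", exc)
```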
