I have 2 dataframes (some values are duplicated, e.g. 2020-02-13):
>>> print(df1)
Val
Date
2020-02-20 152.50
2020-02-19 152.53
2020-02-18 152.20
2020-02-13 152.28
>>> print(df2)
Val
Date
2018-02-20 141.40
2018-02-21 141.37
2018-02-22 141.17
2018-02-26 141.35
2018-02-27 140.69
... ...
2020-02-05 152.37
2020-02-06 152.20
2020-02-10 152.03
2020-02-11 151.19
2020-02-13 152.28
[298 rows x 1 columns]
Both are indexed by Date (df1.set_index('Date')), and both dataframes' dates were parsed (pd.to_datetime(df1.index)). Now I want to concatenate them and remove duplicates (if any). I have tried
>>> pd.concat([df1, df2])
Val
Date
2018-02-20 141.40
2018-02-21 141.37
2018-02-22 141.17
2018-02-26 141.35
2018-02-27 140.69
... ...
2020-02-13 152.28
2020-02-20 152.50
2020-02-19 152.53
2020-02-18 152.20
2020-02-13 152.28
[302 rows x 1 columns]
and I got a new dataframe that still contains the duplicate (2020-02-13). However, when running
>>> pd.concat([df1, df2]).drop_duplicates()
Val
Date
2018-02-20 141.40
2018-02-21 141.37
2018-02-22 141.17
2018-02-26 141.35
2018-02-27 140.69
... ...
2020-02-06 152.20
2020-02-10 152.03
2020-02-11 151.19
2020-02-13 152.28
2020-02-20 152.50
[299 rows x 1 columns]
it removes the duplicates, but it also drops some other values (2020-02-18 and 2020-02-19). Any idea why? And what is the correct way to concatenate 2 dataframes indexed by date?
Sample:
print (df1)
Val
Date
2020-02-20 152.50
2020-02-19 152.53
2020-02-18 152.20
2020-02-13 152.28
print (df2)
Val
Date
2018-02-20 152.53
2018-02-21 141.37
2020-02-13 152.28
If joined together:
print (pd.concat([df1, df2]))
Val
Date
2020-02-20 152.50
2020-02-19 152.53
2020-02-18 152.20
2020-02-13 152.28
2018-02-20 152.53
2018-02-21 141.37
2020-02-13 152.28
Your solution removes duplicates by comparing all columns, here only the Val column; the index is not tested:
df3 = pd.concat([df1, df2]).drop_duplicates()
print (df3)
Val
Date
2020-02-20 152.50
2020-02-19 152.53 <-dupe
2020-02-18 152.20
2020-02-13 152.28 <-dupe
2018-02-21 141.37
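To see which rows drop_duplicates treats as duplicates, you can inspect the boolean mask from DataFrame.duplicated(), which compares column values only and ignores the index. A runnable sketch using the concatenated sample values from above:

```python
import pandas as pd

# The concatenated sample from above: Val 152.53 and 152.28 each appear twice
df = pd.DataFrame(
    {"Val": [152.50, 152.53, 152.20, 152.28, 152.53, 141.37, 152.28]},
    index=pd.to_datetime(["2020-02-20", "2020-02-19", "2020-02-18",
                          "2020-02-13", "2018-02-20", "2018-02-21",
                          "2020-02-13"]),
)
df.index.name = "Date"

# True marks rows whose Val already appeared earlier, regardless of Date:
# here 2018-02-20 (152.53) and the second 2020-02-13 (152.28)
mask = df.duplicated()
print(mask)
```

This is exactly why 2018-02-20 vanished in the question: its Val matched 2020-02-19, so the later row was dropped.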
If the DatetimeIndex is converted to a column, it removes duplicates by all columns, here Date and Val:
df4 = pd.concat([df1, df2]).reset_index().drop_duplicates()
print (df4)
Date Val
0 2020-02-20 152.50
1 2020-02-19 152.53 <-not dupe, different datetime
2 2020-02-18 152.20
3 2020-02-13 152.28 <-dupe
4 2018-02-20 152.53 <-not dupe, different datetime
5 2018-02-21 141.37
If you need to remove duplicates by the DatetimeIndex only, use
df5 = pd.concat([df1, df2])
df5 = df5[~df5.index.duplicated()]
print (df5)
Val
Date
2020-02-20 152.50
2020-02-19 152.53
2020-02-18 152.20
2020-02-13 152.28 <- first dupe kept
2018-02-20 152.53
2018-02-21 141.37
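By default Index.duplicated() marks every occurrence after the first; pass keep='last' to keep the last occurrence instead. A minimal runnable sketch (the 999.99 value is made up to make the two occurrences distinguishable):

```python
import pandas as pd

# Hypothetical frame with 2020-02-13 duplicated in the index
df = pd.DataFrame(
    {"Val": [152.28, 152.50, 999.99]},
    index=pd.to_datetime(["2020-02-13", "2020-02-20", "2020-02-13"]),
)
df.index.name = "Date"

first = df[~df.index.duplicated()]            # keep='first' is the default
last = df[~df.index.duplicated(keep="last")]  # keep the last occurrence

print(first)  # 2020-02-13 -> 152.28
print(last)   # 2020-02-13 -> 999.99
```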
Or remove duplicates by the column Date, specified in the subset parameter:
df51 = pd.concat([df1, df2]).reset_index().drop_duplicates(subset=['Date'])
print (df51)
Date Val
0 2020-02-20 152.50
1 2020-02-19 152.53
2 2020-02-18 152.20
3 2020-02-13 152.28 <- first dupe kept
4 2018-02-20 152.53
5 2018-02-21 141.37
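Since reset_index turns Date back into an ordinary column, you can restore the DatetimeIndex afterwards with set_index. A small end-to-end sketch with made-up values:

```python
import pandas as pd

# Hypothetical frames sharing the date 2020-02-13 (values are illustrative)
df1 = pd.DataFrame({"Val": [152.50, 152.28]},
                   index=pd.to_datetime(["2020-02-20", "2020-02-13"]))
df2 = pd.DataFrame({"Val": [141.37, 152.28]},
                   index=pd.to_datetime(["2018-02-21", "2020-02-13"]))
df1.index.name = df2.index.name = "Date"

out = (pd.concat([df1, df2])
         .reset_index()
         .drop_duplicates(subset=["Date"])
         .set_index("Date"))  # restore the DatetimeIndex
print(out)
```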
Does the verify_integrity option of pandas' concat method do the trick? In your case, it would look like:
df = pd.concat([df1, df2], verify_integrity=True)
Note, however, that verify_integrity=True does not remove duplicates; it raises a ValueError if the concatenated index contains any.
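A quick way to check this behavior: with a duplicated index value, verify_integrity=True raises a ValueError rather than dropping the duplicate (the frames below are made up):

```python
import pandas as pd

# Two made-up frames sharing the index value 2020-02-13
df1 = pd.DataFrame({"Val": [152.50, 152.28]},
                   index=pd.to_datetime(["2020-02-20", "2020-02-13"]))
df2 = pd.DataFrame({"Val": [152.28]},
                   index=pd.to_datetime(["2020-02-13"]))

try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as exc:
    # concat refuses to build an index with overlapping labels
    print("overlap detected:", exc)
```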