I'm trying to read a time series that has some gaps, which I want to fill. I've done this before, but with this data set pandas seems to be treating what look like unique datetimes as duplicates.
When I read the csv without assigning an index or parsing dates, then check for duplicates, it shows none:
import pandas as pd
import numpy as np
df = pd.read_csv("/home/dewy/Desktop/euro/strip.csv",names=['time','open','high','low','close','volume'])#, index_col='time', parse_dates=True)
df[df.duplicated()]
The output is:
time open high low close volume
a blank table.
When I check the duplicates just for 'time':
df[df.duplicated(subset='time')]
I get two duplicates, though it seems to be saying that 03:59 is equal to 04:00:
time open high low close volume
1255854 2012-11-21 03:59:00 1.27703 1.27703 1.27672 1.27672 2
1255855 2012-11-21 04:00:00 1.27666 1.27669 1.27531 1.27537 1211
and when I read_csv naming the index and parsing dates, more duplicates appear:
df = pd.read_csv("/home/dewy/Desktop/euro/strip.csv",names=['time','open','high','low','close','volume'], index_col='time', parse_dates=True)
df[df.duplicated()]
[output]:
open high low close volume
time
2009-05-01 04:01:00 1.32549 1.32549 1.32547 1.32548 3
2009-05-03 21:57:00 1.32827 1.32827 1.32827 1.32827 2
2009-05-05 22:33:00 1.33155 1.33155 1.33150 1.33155 3
2009-05-07 21:24:00 1.33976 1.33980 1.33976 1.33980 2
...
2014-02-21 05:35:00 1.37179 1.37179 1.37179 1.37179 3
2014-02-21 08:48:00 1.37125 1.37125 1.37117 1.37117 18
2014-02-21 11:12:00 1.37089 1.37093 1.37089 1.37093 12
2014-02-21 19:37:00 1.37409 1.37409 1.37409 1.37409 2
Altogether there are 2837 duplicate rows.
The same thing happens if I first import without naming the index or parsing dates, then call to_datetime and set_index afterwards.
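For reference, this is roughly the alternative path I mean, sketched on a hypothetical two-row inline sample rather than the full csv:

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row sample standing in for strip.csv
csv = StringIO("2012-11-21 03:59:00,1.27703,1.27703,1.27672,1.27672,2\n"
               "2012-11-21 04:00:00,1.27666,1.27669,1.27531,1.27537,1211\n")

# Read without parsing dates, then convert and set the index afterwards
df = pd.read_csv(csv, names=['time', 'open', 'high', 'low', 'close', 'volume'])
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')
```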
It just seems to be behaving oddly to me; any ideas? Thanks
By default, df.duplicated keeps the first instance and only returns the other duplicates. When you run it on 'time' and get two duplicates, it is not saying those two rows are duplicates of each other; it is saying those two records have already been seen, so they are duplicates of two other records. Try setting keep=False in the call to duplicated if you want to see all duplicated records.
import pandas as pd
names = ['chris','adam','chris','sam','adam','david']
df = pd.DataFrame(names)
print(df)
print(df[df.duplicated()])
print(df[df.duplicated(keep=False)])
PRINT1 - the whole data frame
0
0 chris
1 adam
2 chris
3 sam
4 adam
5 david
PRINT2 - df.duplicated() defaults to keep='first'. This is not saying chris is a duplicate of adam; it is saying chris and adam have already been seen
0
2 chris
4 adam
PRINT3 - passing keep=False to df.duplicated() so that we see all records which have a duplicate
0
0 chris
1 adam
2 chris
4 adam
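The same thing happens with timestamps. On a hypothetical sample (not your actual data) with two repeated times, the default call flags only the later copies, while keep=False reveals both copies of each:

```python
import pandas as pd

# Hypothetical sample: each timestamp appears twice
times = pd.to_datetime(['2012-11-21 03:59:00', '2012-11-21 04:00:00',
                        '2012-11-21 03:59:00', '2012-11-21 04:00:00'])
df = pd.DataFrame({'time': times})

print(df[df.duplicated(subset='time')])              # rows 2 and 3 only
print(df[df.duplicated(subset='time', keep=False)])  # all four rows
```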
It seems like pandas is behaving just as it should. See DataFrame.duplicated for details.
df[df.duplicated()]
checks all cells: if every row differs from every other row in at least one cell, we expect to get no duplicates.
When you call df.duplicated(subset="time"), pandas uses the option keep="first" by default. Use keep=False to view all duplicates. This should solve your problem.
After you set the index to time, df.duplicated looks only at your columns (open, high, low, close, volume), not at the index (time); this should explain the 2837 duplicates.
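A minimal sketch of that behaviour, with hypothetical OHLC values: once time is the index, two rows with identical column values count as duplicates even though their timestamps differ. Calling reset_index() first moves time back into the columns so it is compared as well.

```python
import pandas as pd

# Hypothetical frame: distinct timestamps, identical column values
df = pd.DataFrame(
    {'open': [1.0, 1.0], 'close': [1.1, 1.1]},
    index=pd.to_datetime(['2009-05-01 04:00:00', '2009-05-01 04:01:00'])
)
df.index.name = 'time'

print(df.duplicated().sum())                # 1: the index is ignored
print(df.reset_index().duplicated().sum())  # 0: time is compared too
```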