
Pandas Showing Unique Dates as Duplicates

I'm trying to read a time series that has some gaps, and fill in those gaps. I've done this before, but with this data set pandas seems to be treating unique datetimes as duplicates.

When I try reading the csv without assigning index or parsing dates, then check duplicates it shows none:

import pandas as pd
import numpy as np

df = pd.read_csv("/home/dewy/Desktop/euro/strip.csv",names=['time','open','high','low','close','volume'])#, index_col='time', parse_dates=True)
df[df.duplicated()]

The output is:

time    open    high    low     close   volume

a blank table.

When I check the duplicates just for 'time'

df[df.duplicated(subset='time')]

I get two rows back, which at first glance seems to say that 3:59 is equal to 4:00:

                time                 open        high         low        close      volume
1255854     2012-11-21 03:59:00     1.27703     1.27703     1.27672     1.27672     2
1255855     2012-11-21 04:00:00     1.27666     1.27669     1.27531     1.27537     1211
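(For context: each flagged row is a repeat of some *earlier* row, not of the row next to it. A minimal sketch with made-up timestamps, since the original CSV isn't available, showing how keep=False reveals the actual matching pairs:)

```python
import pandas as pd

# Synthetic stand-in for the 'time' column (the original CSV isn't available).
df = pd.DataFrame({"time": ["2012-11-21 03:59:00",
                            "2012-11-21 03:59:00",   # repeat of the first timestamp
                            "2012-11-21 04:00:00",
                            "2012-11-21 04:00:00"]})  # repeat of the third

# Default keep='first' flags only the later occurrences (rows 1 and 3)...
print(df[df.duplicated(subset="time")])

# ...while keep=False flags every row that has a duplicate,
# so each flagged row appears next to its actual match.
print(df[df.duplicated(subset="time", keep=False)])
```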

and when I read the csv with index_col='time' and parse_dates=True, many more duplicates appear:

df = pd.read_csv("/home/dewy/Desktop/euro/strip.csv",names=['time','open','high','low','close','volume'], index_col='time', parse_dates=True)
df[df.duplicated()]

[output]:

                         open        high        low         close   volume
      time                  
2009-05-01 04:01:00     1.32549     1.32549     1.32547     1.32548     3
2009-05-03 21:57:00     1.32827     1.32827     1.32827     1.32827     2
2009-05-05 22:33:00     1.33155     1.33155     1.33150     1.33155     3
2009-05-07 21:24:00     1.33976     1.33980     1.33976     1.33980     2 
...
2014-02-21 05:35:00     1.37179     1.37179     1.37179     1.37179     3
2014-02-21 08:48:00     1.37125     1.37125     1.37117     1.37117     18
2014-02-21 11:12:00     1.37089     1.37093     1.37089     1.37093     12
2014-02-21 19:37:00     1.37409     1.37409     1.37409     1.37409     2

All together there are 2837 duplicate rows.

The same thing happens if I first import without naming the index or parsing dates, then call to_datetime and set_index afterwards.
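(The likely cause, as the answers below explain: once 'time' becomes the index, DataFrame.duplicated compares only the remaining columns, so two rows with identical open/high/low/close/volume values at different minutes count as duplicates. A minimal sketch with made-up prices:)

```python
import pandas as pd

# Two DISTINCT timestamps carrying identical OHLCV values (made-up numbers).
df = pd.DataFrame(
    {"open": [1.33155, 1.33155], "high": [1.33155, 1.33155],
     "low": [1.33150, 1.33150], "close": [1.33155, 1.33155],
     "volume": [3, 3]},
    index=pd.to_datetime(["2009-05-05 22:33:00", "2009-05-05 22:34:00"]),
)
df.index.name = "time"

# duplicated() ignores the index, so the second row is flagged
# even though its timestamp is unique.
print(df.duplicated())

# The index itself has no repeated timestamps.
print(df.index.duplicated())
```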

This just seems to be behaving oddly to me; any ideas? Thanks

By default, df.duplicated keeps the first instance and only flags the later duplicates. So when you check the 'time' column and get two rows back, it is not saying those two rows are duplicates of each other; it is saying each of those records has already been seen, i.e. each is a duplicate of some earlier record. Set keep=False in the call to duplicated if you want to see all duplicated records.

import pandas as pd
names = ['chris','adam','chris','sam','adam','david']
df = pd.DataFrame(names)
print(df)
print(df[df.duplicated()])
print(df[df.duplicated(keep=False)])

PRINT1 - the whole data frame

       0
0  chris
1   adam
2  chris
3    sam
4   adam
5  david

PRINT2 - df.duplicated() with the default keep='first'. This is not saying chris is a duplicate of adam; it's saying chris and adam have each already been seen.

       0
2  chris
4   adam

PRINT3 - passing keep=False to df.duplicated so that we see all records which have a duplicate

       0
0  chris
1   adam
2  chris
4   adam

It seems like pandas is behaving just as it should be. See DataFrame.duplicated for details.

1) Full duplicates:

df[df.duplicated()] compares entire rows across all columns. If every row differs from every other row in at least one cell, we expect no duplicates.

2) Time duplicates

When you call df.duplicated(subset="time") pandas uses the option keep="first" by default. Use keep=False to view all duplicates. This should solve your problem.

3) Time-Index duplicates

After you set the index to time, df.duplicated looks only at your columns (open, high, low, close, volume), not at the index (time); this explains the 2837 duplicates.
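If you want the timestamp to count in the comparison after setting the index, you can either check the index alone with Index.duplicated, or pull it back into the columns with reset_index before calling duplicated. A sketch with made-up values:

```python
import pandas as pd

# Two unique timestamps whose column values happen to be identical.
df = pd.DataFrame(
    {"open": [1.0, 1.0], "close": [1.1, 1.1]},
    index=pd.to_datetime(["2009-05-01 04:01:00", "2009-05-01 04:02:00"]),
)
df.index.name = "time"

# Are any timestamps themselves repeated? (checks only the index)
print(df.index.duplicated().any())           # False: both timestamps are unique

# Include the index in the row comparison by moving it back to a column.
print(df.reset_index().duplicated().any())   # False: the rows differ in 'time'

# Plain duplicated() ignores the index, so it still reports a duplicate.
print(df.duplicated().any())                 # True
```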
