How to remove index if time isn't part of a consecutive group?

If I have a pandas DataFrame with a column of date times like this:

 2018-01-16 01:51:10
 2018-01-16 01:57:58    
 2018-01-16 01:57:58    
 2018-01-16 01:57:58    
 2018-01-16 01:57:58    
 2018-01-16 01:57:58    
 2018-01-16 01:57:59    
 2018-01-16 01:57:59    
 2018-01-16 01:57:59    
 2018-01-16 01:58:00     
 2018-01-16 01:58:00    
 2018-01-16 01:58:01    
 2018-01-16 01:58:01
 2018-01-16 02:00:00
 2018-01-16 02:45:01    
 2018-01-16 02:45:01    
 2018-01-16 02:45:02    
 2018-01-16 02:45:02    
 2018-01-16 02:45:02    
 2018-01-16 02:45:02    
 2018-01-16 02:45:03    
 2018-01-16 02:45:04     
 2018-01-16 02:45:04    
 2018-01-16 02:45:05    
 2018-01-16 02:45:06    
 2018-01-16 02:45:07    
 2018-01-16 02:45:07    
 2018-01-16 02:45:08    
 2018-01-16 02:58:55    

How do I remove outlier times that are not part of a group, so that any time more than 1 second away from a group of times gets discarded? I want to get a pandas DataFrame like this:

 2018-01-16 01:57:58    
 2018-01-16 01:57:58    
 2018-01-16 01:57:58    
 2018-01-16 01:57:58    
 2018-01-16 01:57:58    
 2018-01-16 01:57:59    
 2018-01-16 01:57:59    
 2018-01-16 01:57:59    
 2018-01-16 01:58:00     
 2018-01-16 01:58:00    
 2018-01-16 01:58:01    
 2018-01-16 01:58:01
 2018-01-16 02:45:01    
 2018-01-16 02:45:01    
 2018-01-16 02:45:02    
 2018-01-16 02:45:02    
 2018-01-16 02:45:02    
 2018-01-16 02:45:02    
 2018-01-16 02:45:03    
 2018-01-16 02:45:04     
 2018-01-16 02:45:04    
 2018-01-16 02:45:05    
 2018-01-16 02:45:06    
 2018-01-16 02:45:07    
 2018-01-16 02:45:07    
 2018-01-16 02:45:08    

Sample data:

import pandas as pd

idx = [
    "2018-01-16 01:51:10", "2018-01-16 01:57:58", "2018-01-16 01:57:58", "2018-01-16 01:57:58", "2018-01-16 01:57:58", "2018-01-16 01:57:58", "2018-01-16 01:57:59", "2018-01-16 01:57:59", "2018-01-16 01:57:59", "2018-01-16 01:58:00", "2018-01-16 01:58:00", "2018-01-16 01:58:01", "2018-01-16 01:58:01", "2018-01-16 02:00:00", "2018-01-16 02:45:01", "2018-01-16 02:45:01", "2018-01-16 02:45:02", "2018-01-16 02:45:02", "2018-01-16 02:45:02", "2018-01-16 02:45:02", "2018-01-16 02:45:03", "2018-01-16 02:45:04", "2018-01-16 02:45:04", "2018-01-16 02:45:05", "2018-01-16 02:45:06", "2018-01-16 02:45:07", "2018-01-16 02:45:07", "2018-01-16 02:45:08", "2018-01-16 02:58:55",
]

df = pd.DataFrame(range(len(idx)), index=idx, columns=["col"])
df.index = pd.to_datetime(df.index)

Take the first-differences in both directions:

import numpy as np
# Absolute gap, in seconds, between each pair of neighbouring timestamps
bdiff = np.abs((df.index[1:] - df.index[:-1]).total_seconds())
fdiff = np.abs((df.index[:-1] - df.index[1:]).total_seconds())

Now, each of these is effectively one element too short; bdiff excludes the first element, while fdiff excludes the last. So, pad them:

bdiff = np.insert(bdiff, 0, 99.)
fdiff = np.append(fdiff, 99.)
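To see how the padding lines the two arrays up with the index, here is a minimal sketch on a three-element toy index. The `toy` timestamps are made up for illustration and are not part of the sample data; the pad value 99 is arbitrary, and any number above the 1-second threshold would do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy index: two timestamps 1 s apart, then one 299 s later
toy = pd.to_datetime([
    "2018-01-16 01:00:00",
    "2018-01-16 01:00:01",
    "2018-01-16 01:05:00",
])

# Gaps between neighbours; one element shorter than the index
gaps = np.asarray(np.abs((toy[1:] - toy[:-1]).total_seconds()))

# bdiff[i] is the gap to the previous element, fdiff[i] the gap to the next
bdiff = np.insert(gaps, 0, 99.0)  # first element has no predecessor
fdiff = np.append(gaps, 99.0)     # last element has no successor

print(bdiff)
print(fdiff)
```

After padding, both arrays have the same length as the index, so position i in each describes the same timestamp.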

Finally, form a boolean mask from the two padded arrays; the mask is True where a value is more than one second away from both of its neighbors, i.e. the value before it and the value after it:

>>> mask = (bdiff > 1.0) & (fdiff > 1.0)
>>> print(df.loc[~mask])
                     col
2018-01-16 01:57:58    1
2018-01-16 01:57:58    2
2018-01-16 01:57:58    3
2018-01-16 01:57:58    4
2018-01-16 01:57:58    5
2018-01-16 01:57:59    6
2018-01-16 01:57:59    7
2018-01-16 01:57:59    8
2018-01-16 01:58:00    9
2018-01-16 01:58:00   10
2018-01-16 01:58:01   11
2018-01-16 01:58:01   12
2018-01-16 02:45:01   14
2018-01-16 02:45:01   15
2018-01-16 02:45:02   16
2018-01-16 02:45:02   17
2018-01-16 02:45:02   18
2018-01-16 02:45:02   19
2018-01-16 02:45:03   20
2018-01-16 02:45:04   21
2018-01-16 02:45:04   22
2018-01-16 02:45:05   23
2018-01-16 02:45:06   24
2018-01-16 02:45:07   25
2018-01-16 02:45:07   26
2018-01-16 02:45:08   27

>>> mask                                                                                                                                                     
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True])
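The same idea can also be sketched with pandas alone, using `Series.diff` in both directions instead of slicing and padding by hand. This is an alternative sketch, not the approach above: the helper name `drop_isolated` and its `max_gap` parameter are hypothetical, and the index is assumed to be sorted. The data here is a shortened version of the sample.

```python
import numpy as np
import pandas as pd

def drop_isolated(df, max_gap=1.0):
    """Drop rows whose timestamp is more than max_gap seconds
    from both the previous and the next row (hypothetical helper)."""
    ts = pd.Series(df.index)  # plain RangeIndex avoids duplicate labels
    prev_gap = ts.diff().dt.total_seconds().abs()
    next_gap = ts.diff(-1).dt.total_seconds().abs()
    # The first/last rows have no neighbour on one side: treat as infinitely far
    isolated = (prev_gap.fillna(np.inf) > max_gap) & (next_gap.fillna(np.inf) > max_gap)
    return df.loc[~isolated.to_numpy()]

# Shortened version of the sample data, for illustration only
idx = pd.to_datetime([
    "2018-01-16 01:51:10", "2018-01-16 01:57:58", "2018-01-16 01:57:58",
    "2018-01-16 01:57:59", "2018-01-16 02:00:00", "2018-01-16 02:45:01",
    "2018-01-16 02:45:02", "2018-01-16 02:58:55",
])
df_small = pd.DataFrame({"col": range(len(idx))}, index=idx)

kept = drop_isolated(df_small)
print(kept)
```

Filling the missing edge gaps with infinity plays the same role as padding with 99: the first and last rows are judged only by the one neighbor they have.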
