
Counting a range of values IF they occur for a certain time interval

I have the following pandas dataframe set up to import from a csv:

df = pd.read_csv('file_path',
                 parse_dates={'timestamp': ['Date','Time']},
                 index_col='timestamp',
                 usecols=['Date', 'Time', 'X'],)

So it ends up with a datetime index and a single int64 column 'X' holding the values.

My data looks like this with two columns:

              X
timestamp   
2015-08-25 16:52:10 95
2015-08-25 16:52:12 84
2015-08-25 16:52:14 86
2015-08-25 16:52:16 84
2015-08-25 16:52:18 85
2015-08-25 16:52:20 86
2015-08-25 16:52:22 84
2015-08-25 16:52:24 95
2015-08-25 16:52:28 95
2015-08-25 16:52:48 80
2015-08-25 16:52:50 85
2015-08-25 16:52:52 85
2015-08-25 16:52:54 84
2015-08-25 16:52:56 85
2015-08-25 16:52:58 86
2015-08-25 16:53:00 85
2015-08-25 16:53:02 85
2015-08-25 16:53:04 85
2015-08-25 16:53:06 86
2015-08-25 16:53:08 85
2015-08-25 16:53:10 85

The sampling interval isn't always consistent, however. Sometimes data points are more than two seconds apart (e.g. 16:52:28 to 16:52:48).

My desired values are X in the range 84 to 86 (inclusive), but ONLY IF they occur for at least 10 continuous seconds.

So for this dataframe, I would want Python to return a count of 2: one for 16:52:12-16:52:22 and one for 16:52:50-16:53:10.

How do I tell Python not to count 16:52:50-16:53:10 as 2? I can code for a specific time interval, but how do I translate "at least Y continuous seconds" into Python?

Thanks in advance.

EDIT: To clarify, my preferred output would be a count of how many times Event Y occurs within a sample set. Event Y occurs when X stays within the desired range for at least 10 consecutive seconds. So for example, if X is at 84-86 for at least 10 consecutive seconds, then I would want that to be a count of 1.

I'm not sure exactly what you want to do, but here is an answer that should at least help clarify the expectations.

import pandas as pd

# Test data
df = pd.DataFrame([('2015-08-25 16:52:10', 95),
  ('2015-08-25 16:52:12', 84),
  ('2015-08-25 16:52:14', 86),
  ('2015-08-25 16:52:16', 84),
  ('2015-08-25 16:52:18', 85),
  ('2015-08-25 16:52:20', 86),
  ('2015-08-25 16:52:22', 84),
  ('2015-08-25 16:52:24', 95),
  ('2015-08-25 16:52:28', 95),
  ('2015-08-25 16:52:48', 80),
  ('2015-08-25 16:52:50', 85),
  ('2015-08-25 16:52:52', 85),
  ('2015-08-25 16:52:54', 84),
  ('2015-08-25 16:52:56', 85),
  ('2015-08-25 16:52:58', 86),
  ('2015-08-25 16:53:00', 85),
  ('2015-08-25 16:53:02', 85),
  ('2015-08-25 16:53:04', 85),
  ('2015-08-25 16:53:06', 86),
  ('2015-08-25 16:53:08', 85),
  ('2015-08-25 16:53:10', 85)],
                 columns=['timestamp', 'x'])

df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp')

# Label each row with its 10-second bin, numbered from the bin that holds the first sample
# (pd.TimeGrouper was removed in pandas 1.0, so the bins are derived from the floored index)
bins = df.index.floor('10s')
df['period'] = (bins - bins[0]) // pd.Timedelta('10s')
# Group by period and value, then count rows to see how often each distinct value occurs per period
df = df.reset_index()
grouped = df.groupby(['period','x']).count()
print(grouped.head(10))

           timestamp
period x            
0      84          2
       85          1
       86          1
       95          1
1      84          1
       86          1
       95          2
3      80          1
4      84          1
       85          3

Given your example:

>>> df
             timestamp   x
0  2015-08-25 16:52:10  95
1  2015-08-25 16:52:12  84
2  2015-08-25 16:52:14  86
3  2015-08-25 16:52:16  84
4  2015-08-25 16:52:18  85
5  2015-08-25 16:52:20  86
6  2015-08-25 16:52:22  84
7  2015-08-25 16:52:24  95
8  2015-08-25 16:52:28  95
9  2015-08-25 16:52:48  80
10 2015-08-25 16:52:50  85
11 2015-08-25 16:52:52  85
12 2015-08-25 16:52:54  84
13 2015-08-25 16:52:56  85
14 2015-08-25 16:52:58  86
15 2015-08-25 16:53:00  85
16 2015-08-25 16:53:02  85
17 2015-08-25 16:53:04  85
18 2015-08-25 16:53:06  86
19 2015-08-25 16:53:08  85
20 2015-08-25 16:53:10  85

First, let's add a new column with the interval, in seconds, between each timestamp and the next one (0 for the last row):

>>> tl=df['timestamp']
>>> df['interval']=[(tl[i+1]-tl[i]).total_seconds() for i, _ in enumerate(tl[:-1])]+[0]
>>> df
             timestamp   x  interval
0  2015-08-25 16:52:10  95         2
1  2015-08-25 16:52:12  84         2
2  2015-08-25 16:52:14  86         2
3  2015-08-25 16:52:16  84         2
4  2015-08-25 16:52:18  85         2
5  2015-08-25 16:52:20  86         2
6  2015-08-25 16:52:22  84         2
7  2015-08-25 16:52:24  95         4
8  2015-08-25 16:52:28  95        20
9  2015-08-25 16:52:48  80         2
10 2015-08-25 16:52:50  85         2
11 2015-08-25 16:52:52  85         2
12 2015-08-25 16:52:54  84         2
13 2015-08-25 16:52:56  85         2
14 2015-08-25 16:52:58  86         2
15 2015-08-25 16:53:00  85         2
16 2015-08-25 16:53:02  85         2
17 2015-08-25 16:53:04  85         2
18 2015-08-25 16:53:06  86         2
19 2015-08-25 16:53:08  85         2
20 2015-08-25 16:53:10  85         0
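
The same interval column can probably be built without a Python-level loop. The sketch below is an alternative rather than the approach used above: it assumes `timestamp` is already a datetime64 column and uses a four-row excerpt of the data for brevity.

```python
import pandas as pd

# Small excerpt of the sample data (rows 6-9), for illustration only
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2015-08-25 16:52:22',
                                 '2015-08-25 16:52:24',
                                 '2015-08-25 16:52:28',
                                 '2015-08-25 16:52:48']),
    'x': [84, 95, 95, 80],
})

# diff() gives the gap to the PREVIOUS row; shift(-1) moves it up one row so
# each row holds the gap to the NEXT row. The last row has no successor, so
# its NaN is filled with 0 to match the loop-based version.
df['interval'] = (df['timestamp'].diff()
                                 .shift(-1)
                                 .dt.total_seconds()
                                 .fillna(0))
print(df)
```

The result is a float column, so the values are stored as 2.0, 4.0 and so on.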

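Now, use itertools.groupby to get each run of consecutive rows that share the same interval:

from collections import Counter
from itertools import groupby

fmt = '{} sec interval between {} and {} every {} seconds\n\tx={}, count={}\n'
# Consecutive rows with the same interval value form one group
for k, l in groupby(df.iterrows(), key=lambda row: row[1]['interval']):
    li = list(l)
    t2, t1 = li[-1][1]['timestamp'], li[0][1]['timestamp']
    ti = (t2 - t1).total_seconds()
    # Keep only spans that last at least 10 seconds
    if ti >= 10.0:
        data = [e[1]['x'] for e in li]
        print(fmt.format(ti, t1, t2, k, data, Counter(data)))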

Prints:

12.0 sec interval between 2015-08-25 16:52:10 and 2015-08-25 16:52:22 every 2.0 seconds
    x=[95, 84, 86, 84, 85, 86, 84], count=Counter({84: 3, 86: 2, 85: 1, 95: 1})

20.0 sec interval between 2015-08-25 16:52:48 and 2015-08-25 16:53:08 every 2.0 seconds
    x=[80, 85, 85, 84, 85, 86, 85, 85, 85, 86, 85], count=Counter({85: 7, 86: 2, 80: 1, 84: 1})
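
Neither listing above yet produces the single count asked for in the question. One possible way to get it is sketched below, under two assumptions that are not stated in the original: a gap of more than 2 seconds between samples breaks continuity, and a run's duration is measured from its first to its last timestamp. The function name `count_events` and its parameters are hypothetical.

```python
import pandas as pd

times = ['16:52:10', '16:52:12', '16:52:14', '16:52:16', '16:52:18',
         '16:52:20', '16:52:22', '16:52:24', '16:52:28', '16:52:48',
         '16:52:50', '16:52:52', '16:52:54', '16:52:56', '16:52:58',
         '16:53:00', '16:53:02', '16:53:04', '16:53:06', '16:53:08',
         '16:53:10']
values = [95, 84, 86, 84, 85, 86, 84, 95, 95, 80, 85,
          85, 84, 85, 86, 85, 85, 85, 86, 85, 85]
df = pd.DataFrame({'timestamp': pd.to_datetime(['2015-08-25 ' + t for t in times]),
                   'x': values})

def count_events(df, lo=84, hi=86, min_seconds=10, max_gap=2):
    """Count runs where x stays in [lo, hi] for at least min_seconds.

    max_gap is an assumption: a larger gap between samples ends the run.
    """
    in_range = df['x'].between(lo, hi)
    gap = df['timestamp'].diff().dt.total_seconds()
    # A new run starts when the in-range flag flips or the sampling gap is too large
    new_run = in_range.ne(in_range.shift(fill_value=False)) | (gap > max_gap)
    run_id = new_run.cumsum()
    # Keep only in-range rows and find the first and last timestamp of each run
    runs = df[in_range].groupby(run_id[in_range])['timestamp'].agg(['min', 'max'])
    duration = (runs['max'] - runs['min']).dt.total_seconds()
    return int((duration >= min_seconds).sum())

n_events = count_events(df)
print(n_events)
```

On the sample data, the two in-range runs are 16:52:12-16:52:22 (10 seconds) and 16:52:50-16:53:10 (20 seconds), so the second run is counted once however long it lasts.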
