How to extract the timestamps whenever a (e.g. categorical) pandas time series changes state
I recently had a problem where a pandas time series contained a signal that could take several states, and I was interested in the start and end timestamps of each state so that I could construct timeslots for each event. The input signal was a pandas Series with a Timestamp index, and the values could be either integers (e.g. a numerical representation of a category) or NaN. For NaN, I could assume that there had been no state change since the last state (ffill would basically fix this) and that each state change happened exactly when it was logged (so the plot actually ought to be a step chart, not linearly interpolated as illustrated below).
Since timeslots are defined by their start time and end time, I am interested in a method that can extract the pairs of (start time, end time) for the timeslots illustrated at the bottom of the figure.
Data:
import numpy as np
import pandas as pd
data = [2,2,2,1,2,np.nan,np.nan,1,3,3,1,1,np.nan,
2,1,np.nan,3,3,3,2,3,np.nan,3,1,2,1,3,3,1,
np.nan,1,1,2,1,3,1,2,np.nan,2,1]
# Lowercase 's' is the seconds alias; uppercase 'S' is deprecated in recent pandas
s = pd.Series(data=data, index=pd.date_range(start='1/1/2020', freq='s', periods=40))
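The NaN assumption stated in the question (no state change since the last logged state) can be made explicit before any further processing. The following is a minimal sketch, not part of the original post; the name `filled` is chosen here for illustration only.

```python
import numpy as np
import pandas as pd

data = [2, 2, 2, 1, 2, np.nan, np.nan, 1, 3, 3, 1, 1, np.nan,
        2, 1, np.nan, 3, 3, 3, 2, 3, np.nan, 3, 1, 2, 1, 3, 3, 1,
        np.nan, 1, 1, 2, 1, 3, 1, 2, np.nan, 2, 1]
s = pd.Series(data, index=pd.date_range(start="2020-01-01", periods=40, freq="s"))

# Carry the last known state forward over the gaps (step-chart semantics).
# `filled` is a hypothetical name used only in this sketch.
filled = s.ffill().astype(int)

# Show the first gap: the NaN at 00:05 and 00:06 takes the state logged at 00:04.
print(pd.DataFrame({"raw": s, "filled": filled}).iloc[3:8])
```

Note that `ffill` cannot fill a NaN at the very start of a series, so the cast to `int` relies on the first sample being a logged state, as it is in this data.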
OK, so this is the method I came up with. If anyone has a more efficient or elegant approach, please share.
import numpy as np
import pandas as pd
# Create the example Pandas Time Series
data = [2,2,2,1,2,np.nan,np.nan,1,3,3,1,1,np.nan,2,1,np.nan,3,3,3,2,3,np.nan,3,1,2,1,3,3,1,np.nan,1,1,2,1,3,1,2,np.nan,2,1]
dt = pd.date_range(start='1/1/2020', freq='s', periods=40)
s = pd.Series(data=data, index=dt)
# Drop NaN and calculate the state changes (an unchanged state gives 0)
s_diff = s.dropna().diff()
# Since 0 means no state change, remove those rows
s_diff = s_diff.replace(0, np.nan).dropna()
# Create a series that starts with the time series' initial condition, followed by the state-change differences.
s_diff = pd.concat([s.iloc[:1], s_diff])
# We can now do a cumulative sum that starts at the initial value and adds the changes to recover the actual states
s_states = s_diff.cumsum().astype(int)
# If the signal does not change at the last timestamp, we need to ensure that we still get it.
# Use .iloc for positional access (s[-1] is deprecated on a DatetimeIndex) and ffill in case the last sample is NaN.
s_states[s.index[-1]] = int(s.ffill().iloc[-1])
# Extract pairs of (start, end) timestamps for defining the timeslots. The .strftime method is only applied for readability. The following would probably be more useful:
# [(s_states.index[i], s_states.index[i+1]) for i in range(len(s_states)-1)]
[(s_states.index[i].strftime('%M:%S'), s_states.index[i+1].strftime('%M:%S')) for i in range(len(s_states)-1)]
Out:
[('00:00', '00:03'),
('00:03', '00:04'),
('00:04', '00:07'),
('00:07', '00:08'),
('00:08', '00:10'),
('00:10', '00:13'),
('00:13', '00:14'),
('00:14', '00:16'),
('00:16', '00:19'),
('00:19', '00:20'),
('00:20', '00:23'),
('00:23', '00:24'),
('00:24', '00:25'),
('00:25', '00:26'),
('00:26', '00:28'),
('00:28', '00:32'),
('00:32', '00:33'),
('00:33', '00:34'),
('00:34', '00:35'),
('00:35', '00:36'),
('00:36', '00:39')]
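For comparison, the same (start, end) pairs can be obtained without the diff and cumsum round trip by comparing the forward-filled series with a shifted copy of itself. This is a different technique from the one above, sketched here under the question's assumption that NaN means "no change"; the names `filled`, `is_start`, `edges` and `slots` are illustrative only.

```python
import numpy as np
import pandas as pd

data = [2, 2, 2, 1, 2, np.nan, np.nan, 1, 3, 3, 1, 1, np.nan,
        2, 1, np.nan, 3, 3, 3, 2, 3, np.nan, 3, 1, 2, 1, 3, 3, 1,
        np.nan, 1, 1, 2, 1, 3, 1, 2, np.nan, 2, 1]
s = pd.Series(data, index=pd.date_range(start="2020-01-01", periods=40, freq="s"))

filled = s.ffill()

# A sample starts a new state when it differs from the previous sample.
# The first sample is compared with NaN, so it always counts as a start.
is_start = filled.ne(filled.shift())
starts = filled[is_start].index

# Make sure the final timestamp closes the last slot even if no change happens there.
edges = starts.union(s.index[-1:])

# Consecutive edges form the (start, end) pairs.
slots = list(zip(edges[:-1], edges[1:]))
print([(a.strftime("%M:%S"), b.strftime("%M:%S")) for a, b in slots[:3]])
```

Because the comparison works on values rather than differences, this variant should also work for non-numeric states such as strings, where `diff` is not defined.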
Here's a slightly more compact method. We'll create a label for each group and then use groupby to determine where that group starts. To form these groups, ffill to deal with NaN, take the difference and check where it is not 0 (i.e. the series changes to any other state). A cumsum of this Boolean Series forms the groups. Since the next group has to start when the previous group ends, we shift to get the end times.
gps = s.ffill().diff().fillna(0).ne(0).cumsum()
df = s.reset_index().groupby(gps.to_numpy()).agg(start=('index', 'min'))
df['stop'] = df['start'].shift(-1)
print(df.apply(lambda x: x.dt.strftime('%M:%S')))
## If you want a list of tuples:
# list(zip(df['start'].dt.strftime('%M:%S'), df['stop'].dt.strftime('%M:%S')))
start stop
0 00:00 00:03
1 00:03 00:04
2 00:04 00:07
3 00:07 00:08
4 00:08 00:10
5 00:10 00:13
6 00:13 00:14
7 00:14 00:16
8 00:16 00:19
9 00:19 00:20
10 00:20 00:23
11 00:23 00:24
12 00:24 00:25
13 00:25 00:26
14 00:26 00:28
15 00:28 00:32
16 00:32 00:33
17 00:33 00:34
18 00:34 00:35
19 00:35 00:36
20 00:36 00:39
21 00:39 NaT # Drop the last row if you don't want this
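If the state of each timeslot is needed alongside its boundaries, the groupby approach extends naturally. The sketch below is an illustrative variation rather than the answerer's code: the column names `state`, `time` and `duration` and the variable `slots_df` are assumptions made here, and the open-ended last row is dropped.

```python
import numpy as np
import pandas as pd

data = [2, 2, 2, 1, 2, np.nan, np.nan, 1, 3, 3, 1, 1, np.nan,
        2, 1, np.nan, 3, 3, 3, 2, 3, np.nan, 3, 1, 2, 1, 3, 3, 1,
        np.nan, 1, 1, 2, 1, 3, 1, 2, np.nan, 2, 1]
s = pd.Series(data, index=pd.date_range(start="2020-01-01", periods=40, freq="s"))

filled = s.ffill()

# Group label: increases by one every time the forward-filled state changes.
gps = filled.ne(filled.shift()).cumsum()

df = (
    filled.rename("state").rename_axis("time").reset_index()
    .groupby(gps.to_numpy())
    .agg(state=("state", "first"), start=("time", "min"))
)
df["state"] = df["state"].astype(int)
df["stop"] = df["start"].shift(-1)          # next group's start is this group's end
df["duration"] = df["stop"] - df["start"]   # NaT for the open-ended last group

# Drop the last, open-ended group and build the list of (start, stop) tuples.
slots_df = df.dropna(subset=["stop"])
slots = list(zip(slots_df["start"], slots_df["stop"]))
print(slots_df.head())
```

Keeping the state next to the timestamps makes it easy to filter afterwards, for example `slots_df[slots_df["state"] == 3]` for all timeslots spent in state 3.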