
How to extract the timestamps at which a (e.g. categorical) pandas time series changes state

I recently had a problem where a pandas time series contained a signal that could take several states, and I wanted the start and end timestamps of each state so that I could construct a timeslot for each event. The input signal was a pandas Series with a Timestamp index, and the values could be either integers (e.g. the numerical representation of a category) or NaN. For NaN, I could assume that there had been no state change since the last logged state (ffill would basically fix this) and that each state change happened exactly when it was logged (so the plot ought to be a step chart, not linearly interpolated as illustrated below).

Since a timeslot is defined by its start time and end time, I am interested in a method that can extract the (start time, end time) pairs for the timeslots illustrated at the bottom of the figure.
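To make the NaN assumption concrete, here is a minimal sketch on a toy series (the name `toy` and its values are made up for illustration and are not part of the question's data). Forward filling holds the last logged state, and comparing the filled series with its shifted copy marks the rows where a new state begins.

```python
import numpy as np
import pandas as pd

# Hypothetical toy signal: NaN means "nothing logged, state unchanged"
toy = pd.Series(
    [2, np.nan, 1, np.nan, np.nan, 3],
    index=pd.date_range("2020-01-01", periods=6, freq="1min"),
)

# Hold the last logged state across the gaps
held = toy.ffill()

# True wherever the held state differs from the previous row
# (the first row always counts as the start of a state)
is_start = held.ne(held.shift())

print(held.tolist())                # [2.0, 2.0, 1.0, 1.0, 1.0, 3.0]
print(list(held.index[is_start]))   # timestamps where a state begins
```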

[Figure: illustration of the input signal and the expected result]

Data:

import numpy as np
import pandas as pd

data = [2,2,2,1,2,np.nan,np.nan,1,3,3,1,1,np.nan,
        2,1,np.nan,3,3,3,2,3,np.nan,3,1,2,1,3,3,1,
        np.nan,1,1,2,1,3,1,2,np.nan,2,1]
s = pd.Series(data=data, index=pd.date_range(start='1/1/2020', freq='S', periods=40))

OK, so this is the method I came up with. If anyone has a more efficient or elegant approach, please share.

import numpy as np
import pandas as pd

# Create the example Pandas Time Series
data = [2,2,2,1,2,np.nan,np.nan,1,3,3,1,1,np.nan,2,1,np.nan,3,3,3,2,3,np.nan,3,1,2,1,3,3,1,np.nan,1,1,2,1,3,1,2,np.nan,2,1]
dt = pd.date_range(start='1/1/2020', freq='S', periods=40)
s = pd.Series(data=data, index=dt)

# Drop NaN and calculate the state changes (an unchanged state gives 0)
s_diff = s.dropna().diff()

# Since 0 means no state change, remove those rows
s_diff = s_diff.replace(0,np.nan).dropna()

# Create a series that starts with the time series' initial value, followed by the differences between consecutive states.
s_diff = pd.concat([s[:1], s_diff])

# A cumulative sum that starts at the initial value and adds the changes recovers the actual states
s_states = s_diff.cumsum().astype(int)

# If the signal does not change at the last timestamp, we need to ensure that we still get it.
# (.iloc is used because plain s[-1] is label-based on a DatetimeIndex in recent pandas.)
s_states[s.index[-1]] = int(s.iloc[-1])

# Extract pairs of (start, end) timestamps for defining the timeslots. The .strftime method is only applied for readability. The following would probably be more useful:
# [(s_states.index[i], s_states.index[i+1]) for i in range(len(s_states)-1)]
[(s_states.index[i].strftime('%M:%S'), s_states.index[i+1].strftime('%M:%S')) for i in range(len(s_states)-1)]
Out:
[('00:00', '00:03'),
 ('00:03', '00:04'),
 ('00:04', '00:07'),
 ('00:07', '00:08'),
 ('00:08', '00:10'),
 ('00:10', '00:13'),
 ('00:13', '00:14'),
 ('00:14', '00:16'),
 ('00:16', '00:19'),
 ('00:19', '00:20'),
 ('00:20', '00:23'),
 ('00:23', '00:24'),
 ('00:24', '00:25'),
 ('00:25', '00:26'),
 ('00:26', '00:28'),
 ('00:28', '00:32'),
 ('00:32', '00:33'),
 ('00:33', '00:34'),
 ('00:34', '00:35'),
 ('00:35', '00:36'),
 ('00:36', '00:39')]
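The diff/cumsum route above relies on the states being numeric. As a possible variation (a sketch, not the answer's own method), the change points can also be found by comparing the forward-filled series with its shifted copy, which should work for string or categorical states as well. The names `filled`, `changed`, `starts` and `slots` are illustrative.

```python
import numpy as np
import pandas as pd

data = [2,2,2,1,2,np.nan,np.nan,1,3,3,1,1,np.nan,
        2,1,np.nan,3,3,3,2,3,np.nan,3,1,2,1,3,3,1,
        np.nan,1,1,2,1,3,1,2,np.nan,2,1]
s = pd.Series(data, index=pd.date_range(start="2020-01-01", periods=40, freq="s"))

# NaN means "no change", so hold the last logged state
filled = s.ffill()

# A state starts wherever the value differs from the previous row
changed = filled.ne(filled.shift())
starts = filled.index[changed]

# Close the final timeslot at the last timestamp if no change happened there
if starts[-1] != s.index[-1]:
    starts = starts.append(s.index[-1:])

# Pair each start with the next start, which is also the end of the slot
slots = list(zip(starts[:-1], starts[1:]))
print(len(slots))
print(slots[0])
```

Whether the last slot should be closed at the final timestamp is a design choice. Here it mirrors what the answer above does.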

Here's a slightly more compact method. We'll create a label for each group and then use groupby to determine where that group starts. To form the groups, ffill to deal with NaN, take the difference, and check where it is not 0 (i.e. the signal changes to any other state). A cumsum of this Boolean Series forms the group labels. Since the next group has to start when the previous group ends, we shift the start times to get the end times.

gps = s.ffill().diff().fillna(0).ne(0).cumsum()

df = s.reset_index().groupby(gps.to_numpy()).agg(start=('index', 'min'))
df['stop'] = df['start'].shift(-1)

Output:

print(df.apply(lambda x: x.dt.strftime('%M:%S')))
## If you want a list of tuples:
# list(zip(df['start'].dt.strftime('%M:%S'), df['stop'].dt.strftime('%M:%S')))

    start   stop
0   00:00  00:03
1   00:03  00:04
2   00:04  00:07
3   00:07  00:08
4   00:08  00:10
5   00:10  00:13
6   00:13  00:14
7   00:14  00:16
8   00:16  00:19
9   00:19  00:20
10  00:20  00:23
11  00:23  00:24
12  00:24  00:25
13  00:25  00:26
14  00:26  00:28
15  00:28  00:32
16  00:32  00:33
17  00:33  00:34
18  00:34  00:35
19  00:35  00:36
20  00:36  00:39
21  00:39    NaT   # Drop the last row if you don't want this
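If the state of each timeslot is also needed, the same grouping idea can be wrapped in a small helper. This is only a sketch under the question's assumptions (NaN means no change). The function name `state_slots` and the column names are hypothetical, and the open-ended last row is dropped here, which may not suit every use case.

```python
import numpy as np
import pandas as pd

def state_slots(series):
    """Return one row per timeslot with its start, stop and state (hypothetical helper)."""
    filled = series.ffill()
    # Label consecutive runs of the same state: 1, 2, 3, ...
    groups = filled.ne(filled.shift()).cumsum()
    frame = pd.DataFrame({
        "time": filled.index,
        "state": filled.to_numpy(),
        "group": groups.to_numpy(),
    })
    out = frame.groupby("group").agg(start=("time", "first"), state=("state", "first"))
    # The next run's start is this run's stop
    out["stop"] = out["start"].shift(-1)
    # Drop the last run, which has no known end
    return out.dropna(subset=["stop"]).reset_index(drop=True)

data = [2,2,2,1,2,np.nan,np.nan,1,3,3,1,1,np.nan,
        2,1,np.nan,3,3,3,2,3,np.nan,3,1,2,1,3,3,1,
        np.nan,1,1,2,1,3,1,2,np.nan,2,1]
s = pd.Series(data, index=pd.date_range(start="2020-01-01", periods=40, freq="s"))

result = state_slots(s)
print(result.head())
```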


