Pandas flatten a time-series dataframe on same activity but different timestamps
I'm looking to flatten certain processes, basically by collapsing duplicates that come right after each other. Let's say I have a dataframe:
import pandas as pd
d = {'time': ['12-08-2020', '13-08-2020', '14-08-2020', '15-08-2020', '16-08-2020'], 'state': ['off', 'on', 'on', 'on', 'off']}
df = pd.DataFrame(data=d)
Then I would use shift() on the time column to create the "time_end" column, which is basically the next row's time (that is, df['time'].shift(-1)). Result:
time state time_end
0 12-08-2020 off 13-08-2020
1 13-08-2020 on 14-08-2020
2 14-08-2020 on 15-08-2020
3 15-08-2020 on 16-08-2020
4 16-08-2020 off NaN
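For reference, a minimal sketch of how the time_end column above could be built. It assumes the dates are stored as strings and that shift(-1) is the intended call, since the table pairs each row with the following row's time.

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['12-08-2020', '13-08-2020', '14-08-2020', '15-08-2020', '16-08-2020'],
    'state': ['off', 'on', 'on', 'on', 'off'],
})

# shift(-1) moves every value up one row, so each row sees the next row's time.
# The last row has no successor and ends up with a missing value.
df['time_end'] = df['time'].shift(-1)
print(df)
```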
My question now is: how do I flatten it so that it actually becomes 3 lines, like this:
time state time_end
0 12-08-2020 off 13-08-2020
1 13-08-2020 on 16-08-2020
4 16-08-2020 off NaN
For my code I don't need repeated on's if they are followed by another on. Any help would be appreciated.
We can label each run of consecutive identical state values using .shift() + .ne() + .cumsum().
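To see why this combination labels the runs, here is a small sketch that spells out each intermediate step on the sample state values. The variable names prev, is_new_run and run_id are only illustrative.

```python
import pandas as pd

state = pd.Series(['off', 'on', 'on', 'on', 'off'], name='state')

# Previous row's state; the first row has no predecessor, so it is missing.
prev = state.shift()

# True wherever the state differs from the previous row, i.e. a new run starts.
is_new_run = state.ne(prev)

# Running count of run starts: every row in the same run shares one label.
run_id = is_new_run.cumsum()

print(pd.DataFrame({
    'state': state,
    'prev': prev,
    'is_new_run': is_new_run,
    'run_id': run_id,
}))
```

The first row always counts as a run start because a real value never equals a missing one.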
Then, for each group (of consecutive same state), we take the first entry of time and the last entry of time_end using .groupby() + .agg(), as follows:
df['state_group'] = df['state'].ne(df['state'].shift()).cumsum()
df_out = df.groupby('state_group').agg({'time': 'first', 'state': 'first', 'time_end': 'last'}).reset_index(drop=True)
Result:
print(df_out)
time state time_end
0 12-08-2020 off 13-08-2020
1 13-08-2020 on 16-08-2020
2 16-08-2020 off None
Just for information, the first line of code above creates the following interim dataframe, in which state_group labels each run of consecutive same state. We aggregate on this grouping to get the desired flattened result.
time state time_end state_group
0 12-08-2020 off 13-08-2020 1
1 13-08-2020 on 14-08-2020 2
2 14-08-2020 on 15-08-2020 2
3 15-08-2020 on 16-08-2020 2
4 16-08-2020 off NaN 3
We can filter the DataFrame to keep only the rows where the current row's state value does not equal the previous row's state value, then create the time_end column by shifting the filtered time column back by one row:
import pandas as pd
df = pd.DataFrame(data={
'time': ['12-08-2020', '13-08-2020', '14-08-2020', '15-08-2020',
'16-08-2020'],
'state': ['off', 'on', 'on', 'on', 'off']
})
new_df = df[df['state'].ne(df['state'].shift())].reset_index(drop=True)
new_df['time_end'] = new_df['time'].shift(-1)
new_df
new_df:
time state time_end
0 12-08-2020 off 13-08-2020
1 13-08-2020 on 16-08-2020
2 16-08-2020 off NaN
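The two answers should agree whenever time_end is simply the next row's time. Below is a minimal sketch that checks this on the sample data. The helper names flatten_by_groupby and flatten_by_filter are hypothetical and do not come from either answer.

```python
import pandas as pd

def flatten_by_groupby(df):
    # Hypothetical helper: label each run of equal states, then aggregate per run.
    out = df.assign(time_end=df['time'].shift(-1))
    run_id = out['state'].ne(out['state'].shift()).cumsum().rename('run_id')
    return (out.groupby(run_id)
               .agg({'time': 'first', 'state': 'first', 'time_end': 'last'})
               .reset_index(drop=True))

def flatten_by_filter(df):
    # Hypothetical helper: keep the first row of each run, then look one row ahead.
    out = df[df['state'].ne(df['state'].shift())].reset_index(drop=True)
    out['time_end'] = out['time'].shift(-1)
    return out

df = pd.DataFrame({
    'time': ['12-08-2020', '13-08-2020', '14-08-2020', '15-08-2020', '16-08-2020'],
    'state': ['off', 'on', 'on', 'on', 'off'],
})

a = flatten_by_groupby(df)
b = flatten_by_filter(df)
print(a)
print(b)
```

The filter version does not need a precomputed time_end column, while the groupby version extends more easily when other columns need their own aggregation per run.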