Masking / pivot / reshape pandas DataFrame to create new timestamps
I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'Start': ['2022-06-07 06:24:48', '2022-06-07 14:37:16', '2022-06-07 08:00:59',
                             '2022-06-07 17:06:55', '2022-06-07 06:02:41', '2022-06-07 13:03:17',
                             '2022-06-07 05:02:01'],
                   'End': ['2022-06-07 14:07:00', '2022-06-07 21:51:21', '2022-06-07 13:18:34',
                           '2022-06-07 22:14:35', '2022-06-07 10:56:35', '2022-06-07 17:20:08',
                           '2022-06-07 23:32:42'],
                   'Process': ['PROD', 'PROD', 'VORG', 'VORG', 'NCPNA', 'NCPNA', 'STO'],
                   'Value': ['', '', 'FAUF1', 'FAUF2', 'PROG1', 'PROG2', 'ERR1'],
                   'Duration Min': [462, 434, 318, 308, 294, 257, 1110]})
I would like to create events that depend on the "Process=PROD" events and are based on their start and end timestamps, depending on whether these timestamps fall before, between, or after the "Process=PROD" events.

So that I get the following output:
Start End Process Value Duration Min Marker
0 2022-06-07 06:24:48 2022-06-07 14:07:00 PROD 462 Original
1 2022-06-07 14:37:16 2022-06-07 21:51:21 PROD 434 Original
2 2022-06-07 08:00:59 2022-06-07 13:18:34 VORG FAUF1 318 Original
3 2022-06-07 17:06:55 2022-06-07 22:14:35 VORG FAUF2 308 Original
4 2022-06-07 06:02:41 2022-06-07 10:56:35 NCPNA PROG1 294 Original
5 2022-06-07 13:03:17 2022-06-07 17:20:08 NCPNA PROG2 257 Original
6 2022-06-07 05:02:01 2022-06-07 23:32:42 STO ERR1 1110 Original
7 2022-06-07 08:00:59 2022-06-07 13:18:34 VORG FAUF1 318 PROD
8 2022-06-07 17:06:55 2022-06-07 21:51:21 VORG FAUF2 284 PROD
9 2022-06-07 06:24:48 2022-06-07 10:56:35 NCPNA PROG1 271 PROD
10 2022-06-07 13:03:17 2022-06-07 14:07:00 NCPNA PROG2 63 PROD
11 2022-06-07 14:37:16 2022-06-07 17:20:08 NCPNA PROG2 162 PROD
12 2022-06-07 06:24:48 2022-06-07 14:07:00 STO ERR1 462 PROD
13 2022-06-07 14:37:16 2022-06-07 21:51:21 STO ERR1 434 PROD
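The core operation behind every "PROD" row above is an interval intersection: a non-PROD event clipped to a PROD window runs from the later of the two starts to the earlier of the two ends, and is kept only if that span is positive. A minimal sketch using the first PROD window and the PROG1 event from the table:

```python
import pandas as pd

# Clip one event to one PROD window: the overlap is
# [max(starts), min(ends)], valid only when start < end.
prod_start = pd.Timestamp('2022-06-07 06:24:48')
prod_end = pd.Timestamp('2022-06-07 14:07:00')

event_start = pd.Timestamp('2022-06-07 06:02:41')  # NCPNA PROG1
event_end = pd.Timestamp('2022-06-07 10:56:35')

new_start = max(event_start, prod_start)
new_end = min(event_end, prod_end)
duration_min = (new_end - new_start).total_seconds() / 60
print(new_start, new_end, round(duration_min))
```

This reproduces row 9 of the desired output (271 minutes, rounding aside).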
Here is a picture of what I actually mean:
IIUC, you could use merge_asof to cut your intervals:
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

m = df['Process'].eq('PROD')

df1 = (pd.merge_asof(df[~m].sort_values(by='End'),
                     df[m].sort_values(by='Start')[['Start', 'End']],
                     left_on='End', right_on='Start')
         .assign(**{'Start': lambda d: d[['Start_x', 'Start_y']].max(axis=1),
                    'End': lambda d: d[['End_x', 'End_y']].min(axis=1),
                    'Duration Min': lambda d: d['End'].sub(d['Start']).dt.total_seconds().div(60),
                    })
       )

df2 = (pd.merge_asof(df[~m].sort_values(by='Start'),
                     df[m].sort_values(by='End')[['Start', 'End']],
                     left_on='Start', right_on='End', direction='forward')
         .assign(**{'Start': lambda d: d[['Start_x', 'Start_y']].min(axis=1),
                    'End': lambda d: d[['End_x', 'End_y']].max(axis=1),
                    'Duration Min': lambda d: d['End'].sub(d['Start']).dt.total_seconds().div(60),
                    })
       )

out = (pd.concat([df.assign(Marker='Original'), df1, df2])
         .drop(columns=['Start_x', 'End_x', 'Start_y', 'End_y'])
         .drop_duplicates()
         .fillna({'Marker': 'PROD'})
       )
Output:
Start End Process Value Duration Min Marker
0 2022-06-07 06:24:48 2022-06-07 14:07:00 PROD 462.000000 Original
1 2022-06-07 14:37:16 2022-06-07 21:51:21 PROD 434.000000 Original
2 2022-06-07 08:00:59 2022-06-07 13:18:34 VORG FAUF1 318.000000 Original
3 2022-06-07 17:06:55 2022-06-07 22:14:35 VORG FAUF2 308.000000 Original
4 2022-06-07 06:02:41 2022-06-07 10:56:35 NCPNA PROG1 294.000000 Original
5 2022-06-07 13:03:17 2022-06-07 17:20:08 NCPNA PROG2 257.000000 Original
6 2022-06-07 05:02:01 2022-06-07 23:32:42 STO ERR1 1110.000000 Original
0 2022-06-07 06:24:48 2022-06-07 10:56:35 NCPNA PROG1 271.783333 PROD
1 2022-06-07 08:00:59 2022-06-07 13:18:34 VORG FAUF1 317.583333 PROD
2 2022-06-07 14:37:16 2022-06-07 17:20:08 NCPNA PROG2 162.866667 PROD
3 2022-06-07 17:06:55 2022-06-07 21:51:21 VORG FAUF2 284.433333 PROD
4 2022-06-07 14:37:16 2022-06-07 21:51:21 STO ERR1 434.083333 PROD
0 2022-06-07 05:02:01 2022-06-07 23:32:42 STO ERR1 1110.683333 PROD
1 2022-06-07 06:02:41 2022-06-07 14:07:00 NCPNA PROG1 484.316667 PROD
2 2022-06-07 06:24:48 2022-06-07 14:07:00 VORG FAUF1 462.200000 PROD
3 2022-06-07 06:24:48 2022-06-07 17:20:08 NCPNA PROG2 655.333333 PROD
4 2022-06-07 14:37:16 2022-06-07 22:14:35 VORG FAUF2 457.316667 PROD
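For context on how merge_asof pairs the rows above: with the default direction='backward', each left row takes the last right row whose key is less than or equal to its own; with direction='forward', it takes the first right row at or after it. A minimal sketch, independent of this question's data:

```python
import pandas as pd

left = pd.DataFrame({'t': pd.to_datetime(['2022-06-07 10:00', '2022-06-07 18:00'])})
right = pd.DataFrame({'t': pd.to_datetime(['2022-06-07 06:00', '2022-06-07 14:00']),
                      'label': ['A', 'B']})

# backward (default): last right row with t <= left t
back = pd.merge_asof(left, right, on='t')
# forward: first right row with t >= left t
fwd = pd.merge_asof(left, right, on='t', direction='forward')

print(back['label'].tolist())  # ['A', 'B']
print(fwd['label'].tolist())   # ['B', nan] -- no right row after 18:00
```

Both inputs must already be sorted on the merge key, which is why the answer sorts by 'Start' or 'End' before each call.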
Unfortunately, the dataframe out still contains a few duplicates that should be removed.
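One possible cleanup, a sketch rather than a definitive fix: the leftover rows differ only in the computed duration (e.g. 1110 vs. 1110.68 for the same STO interval), so treating rows as duplicates when Process, Value, Start, and End all match and keeping the first occurrence removes them. Shown here on a small frame shaped like out:

```python
import pandas as pd

# Mimic the leftover near-duplicates in 'out': same interval and process,
# durations differing only because one was computed with seconds precision.
out = pd.DataFrame({
    'Start': ['2022-06-07 05:02:01', '2022-06-07 05:02:01', '2022-06-07 08:00:59'],
    'End': ['2022-06-07 23:32:42', '2022-06-07 23:32:42', '2022-06-07 13:18:34'],
    'Process': ['STO', 'STO', 'VORG'],
    'Value': ['ERR1', 'ERR1', 'FAUF1'],
    'Duration Min': [1110.0, 1110.683333, 318.0],
    'Marker': ['Original', 'PROD', 'Original'],
})

# Ignore the duration when deciding what counts as a duplicate.
out_clean = out.drop_duplicates(subset=['Process', 'Value', 'Start', 'End'],
                                keep='first')
print(out_clean['Marker'].tolist())  # ['Original', 'Original']
```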
I tried an iteration, but the adjusted events are overwritten again and again, so no further events are created if an original event has to be cut twice.
import numpy as np

m = df.loc[df['Process'] == 'PROD']
for index, row in m.iterrows():
    start = row["Start"]
    ende = row["End"]
    df.loc[(df['Process'] != 'PROD') & (df['Start'] < start) & (df['End'] < ende) & (df['End'] > start),
           ['Marker', 'Start_x', 'Ende_x']] = ["PROD", start, np.nan]
    df.loc[(df['Process'] != 'PROD') & (df['Start'] < start) & (df['End'] > ende),
           ['Marker', 'Start_x', 'Ende_x']] = ["PROD", start, ende]
    df.loc[(df['Process'] != 'PROD') & (df['Start'] > start) & (df['End'] < ende),
           ['Marker', 'Start_x', 'Ende_x']] = ["PROD", np.nan, np.nan]
    df.loc[(df['Process'] != 'PROD') & (df['Start'] > start) & (df['End'] > ende) & (df['Start'] < ende),
           ['Marker', 'Start_x', 'Ende_x']] = ["PROD", np.nan, ende]
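One way to avoid the overwrite problem in the loop above is to stop mutating df in place: collect every cut event in a list and concatenate at the end, so an original event cut by two PROD windows produces two new rows instead of one overwritten row. A sketch, not the asker's code:

```python
import pandas as pd

df = pd.DataFrame({
    'Start': ['2022-06-07 06:24:48', '2022-06-07 14:37:16', '2022-06-07 08:00:59',
              '2022-06-07 17:06:55', '2022-06-07 06:02:41', '2022-06-07 13:03:17',
              '2022-06-07 05:02:01'],
    'End': ['2022-06-07 14:07:00', '2022-06-07 21:51:21', '2022-06-07 13:18:34',
            '2022-06-07 22:14:35', '2022-06-07 10:56:35', '2022-06-07 17:20:08',
            '2022-06-07 23:32:42'],
    'Process': ['PROD', 'PROD', 'VORG', 'VORG', 'NCPNA', 'NCPNA', 'STO'],
    'Value': ['', '', 'FAUF1', 'FAUF2', 'PROG1', 'PROG2', 'ERR1'],
    'Duration Min': [462, 434, 318, 308, 294, 257, 1110]})
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

prod = df[df['Process'] == 'PROD']
others = df[df['Process'] != 'PROD']

cut_events = []
for _, p in prod.iterrows():
    for _, r in others.iterrows():
        start = max(r['Start'], p['Start'])   # interval intersection
        end = min(r['End'], p['End'])
        if start < end:                       # keep only real overlaps
            row = r.copy()
            row['Start'], row['End'] = start, end
            row['Duration Min'] = (end - start).total_seconds() / 60
            row['Marker'] = 'PROD'
            cut_events.append(row)

result = pd.concat([df.assign(Marker='Original'), pd.DataFrame(cut_events)],
                   ignore_index=True)
print(len(result))  # 14 rows: 7 originals + 7 cut events
```

This is O(n*m) in the number of PROD and non-PROD events, which is fine at this scale; the merge_asof approach above scales better for large frames.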
Statement: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original source. For any questions contact: yoyou2525@163.com.