
Masking / pivot / reshape pandas DataFrame to create new timestamps

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'Start':['2022-06-07 06:24:48','2022-06-07 14:37:16','2022-06-07 08:00:59', '2022-06-07 17:06:55','2022-06-07 06:02:41', '2022-06-07 13:03:17', '2022-06-07 05:02:01'],
'End':['2022-06-07 14:07:00','2022-06-07 21:51:21','2022-06-07 13:18:34','2022-06-07 22:14:35','2022-06-07 10:56:35', '2022-06-07 17:20:08', '2022-06-07 23:32:42'],
'Process':['PROD','PROD','VORG','VORG','NCPNA','NCPNA','STO'],
'Value':['','','FAUF1','FAUF2','PROG1','PROG2','ERR1'],
'Duration Min':[462,434,318,308,294,257,1110]})

I would like to create new events that depend on the "Process=PROD" events and are based on their start and end timestamps. How each new event is cut depends on whether the original event's timestamps fall before, between, or after those of the "Process=PROD" events.

The result should look like this:

                  Start                  End Process  Value  Duration Min    Marker
0   2022-06-07 06:24:48  2022-06-07 14:07:00    PROD                  462  Original
1   2022-06-07 14:37:16  2022-06-07 21:51:21    PROD                  434  Original
2   2022-06-07 08:00:59  2022-06-07 13:18:34    VORG  FAUF1           318  Original
3   2022-06-07 17:06:55  2022-06-07 22:14:35    VORG  FAUF2           308  Original
4   2022-06-07 06:02:41  2022-06-07 10:56:35   NCPNA  PROG1           294  Original
5   2022-06-07 13:03:17  2022-06-07 17:20:08   NCPNA  PROG2           257  Original
6   2022-06-07 05:02:01  2022-06-07 23:32:42     STO   ERR1          1110  Original
7   2022-06-07 08:00:59  2022-06-07 13:18:34    VORG  FAUF1           318      PROD
8   2022-06-07 17:06:55  2022-06-07 21:51:21    VORG  FAUF2           284      PROD
9   2022-06-07 06:24:48  2022-06-07 10:56:35   NCPNA  PROG1           271      PROD
10  2022-06-07 13:03:17  2022-06-07 14:07:00   NCPNA  PROG1            63      PROD
11  2022-06-07 14:37:16  2022-06-07 17:20:08   NCPNA  PROG2           162      PROD
12  2022-06-07 06:24:48  2022-06-07 14:07:00     STO   ERR1           462      PROD
13  2022-06-07 14:37:16  2022-06-07 21:51:21     STO   ERR1           434      PROD

Here is a picture of what I actually mean:

[image]

IIUC, you could use merge_asof to cut your intervals:

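# Parse the timestamp strings so they can be compared and subtracted
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# Boolean mask that selects the PROD rows
m = df['Process'].eq('PROD')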
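# Backward search: pair each non-PROD event with the last PROD interval whose
# Start is <= the event's End, then keep the later Start and the earlier End
df1 = (pd.merge_asof(df[~m].sort_values(by='End'),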
                     df[m].sort_values(by='Start')[['Start', 'End']],
                     left_on='End', right_on='Start')
         .assign(**{'Start': lambda d: d[['Start_x', 'Start_y']].max(1),
                    'End': lambda d: d[['End_x', 'End_y']].min(1),
                    'Duration Min': lambda d: d['End'].sub(d['Start']).dt.total_seconds().div(60)
                   }
                )
      )

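# Forward search: pair each non-PROD event with the first PROD interval whose
# End is >= the event's Start, then keep the earlier Start and the later End
df2 = (pd.merge_asof(df[~m].sort_values(by='Start'),
                     df[m].sort_values(by='End')[['Start', 'End']],
                     left_on='Start', right_on='End', direction='forward')
         .assign(**{'Start': lambda d: d[['Start_x', 'Start_y']].min(1),
                    'End': lambda d: d[['End_x', 'End_y']].max(1),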
                    'Duration Min': lambda d: d['End'].sub(d['Start']).dt.total_seconds().div(60)
                   }
                )
      )

out = (pd.concat([df.assign(Marker='Original'), df1, df2])
         .drop(columns=['Start_x', 'End_x', 'Start_y', 'End_y']).drop_duplicates()
         .fillna({'Marker': 'PROD'})
      )

Output:

                Start                 End Process  Value  Duration Min    Marker
0 2022-06-07 06:24:48 2022-06-07 14:07:00    PROD           462.000000  Original
1 2022-06-07 14:37:16 2022-06-07 21:51:21    PROD           434.000000  Original
2 2022-06-07 08:00:59 2022-06-07 13:18:34    VORG  FAUF1    318.000000  Original
3 2022-06-07 17:06:55 2022-06-07 22:14:35    VORG  FAUF2    308.000000  Original
4 2022-06-07 06:02:41 2022-06-07 10:56:35   NCPNA  PROG1    294.000000  Original
5 2022-06-07 13:03:17 2022-06-07 17:20:08   NCPNA  PROG2    257.000000  Original
6 2022-06-07 05:02:01 2022-06-07 23:32:42     STO   ERR1   1110.000000  Original
0 2022-06-07 06:24:48 2022-06-07 10:56:35   NCPNA  PROG1    271.783333      PROD
1 2022-06-07 08:00:59 2022-06-07 13:18:34    VORG  FAUF1    317.583333      PROD
2 2022-06-07 14:37:16 2022-06-07 17:20:08   NCPNA  PROG2    162.866667      PROD
3 2022-06-07 17:06:55 2022-06-07 21:51:21    VORG  FAUF2    284.433333      PROD
4 2022-06-07 14:37:16 2022-06-07 21:51:21     STO   ERR1    434.083333      PROD
0 2022-06-07 05:02:01 2022-06-07 23:32:42     STO   ERR1   1110.683333      PROD
1 2022-06-07 06:02:41 2022-06-07 14:07:00   NCPNA  PROG1    484.316667      PROD
2 2022-06-07 06:24:48 2022-06-07 14:07:00    VORG  FAUF1    462.200000      PROD
3 2022-06-07 06:24:48 2022-06-07 17:20:08   NCPNA  PROG2    655.333333      PROD
4 2022-06-07 14:37:16 2022-06-07 22:14:35    VORG  FAUF2    457.316667      PROD
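
One thing worth keeping in mind with this approach: merge_asof returns at most one match per left-hand row, so an event that overlaps both PROD intervals (such as the STO row) can only be cut against one of them per call. The following is a minimal sketch of that behaviour, using timestamps from the question; the variable names prod, event and matched are only illustrative.

```python
import pandas as pd

# Two PROD windows and one long event that overlaps both (values from the question)
prod = pd.DataFrame({
    "Start": pd.to_datetime(["2022-06-07 06:24:48", "2022-06-07 14:37:16"]),
    "End": pd.to_datetime(["2022-06-07 14:07:00", "2022-06-07 21:51:21"]),
})
event = pd.DataFrame({
    "Start": pd.to_datetime(["2022-06-07 05:02:01"]),
    "End": pd.to_datetime(["2022-06-07 23:32:42"]),
    "Process": ["STO"],
})

# Default direction is 'backward': only the latest PROD Start <= event End is matched
matched = pd.merge_asof(event, prod, left_on="End", right_on="Start")

print(len(matched))                # one row, although two PROD windows overlap
print(matched["Start_y"].iloc[0])  # the Start of the second PROD window
```

The first PROD window is never paired with the event in this call, which is why a second pass or a different join is needed for events that have to be cut twice.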

Unfortunately, the DataFrame out still contains a few duplicates that should be removed.
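
One possible way to avoid the surplus rows, sketched here under assumptions rather than taken from the answer above, is to pair every non-PROD event with every PROD interval using a cross join (merge with how='cross', available in pandas 1.2 and later), clip each pair to its overlap, and drop the pairs that do not overlap at all. Truncating the duration to whole minutes is an assumption based on the expected output in the question; the names is_prod, pairs and clipped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": ["2022-06-07 06:24:48", "2022-06-07 14:37:16", "2022-06-07 08:00:59",
              "2022-06-07 17:06:55", "2022-06-07 06:02:41", "2022-06-07 13:03:17",
              "2022-06-07 05:02:01"],
    "End": ["2022-06-07 14:07:00", "2022-06-07 21:51:21", "2022-06-07 13:18:34",
            "2022-06-07 22:14:35", "2022-06-07 10:56:35", "2022-06-07 17:20:08",
            "2022-06-07 23:32:42"],
    "Process": ["PROD", "PROD", "VORG", "VORG", "NCPNA", "NCPNA", "STO"],
    "Value": ["", "", "FAUF1", "FAUF2", "PROG1", "PROG2", "ERR1"],
    "Duration Min": [462, 434, 318, 308, 294, 257, 1110],
})
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])

is_prod = df["Process"].eq("PROD")
prod = df.loc[is_prod, ["Start", "End"]]

# Every non-PROD event paired with every PROD interval
pairs = df[~is_prod].merge(prod, how="cross", suffixes=("", "_prod"))

# Intersection of the two intervals: later Start, earlier End
pairs["Start"] = pairs[["Start", "Start_prod"]].max(axis=1)
pairs["End"] = pairs[["End", "End_prod"]].min(axis=1)

# Keep only pairs that really overlap, then recompute the duration
clipped = pairs[pairs["Start"] < pairs["End"]].drop(columns=["Start_prod", "End_prod"])
clipped["Duration Min"] = (
    (clipped["End"] - clipped["Start"]).dt.total_seconds().floordiv(60).astype(int)
)
clipped["Marker"] = "PROD"

out = pd.concat([df.assign(Marker="Original"), clipped], ignore_index=True)
print(out)
```

The cross join grows with the number of events times the number of PROD intervals, so this sketch is best suited to a moderate number of PROD rows.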

I also tried an iterative approach, but the adjusted events get overwritten on every pass, so no additional events are created when an original event has to be cut twice.

import numpy as np

m = df.loc[df['Process'] == 'PROD']
for index, row in m.iterrows():
    start = row["Start"]
    ende = row["End"]

    # event starts before the PROD window and ends inside it
    df.loc[(df['Process'] != 'PROD') & (df['Start'] < start) & (df['End'] < ende) & (df['End'] > start),['Marker','Start_x', 'Ende_x']]  = ["PROD", start, np.nan]
    # event spans the whole PROD window
    df.loc[(df['Process'] != 'PROD') & (df['Start'] < start) & (df['End'] > ende),['Marker','Start_x', 'Ende_x']]  = ["PROD", start, ende]
    # event lies completely inside the PROD window
    df.loc[(df['Process'] != 'PROD') & (df['Start'] > start) & (df['End'] < ende),['Marker','Start_x', 'Ende_x']]  = ["PROD", np.nan, np.nan]
    # event starts inside the PROD window and ends after it
    df.loc[(df['Process'] != 'PROD') & (df['Start'] > start) & (df['End'] > ende) & (df['Start'] < ende),['Marker','Start_x', 'Ende_x']]  = ["PROD", np.nan, ende]
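
The overwriting happens because each pass writes into the same rows of df. A loop can still work if every pass builds a separate clipped copy and the copies are concatenated at the end. Below is a minimal sketch of that idea on a reduced version of the data (the two PROD rows plus the PROG2 and STO events); the names others, pieces and result are illustrative, and this is one possible variant rather than the only fix.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": pd.to_datetime(["2022-06-07 06:24:48", "2022-06-07 14:37:16",
                             "2022-06-07 13:03:17", "2022-06-07 05:02:01"]),
    "End": pd.to_datetime(["2022-06-07 14:07:00", "2022-06-07 21:51:21",
                           "2022-06-07 17:20:08", "2022-06-07 23:32:42"]),
    "Process": ["PROD", "PROD", "NCPNA", "STO"],
})

prod = df[df["Process"] == "PROD"]
others = df[df["Process"] != "PROD"]

pieces = []  # one clipped copy per PROD window, so nothing is overwritten
for window in prod.itertuples(index=False):
    # events that overlap this PROD window at all
    overlap = others[(others["Start"] < window.End) & (others["End"] > window.Start)]
    piece = overlap.assign(
        # keep the event's own Start/End where it already lies inside the window
        Start=overlap["Start"].where(overlap["Start"] > window.Start, window.Start),
        End=overlap["End"].where(overlap["End"] < window.End, window.End),
        Marker="PROD",
    )
    pieces.append(piece)

result = pd.concat([df.assign(Marker="Original"), *pieces], ignore_index=True)
print(result)
```

Because the original rows are never modified, an event that overlaps two PROD windows (here both the NCPNA and the STO event) produces two new rows.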
