
Bin datetime range using pandas

I have a dataframe that consists of an operation id and a datetime stamp for the Start and End of an event.

OperID               Start                 End
   141 2014-03-04 19:28:39 2014-03-04 19:33:38
 10502 2014-03-04 02:26:26 2014-03-08 20:09:21
 10502 2014-03-15 00:03:45 2014-03-15 10:03:44

I would like to take this data and be able to easily create bins of various types (month, day, hour) that show how long, within each bin, the operation was in the affected state. The Start and End dates often span hour, day, and month boundaries.

My desired output, if I were binning by day, would look like:

OperID  Bin         Seconds
   141  2014-03-04  299
 10502  2014-03-04  77614
 10502  2014-03-05  86400
 10502  2014-03-06  86400
 10502  2014-03-07  86400
 10502  2014-03-08  72561
 10502  2014-03-15  35999

This is quite a verbose solution, and the loop is hard to get rid of:

Creating new columns:

from collections import OrderedDict

import numpy as np
import pandas as pd

# Day of month (1-31) of each timestamp
df['End_d']=pd.DatetimeIndex(df['End']).day
df['Start_d']=pd.DatetimeIndex(df['Start']).day

print(df)

   OperID               Start                 End  End_d  Start_d
0     141 2014-03-04 19:28:39 2014-03-04 19:33:38      4        4
1   10502 2014-03-04 02:26:26 2014-03-08 20:09:21      8        4
2   10502 2014-03-15 00:03:45 2014-03-15 10:03:44     15       15
    
[3 rows x 5 columns]

df.dtypes

OperID              int64
Start      datetime64[ns]
End        datetime64[ns]
End_d               int32
Start_d             int32
dtype: object
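
One caveat about `Start_d` and `End_d`: `.day` is the day of the month, so the difference between the two only counts days correctly when Start and End fall in the same month. A minimal sketch of the difference, assuming a hypothetical second row (not from the question) that crosses a month boundary:

```python
import pandas as pd

# First row is from the question; the second row is a made-up example
# that crosses the March/April boundary.
spans = pd.DataFrame({
    "Start": pd.to_datetime(["2014-03-04 02:26:26", "2014-03-30 22:00:00"]),
    "End":   pd.to_datetime(["2014-03-08 20:09:21", "2014-04-02 03:00:00"]),
})

# Day-of-month difference: fine within one month, negative across months.
by_day_of_month = spans["End"].dt.day - spans["Start"].dt.day + 1

# Calendar-date difference: normalize() truncates each timestamp to midnight,
# so the subtraction counts whole calendar days regardless of month.
by_calendar_date = (
    spans["End"].dt.normalize() - spans["Start"].dt.normalize()
).dt.days + 1

print(pd.DataFrame({"day_of_month": by_day_of_month,
                    "calendar_date": by_calendar_date}))
```

For the sample data in the question both approaches agree, since every interval stays within March.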

The bulk of the code:

df1 = df[df.End_d==df.Start_d].loc[:,['OperID', 'Start','End']]  # rows that start and end on the same day
df2 = df[df.End_d!=df.Start_d]                                   # rows that span more than one day
for i in df2.index:   # Expand each multi-day row into one row per day.
    days=int(df2.loc[i,'End_d']-df2.loc[i,'Start_d']+1)
    start_d_str=df2.loc[i,'Start'].strftime('%Y-%m-%d')

    temp_df=pd.DataFrame(OrderedDict({'OperID': df2.loc[i,'OperID'],
              'Start': pd.date_range('%s 00:00:00'%start_d_str, periods=days),
              'End':   pd.date_range('%s 23:59:59'%start_d_str, periods=days)}))

    temp_df.loc[0,'Start'] = df2.loc[i,'Start']      # first day begins at the real start time
    temp_df.loc[days-1, 'End'] = df2.loc[i,'End']    # last day stops at the real end time
    df1=pd.concat([df1, temp_df])                    # DataFrame.append was removed in pandas 2.0
df1['Bin']=pd.DatetimeIndex(df1.Start.apply(lambda x: x.strftime('%Y-%m-%d')))   # Keep the year-month-day only
df1['Seconds']=(df1['End']-df1['Start'])/np.timedelta64(1,'s')                   # Convert to seconds
df1=df1.sort_values(['OperID', 'Start'])             # DataFrame.sort was removed; sort ascending on both keys

Printing our results with print(df1):

                  End  OperID               Start        Bin  Seconds
0 2014-03-04 19:33:38     141 2014-03-04 19:28:39 2014-03-04      299
0 2014-03-04 23:59:59   10502 2014-03-04 02:26:26 2014-03-04    77613
1 2014-03-05 23:59:59   10502 2014-03-05 00:00:00 2014-03-05    86399
2 2014-03-06 23:59:59   10502 2014-03-06 00:00:00 2014-03-06    86399
3 2014-03-07 23:59:59   10502 2014-03-07 00:00:00 2014-03-07    86399
4 2014-03-08 20:09:21   10502 2014-03-08 00:00:00 2014-03-08    72561
2 2014-03-15 10:03:44   10502 2014-03-15 00:03:45 2014-03-15    35999
    
[7 rows x 5 columns]

Also, if you count 1 day as 86400 seconds rather than 86399 seconds, aren't you counting the last second twice (once in each day)? It's a minor issue anyway.
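
One way to get full 86400-second days without counting any second twice is to treat every bin as a half-open interval, [bin start, next bin start), and clip each event to it. The following is only a sketch under that assumption, not the code used above: `bin_durations` is a hypothetical helper name, and it relies on `pd.period_range` so that the same loop can handle daily or monthly bins and intervals that cross month boundaries.

```python
import pandas as pd

def bin_durations(df, freq="D"):
    """Split each [Start, End) interval at `freq` boundaries.

    Returns one row per (OperID, bin) with the seconds spent in that bin.
    Hypothetical helper: the name and signature are assumptions.
    """
    rows = []
    for oper_id, start, end in zip(df["OperID"], df["Start"], df["End"]):
        # Every period (day, month, ...) that the interval touches.
        for period in pd.period_range(start, end, freq=freq):
            lo = max(start, period.start_time)
            hi = min(end, (period + 1).start_time)  # exclusive upper edge
            seconds = int((hi - lo).total_seconds())
            if seconds > 0:  # skip an empty bin when End falls exactly on a boundary
                rows.append((oper_id, period.start_time, seconds))
    return pd.DataFrame(rows, columns=["OperID", "Bin", "Seconds"])

# Sample data from the question.
df = pd.DataFrame({
    "OperID": [141, 10502, 10502],
    "Start": pd.to_datetime(["2014-03-04 19:28:39", "2014-03-04 02:26:26",
                             "2014-03-15 00:03:45"]),
    "End":   pd.to_datetime(["2014-03-04 19:33:38", "2014-03-08 20:09:21",
                             "2014-03-15 10:03:44"]),
})

result = bin_durations(df, freq="D")
print(result)
```

With the question's data and `freq="D"`, the Seconds column should match the desired output in the question (299, 77614, 86400, 86400, 86400, 72561, 35999). Passing `freq="M"` gives monthly bins. The loop is still per row, so this trades speed for clarity.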
