python: count hours between two date columns

I'm new to Stack Overflow.

I want to calculate, per id and month, the number of hours that an employee is off, i.e. the hours between two timestamps (beg and end). What is the best way to get this?

import pandas as pd
df = pd.DataFrame({'id':['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2']
   ,  'beg':['2021-01-01 00:00:00',
   '2021-02-03 00:00:00','2021-02-04 00:00:00','2021-02-05 00:00:00',
   '2021-02-06 00:00:00','2021-03-05 00:00:00','2021-04-01 00:00:00'],
      'end':['2021-01-02 00:00:00',
   '2021-02-03 12:00:00','2021-02-04 10:00:00','2021-02-05 10:00:00',
   '2021-02-06 10:00:00','2021-03-07 10:00:00','2021-05-08 00:00:00']})

Expected output

x1 01/2021  24
x1 02/2021  22
x2 02/2021    20
x2 03/2021     58
x2 04/2021 552
x2 05/2021 168(08/05/2021 = 24*7)

You can generate one timestamp per hour for each row with date_range and then aggregate the counts with GroupBy.size:

# pandas >= 1.4 uses inclusive='left'; older versions spell it closed='left'
df1 = pd.concat([pd.Series(r.id, pd.date_range(r.beg, r.end, freq='h', inclusive='left'))
                 for r in df.itertuples()]).reset_index()
df1.columns=['date','id']

df1 = df1.groupby(['id', df1['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')

Or use DataFrame.explode:

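# pandas >= 1.4 uses inclusive='left'; older versions spell it closed='left'
df['date'] = df.apply(lambda x: pd.date_range(x.beg, x.end, freq='h', inclusive='left'), axis=1)

df1 = df.explode('date')
# explode returns an object column, so convert it back to datetime before using .dt
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.groupby(['id', df1['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
print(df1)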
   id     date  count
0  x1  01/2021     24
1  x1  02/2021     22
2  x2  02/2021     20
3  x2  03/2021     58
4  x2  04/2021    720
5  x2  05/2021    168
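
Note that the 720 for 04/2021 differs from the 552 in the question's expected output. As a quick sanity check, assuming each interval is treated as half-open (beg included, end excluded, as the left-closed date_range does), the last x2 interval can be split by hand at the start of May. The variable names below are only illustrative:

```python
import pandas as pd

# The last x2 interval from the question; it crosses a month boundary.
beg = pd.Timestamp('2021-04-01 00:00:00')
end = pd.Timestamp('2021-05-08 00:00:00')
month_start = pd.Timestamp('2021-05-01 00:00:00')  # first instant of May

one_hour = pd.Timedelta(hours=1)
april_hours = int((month_start - beg) / one_hour)  # 30 days * 24
may_hours = int((end - month_start) / one_hour)    # 7 days * 24
total_hours = int((end - beg) / one_hour)

print(april_hours, may_hours, total_hours)  # 720 168 888
```

April has 30 days, so all 30 * 24 = 720 hours fall in 04/2021, and the remaining 7 * 24 = 168 hours fall in 05/2021.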

EDIT:

A faster solution: compute each row's duration in hours as a Series, repeat every row that many times with DataFrame.loc and Index.repeat, and then add a running hour offset (as timedeltas) to beg:

df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
dif = df.end.sub(df.beg).dt.total_seconds().div(3600).astype(int)

df = df.loc[df.index.repeat(dif)].copy()
df['date'] = df.beg + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='h')
print(df)

df1 = df.groupby(['id', df['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
print (df1)
   id     date  count
0  x1  01/2021     24
1  x1  02/2021     22
2  x2  02/2021     20
3  x2  03/2021     58
4  x2  04/2021    720
5  x2  05/2021    168
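
To see what the repeat and cumcount steps do, here is a minimal sketch on a made-up two-row frame (the `toy` data is hypothetical, not taken from the question). Each row is repeated once per whole hour of its duration, and cumcount numbers the copies 0, 1, 2, ... within each original row:

```python
import pandas as pd

# Hypothetical toy data: one 3-hour interval crossing a month end, one 2-hour interval.
toy = pd.DataFrame({
    'id': ['a', 'b'],
    'beg': pd.to_datetime(['2021-01-31 22:00:00', '2021-02-03 00:00:00']),
    'end': pd.to_datetime(['2021-02-01 01:00:00', '2021-02-03 02:00:00']),
})

# Whole hours per row: 3 and 2.
dif = toy.end.sub(toy.beg).dt.total_seconds().div(3600).astype(int)

# Repeat each row dif times; the original index labels are duplicated.
expanded = toy.loc[toy.index.repeat(dif)].copy()

# Running counter per original row: 0, 1, 2 for row 0 and 0, 1 for row 1.
offsets = expanded.groupby(level=0).cumcount()

# Shift beg by that many hours to get one timestamp per hour.
expanded['date'] = expanded.beg + pd.to_timedelta(offsets, unit='h')
print(expanded[['id', 'date']])
```

The third copy of row `a` lands on 2021-02-01 00:00:00, so grouping by month attributes that hour to February rather than January.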

EDIT:

This solution should also work for a large DataFrame, because it processes the rows in chunks of N:

import numpy as np

df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
df['dif'] = df.end.sub(df.beg).dt.total_seconds().div(3600).astype(int)

N = 5000  # rows per chunk

out = []
# integer division assigns N consecutive rows to each chunk
for n, g in df.groupby(np.arange(len(df.index)) // N):
    g = g.loc[g.index.repeat(g['dif'])].copy()
    g['date'] = g.beg + pd.to_timedelta(g.groupby(level=0).cumcount(), unit='h')
    s = g.groupby(['id', g['date'].dt.strftime('%m/%Y')]).size()
    out.append(s)

# sum(level=...) was removed in pandas 2.0; group by the index levels instead
df = pd.concat(out).groupby(level=[0, 1]).sum().reset_index(name='count')
print(df)
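
If expanding every interval into hourly rows is still too heavy, a different technique (not part of the answer above) is to split each interval only at month starts and sum the piece lengths. The sketch below is one possible way to do that; `hours_per_month` is a hypothetical helper name, and it assumes half-open intervals with beg and end already parsed as datetimes:

```python
import pandas as pd

def hours_per_month(frame):
    """Split each [beg, end) interval at month starts and sum hours per id and month."""
    rows = []
    for r in frame.itertuples(index=False):
        beg, end = pd.Timestamp(r.beg), pd.Timestamp(r.end)
        # Candidate cut points: midnight on the first day of each month.
        cuts = pd.date_range(beg.normalize(), end, freq='MS')
        # Keep only the cut points strictly inside the interval.
        edges = [beg] + [c for c in cuts if beg < c < end] + [end]
        for left, right in zip(edges[:-1], edges[1:]):
            hours = (right - left) / pd.Timedelta(hours=1)
            rows.append((r.id, left.strftime('%m/%Y'), hours))
    pieces = pd.DataFrame(rows, columns=['id', 'date', 'hours'])
    return pieces.groupby(['id', 'date'], as_index=False)['hours'].sum()

data = pd.DataFrame({
    'id': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
    'beg': pd.to_datetime(['2021-01-01 00:00:00', '2021-02-03 00:00:00',
                           '2021-02-04 00:00:00', '2021-02-05 00:00:00',
                           '2021-02-06 00:00:00', '2021-03-05 00:00:00',
                           '2021-04-01 00:00:00']),
    'end': pd.to_datetime(['2021-01-02 00:00:00', '2021-02-03 12:00:00',
                           '2021-02-04 10:00:00', '2021-02-05 10:00:00',
                           '2021-02-06 10:00:00', '2021-03-07 10:00:00',
                           '2021-05-08 00:00:00']),
})

result = hours_per_month(data)
print(result)
```

On the question's data this gives the same six totals as the hourly expansion (as floats). It creates one row per interval and month instead of one per hour, and it also keeps fractional hours, which the hourly counting truncates.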

I hope the following code works for you:


import pandas as pd
import numpy as np
df = pd.DataFrame({'id':['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2']
   ,  'beg':['2021-01-01 00:00:00',
   '2021-02-03 00:00:00','2021-02-04 00:00:00','2021-02-05 00:00:00',
   '2021-02-06 00:00:00','2021-03-05 00:00:00','2021-04-01 00:00:00'],
      'end':['2021-01-02 00:00:00',
   '2021-02-03 12:00:00','2021-02-04 10:00:00','2021-02-05 10:00:00',
   '2021-02-06 10:00:00','2021-03-07 10:00:00','2021-05-08 00:00:00']})
df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
df["difference"] = df.end - df.beg
print(df.difference/ np.timedelta64(1, 'h'))
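
The code above prints one duration per row. As a rough sketch of what happens if those durations are summed per id and month directly, the example below groups by the month of beg (the `hours` and `per_start_month` names are illustrative only). This assigns every hour of an interval to the month in which it starts:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
    'beg': ['2021-01-01 00:00:00', '2021-02-03 00:00:00', '2021-02-04 00:00:00',
            '2021-02-05 00:00:00', '2021-02-06 00:00:00', '2021-03-05 00:00:00',
            '2021-04-01 00:00:00'],
    'end': ['2021-01-02 00:00:00', '2021-02-03 12:00:00', '2021-02-04 10:00:00',
            '2021-02-05 10:00:00', '2021-02-06 10:00:00', '2021-03-07 10:00:00',
            '2021-05-08 00:00:00'],
})
df['beg'] = pd.to_datetime(df['beg'])
df['end'] = pd.to_datetime(df['end'])

# Duration of each row in hours.
df['hours'] = (df['end'] - df['beg']) / pd.Timedelta(hours=1)

# Sum per id and per month of the start timestamp.
month = df['beg'].dt.strftime('%m/%Y').rename('date')
per_start_month = df.groupby(['id', month])['hours'].sum().reset_index()
print(per_start_month)
```

For intervals that stay within one month this matches the other answer, but the last x2 interval is reported as 888 hours under 04/2021 instead of being split into 720 for April and 168 for May.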
