I'm new to Stack Overflow.
I want to calculate, per id and month, the number of hours that an employee is off, i.e. the hours between two timestamps (beg and end). What is the best way to do this?
import pandas as pd
df = pd.DataFrame({
    'id': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
    'beg': ['2021-01-01 00:00:00', '2021-02-03 00:00:00', '2021-02-04 00:00:00',
            '2021-02-05 00:00:00', '2021-02-06 00:00:00', '2021-03-05 00:00:00',
            '2021-04-01 00:00:00'],
    'end': ['2021-01-02 00:00:00', '2021-02-03 12:00:00', '2021-02-04 10:00:00',
            '2021-02-05 10:00:00', '2021-02-06 10:00:00', '2021-03-07 10:00:00',
            '2021-05-08 00:00:00'],
})
Expected output:
x1 01/2021 24
x1 02/2021 22
x2 02/2021 20
x2 03/2021 58
x2 04/2021 552
x2 05/2021 168 (up to 08/05/2021 = 24*7)
You can create the hours with date_range and then aggregate the counts with GroupBy.size:
# Note: pandas 1.4+ uses inclusive='left' in date_range; closed= was removed in pandas 2.0
df1 = pd.concat([pd.Series(r.id, pd.date_range(r.beg, r.end, freq='H', closed='left'))
                 for r in df.itertuples()]).reset_index()
df1.columns = ['date', 'id']
df1 = df1.groupby(['id', df1['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
Or use DataFrame.explode:
df['date'] = df.apply(lambda x: pd.date_range(x.beg, x.end,freq='H',closed='left'), axis=1)
df1 = df.explode('date')
df1 = df1.groupby(['id', df1['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
print (df1)
id date count
0 x1 01/2021 24
1 x1 02/2021 22
2 x2 02/2021 20
3 x2 03/2021 58
4 x2 04/2021 720
5 x2 05/2021 168
EDIT:
A solution with better performance: create a Series with the difference in hours, repeat the index values that many times with DataFrame.loc, and then add hourly timedeltas to beg:
df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
dif = df.end.sub(df.beg).dt.total_seconds().div(3600).astype(int)
df = df.loc[df.index.repeat(dif)].copy()
df['date'] = df.beg + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='H')
print (df)
df1 = df.groupby(['id', df['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
print (df1)
id date count
0 x1 01/2021 24
1 x1 02/2021 22
2 x2 02/2021 20
3 x2 03/2021 58
4 x2 04/2021 720
5 x2 05/2021 168
EDIT:
This solution should work for a large DataFrame, because it processes the rows in chunks of N:
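import numpy as np
df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
df['dif'] = df.end.sub(df.beg).dt.total_seconds().div(3600).astype(int)
N = 5000  # number of rows processed per chunk
out = []
# Integer division puts N consecutive rows in each chunk
for n, g in df.groupby(np.arange(len(df.index)) // N):
    # Repeat each row once per hour, then build the hourly timestamps
    g = g.loc[g.index.repeat(g['dif'])].copy()
    g['date'] = g.beg + pd.to_timedelta(g.groupby(level=0).cumcount(), unit='H')
    s = g.groupby(['id', g['date'].dt.strftime('%m/%Y')]).size()
    out.append(s)
# Sum the partial counts per (id, month); Series.sum(level=...) was removed in pandas 2.0
df = pd.concat(out).groupby(level=[0, 1]).sum().reset_index(name='count')
print(df)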
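The chunked loop groups rows by a per-row label. As a minimal sketch of how such labels can be built (the `toy` frame and the chunk size of 3 are made up for illustration), integer division of the row position by N gives every block of N consecutive rows the same label:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 7 rows, processed in chunks of N = 3 rows
toy = pd.DataFrame({'value': range(7)})
N = 3

# Integer division gives each block of N consecutive rows the same label
labels = np.arange(len(toy)) // N

# Size of each chunk that a loop over the groups would see
chunk_sizes = [len(g) for _, g in toy.groupby(labels)]
print(chunk_sizes)  # [3, 3, 1]
```

With true division (`/`) every row would get its own float label, so each group would hold a single row and the chunking would bring no benefit.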
I hope the following code works for you:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id':['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2']
, 'beg':['2021-01-01 00:00:00',
'2021-02-03 00:00:00','2021-02-04 00:00:00','2021-02-05 00:00:00',
'2021-02-06 00:00:00','2021-03-05 00:00:00','2021-04-01 00:00:00'],
'end':['2021-01-02 00:00:00',
'2021-02-03 12:00:00','2021-02-04 10:00:00','2021-02-05 10:00:00',
'2021-02-06 10:00:00','2021-03-07 10:00:00','2021-05-08 00:00:00']})
df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
df["difference"] = df.end - df.beg
print(df.difference/ np.timedelta64(1, 'h'))
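Note that this difference gives the total hours per row, not per month, so an interval that crosses a month boundary is not split. One possible way to get the per-month totals without building hourly rows is to cut each interval at the start of the next month. This is only a sketch: the helper name `hours_per_month` and the `hours` column are my own, and it assumes every `end` is later than or equal to its `beg`.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
    'beg': ['2021-01-01 00:00:00', '2021-02-03 00:00:00', '2021-02-04 00:00:00',
            '2021-02-05 00:00:00', '2021-02-06 00:00:00', '2021-03-05 00:00:00',
            '2021-04-01 00:00:00'],
    'end': ['2021-01-02 00:00:00', '2021-02-03 12:00:00', '2021-02-04 10:00:00',
            '2021-02-05 10:00:00', '2021-02-06 10:00:00', '2021-03-07 10:00:00',
            '2021-05-08 00:00:00'],
})


def hours_per_month(frame):
    """Split each [beg, end) interval at month boundaries and sum hours per id and month."""
    rows = []
    for r in frame.itertuples(index=False):
        start, end = pd.Timestamp(r.beg), pd.Timestamp(r.end)
        while start < end:
            # First instant of the month following `start`
            next_month = (start.to_period('M') + 1).to_timestamp()
            stop = min(end, next_month)
            hours = (stop - start) / pd.Timedelta(hours=1)
            rows.append((r.id, start.strftime('%m/%Y'), hours))
            start = stop
    out = pd.DataFrame(rows, columns=['id', 'date', 'hours'])
    return out.groupby(['id', 'date'], as_index=False)['hours'].sum()


result = hours_per_month(df)
print(result)
```

The loop runs once per month touched by an interval rather than once per hour, so it stays small even for long absences. The month labels are plain strings, so the default sort order is only chronological within a single year.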