python: count hours between two date columns
I'm new to Stack Overflow.
I want to calculate, per id and per month, the number of hours that an employee is off, i.e. the hours between two timestamps (beg and end). What is the best way to do this?
import pandas as pd

df = pd.DataFrame({
    'id': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
    'beg': ['2021-01-01 00:00:00',
            '2021-02-03 00:00:00', '2021-02-04 00:00:00', '2021-02-05 00:00:00',
            '2021-02-06 00:00:00', '2021-03-05 00:00:00', '2021-04-01 00:00:00'],
    'end': ['2021-01-02 00:00:00',
            '2021-02-03 12:00:00', '2021-02-04 10:00:00', '2021-02-05 10:00:00',
            '2021-02-06 10:00:00', '2021-03-07 10:00:00', '2021-05-08 00:00:00'],
})
Expected output:
x1 01/2021 24
x1 02/2021 22
x2 02/2021 20
x2 03/2021 58
x2 04/2021 552
x2 05/2021 168 (08/05/2021 = 24*7)
You can create hourly timestamps with date_range and then aggregate the counts with GroupBy.size:
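# One row per hour in [beg, end); on pandas < 1.4 use closed='left' instead of inclusive='left'
df1 = pd.concat([pd.Series(r.id, pd.date_range(r.beg, r.end, freq='h', inclusive='left'))
                 for r in df.itertuples()]).reset_index()
df1.columns = ['date', 'id']
# Count the hourly rows per id and month
df1 = df1.groupby(['id', df1['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')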
Or use DataFrame.explode:
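# Build the hourly range for each row, then expand it to one row per hour
# (on pandas < 1.4 use closed='left' instead of inclusive='left')
df['date'] = df.apply(lambda x: pd.date_range(x.beg, x.end, freq='h', inclusive='left'), axis=1)
df1 = df.explode('date')
df1 = df1.groupby(['id', df1['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
print(df1)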
id date count
0 x1 01/2021 24
1 x1 02/2021 22
2 x2 02/2021 20
3 x2 03/2021 58
4 x2 04/2021 720
5 x2 05/2021 168
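Note that the 04/2021 value differs from the 552 in the question's expected output. A quick sanity check with the standard library (not part of the original answer) suggests that 720 is what the interval from 2021-04-01 to 2021-05-08 yields for April, since April has 30 days:

```python
from datetime import datetime

beg = datetime(2021, 4, 1)
end = datetime(2021, 5, 8)
month_break = datetime(2021, 5, 1)  # first instant of May

# Hours falling in April and in May, respectively
april_hours = (month_break - beg).total_seconds() / 3600
may_hours = (end - month_break).total_seconds() / 3600

print(april_hours, may_hours)  # 720.0 168.0
```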
EDIT:
For better performance, first compute each row's duration in hours as a Series, repeat the index values that many times with DataFrame.loc, and then add hourly timedeltas to beg:
df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
# Whole hours between beg and end for each row
dif = df.end.sub(df.beg).dt.total_seconds().div(3600).astype(int)
# Repeat each row once per hour
df = df.loc[df.index.repeat(dif)].copy()
# Offset each repeated row by 0, 1, 2, ... hours from beg
df['date'] = df.beg + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='h')
print(df)
df1 = df.groupby(['id', df['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
print(df1)
id date count
0 x1 01/2021 24
1 x1 02/2021 22
2 x2 02/2021 20
3 x2 03/2021 58
4 x2 04/2021 720
5 x2 05/2021 168
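If the hour-by-hour rows are not needed for anything else, the same totals can probably be obtained without expanding each interval, by cutting every interval at month boundaries and summing the pieces. The sketch below is plain Python rather than pandas, and `hours_per_month` is a hypothetical helper name, not something from the answers here. It assumes timestamps in 'YYYY-MM-DD HH:MM:SS' form and beg <= end.

```python
from collections import defaultdict
from datetime import datetime

def hours_per_month(rows):
    """Sum hours per (id, 'MM/YYYY') for an iterable of (id, beg, end) string tuples."""
    totals = defaultdict(float)
    for id_, beg, end in rows:
        start = datetime.fromisoformat(beg.strip())
        stop = datetime.fromisoformat(end.strip())
        while start < stop:
            # First instant of the month after `start`
            if start.month == 12:
                next_month = datetime(start.year + 1, 1, 1)
            else:
                next_month = datetime(start.year, start.month + 1, 1)
            # The piece of the interval that lies inside the current month
            piece_end = min(stop, next_month)
            hours = (piece_end - start).total_seconds() / 3600
            totals[(id_, start.strftime('%m/%Y'))] += hours
            start = piece_end
    return dict(totals)

# The rows from the question
rows = [
    ('x1', '2021-01-01 00:00:00', '2021-01-02 00:00:00'),
    ('x1', '2021-02-03 00:00:00', '2021-02-03 12:00:00'),
    ('x1', '2021-02-04 00:00:00', '2021-02-04 10:00:00'),
    ('x2', '2021-02-05 00:00:00', '2021-02-05 10:00:00'),
    ('x2', '2021-02-06 00:00:00', '2021-02-06 10:00:00'),
    ('x2', '2021-03-05 00:00:00', '2021-03-07 10:00:00'),
    ('x2', '2021-04-01 00:00:00', '2021-05-08 00:00:00'),
]
result = hours_per_month(rows)
for key, hours in result.items():
    print(key, hours)
```

Because it sums fractional hours instead of counting hourly rows, this variant should also cope with intervals that do not start or end on a whole hour.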
EDIT:
This solution should work for a large DataFrame:
import numpy as np

df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
df['dif'] = df.end.sub(df.beg).dt.total_seconds().div(3600).astype(int)

N = 5000  # rows per chunk
out = []
# Integer division groups consecutive rows into chunks of N
for n, g in df.groupby(np.arange(len(df.index)) // N):
    g = g.loc[g.index.repeat(g['dif'])].copy()
    g['date'] = g.beg + pd.to_timedelta(g.groupby(level=0).cumcount(), unit='h')
    s = g.groupby(['id', g['date'].dt.strftime('%m/%Y')]).size()
    out.append(s)

# Sum the partial counts per (id, month); Series.sum(level=...) was removed in pandas 2.0
df = pd.concat(out).groupby(level=[0, 1]).sum().reset_index(name='count')
print(df)
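The chunking relies on mapping each row position to a chunk label, so that groupby processes N rows at a time. As a tiny illustration in plain Python, with a small hypothetical N chosen for readability, integer division produces those labels, whereas true division would give every row its own label:

```python
N = 3  # hypothetical small chunk size, for illustration only
positions = list(range(8))

floor_labels = [i // N for i in positions]  # chunk label per row position
true_labels = [i / N for i in positions]    # one distinct float per row

print(floor_labels)           # [0, 0, 0, 1, 1, 1, 2, 2]
print(len(set(true_labels)))  # 8
```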
I hope the following code works for you:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id':['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2']
, 'beg':['2021-01-01 00:00:00',
'2021-02-03 00:00:00','2021-02-04 00:00:00','2021-02-05 00:00:00',
'2021-02-06 00:00:00','2021-03-05 00:00:00','2021-04-01 00:00:00'],
'end':['2021-01-02 00:00:00 ',
'2021-02-03 12:00:00','2021-02-04 10:00:00','2021-02-05 10:00:00',
'2021-02-06 10:00:00','2021-03-07 10:00:00','2021-05-08 00:00:00']})
df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
# Duration of each row as a timedelta, then expressed in hours
df["difference"] = df.end - df.beg
print(df.difference / np.timedelta64(1, 'h'))