python: count hours between two date columns
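I'm new to Stack Overflow.
I want to compute, for each id and each month, the number of hours an employee was on leave; technically, the number of hours between two timestamps (beg and end). What is the best way to get this?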
import pandas as pd
df = pd.DataFrame({'id': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
                   'beg': ['2021-01-01 00:00:00',
                           '2021-02-03 00:00:00', '2021-02-04 00:00:00', '2021-02-05 00:00:00',
                           '2021-02-06 00:00:00', '2021-03-05 00:00:00', '2021-04-01 00:00:00'],
                   'end': ['2021-01-02 00:00:00',
                           '2021-02-03 12:00:00', '2021-02-04 10:00:00', '2021-02-05 10:00:00',
                           '2021-02-06 10:00:00', '2021-03-07 10:00:00', '2021-05-08 00:00:00']})
Expected output:
x1 01/2021 24
x1 02/2021 22
x2 02/2021 20
x2 03/2021 58
x2 04/2021 552
x2 05/2021 168 (up to 08/05/2021: 24*7)
You can create the hourly timestamps with date_range, then count them per group with GroupBy.size:
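# one Series per row: index = every full hour in [beg, end), value = id
# (in pandas >= 1.4 the keyword is inclusive='left'; closed= was removed in pandas 2.0)
df1 = pd.concat([pd.Series(r.id, pd.date_range(r.beg, r.end, freq='H', closed='left'))
                 for r in df.itertuples()]).reset_index()
df1.columns = ['date', 'id']
# count the hourly rows per id and month
df1 = df1.groupby(['id', df1['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')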
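A note on pandas versions, as far as I know: `closed=` in `date_range` was deprecated in pandas 1.4 in favour of `inclusive=` and no longer exists in 2.0. The sketch below wraps both spellings in a helper (`hourly_left_closed` is a made-up name, not a pandas function) and shows what "left-closed" means for one row of the question: `beg` is included, `end` is not, so a 12-hour leave yields 12 hourly stamps.

```python
import pandas as pd

def hourly_left_closed(beg, end):
    """Hourly stamps in [beg, end): includes beg, excludes end."""
    try:
        # keyword used by pandas >= 1.4
        return pd.date_range(beg, end, freq="h", inclusive="left")
    except TypeError:
        # assumption: older pandas rejects `inclusive` and expects `closed`
        return pd.date_range(beg, end, freq="h", closed="left")

# second row of the question's data: 2021-02-03 00:00 to 12:00
stamps = hourly_left_closed("2021-02-03 00:00:00", "2021-02-03 12:00:00")
print(len(stamps), stamps[0], stamps[-1])
```

The last stamp is 11:00, not 12:00, which is why each stamp can be read as "one hour of leave starting at this time".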
Or use DataFrame.explode:
df['date'] = df.apply(lambda x: pd.date_range(x.beg, x.end,freq='H',closed='left'), axis=1)
df1 = df.explode('date')
df1 = df1.groupby(['id', df1['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
print (df1)
id date count
0 x1 01/2021 24
1 x1 02/2021 22
2 x2 02/2021 20
3 x2 03/2021 58
4 x2 04/2021 720
5 x2 05/2021 168
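The count for x2 in 04/2021 is 720 here, while the question lists 552. As a quick cross-check that does not rely on pandas, here is a minimal sketch in plain Python that splits one interval at month boundaries and measures each piece. The helper name `hours_per_month` is made up for this illustration, and it assumes leave is counted continuously, 24 hours a day.

```python
from datetime import datetime

def hours_per_month(beg, end):
    """Split the interval [beg, end) at month boundaries; return {'MM/YYYY': hours}."""
    out = {}
    cur = beg
    while cur < end:
        # first instant of the month following `cur`
        if cur.month == 12:
            next_month = datetime(cur.year + 1, 1, 1)
        else:
            next_month = datetime(cur.year, cur.month + 1, 1)
        stop = min(next_month, end)
        hours = (stop - cur).total_seconds() / 3600
        key = cur.strftime("%m/%Y")
        out[key] = out.get(key, 0) + hours
        cur = stop
    return out

# last row of the question's data: 2021-04-01 00:00 to 2021-05-08 00:00
last_row = hours_per_month(datetime(2021, 4, 1), datetime(2021, 5, 8))
print(last_row)  # {'04/2021': 720.0, '05/2021': 168.0}
```

April has 30 days, so 30 * 24 = 720 hours fall in 04/2021 and the remaining 7 * 24 = 168 in 05/2021, which matches the printed result.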
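Edit:
A solution with better performance: compute the difference in hours as a Series, repeat the index values that many times with DataFrame.loc, then add a per-row hour counter (as a timedelta) to beg: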
df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
dif = df.end.sub(df.beg).dt.total_seconds().div(3600).astype(int)
df = df.loc[df.index.repeat(dif)].copy()
df['date'] = df.beg + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='H')
print (df)
df1 = df.groupby(['id', df['date'].dt.strftime('%m/%Y')]).size().reset_index(name='count')
print (df1)
id date count
0 x1 01/2021 24
1 x1 02/2021 22
2 x2 02/2021 20
3 x2 03/2021 58
4 x2 04/2021 720
5 x2 05/2021 168
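To see what the repeat-and-cumcount step does, here is a small sketch on a toy frame (the data and the `hours` column are invented for the illustration, not taken from the question). Each row is repeated once per hour, `cumcount` numbers the copies 0, 1, 2, ... within each original row, and adding that counter to `beg` gives one timestamp per hour.

```python
import pandas as pd

# toy data: row "a" crosses a month boundary, row "b" does not
toy = pd.DataFrame({
    "id": ["a", "b"],
    "beg": pd.to_datetime(["2021-01-31 22:00:00", "2021-02-10 00:00:00"]),
    "hours": [3, 2],
})

# repeat each row `hours` times; the original index label is kept on every copy
expanded = toy.loc[toy.index.repeat(toy["hours"])].copy()

# 0, 1, 2 for the copies of row 0, then 0, 1 for the copies of row 1
step = expanded.groupby(level=0).cumcount()
expanded["date"] = expanded["beg"] + pd.to_timedelta(step, unit="h")

# count hourly rows per id and month
counts = expanded.groupby(["id", expanded["date"].dt.strftime("%m/%Y")]).size()
print(counts)
```

Row "a" contributes two hours to 01/2021 (22:00 and 23:00) and one to 02/2021 (00:00 on 1 February), so the month split falls out of the grouping without any special handling.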
Edit:
This solution should work for large DataFrames:
import numpy as np

df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
df['dif'] = df.end.sub(df.beg).dt.total_seconds().div(3600).astype(int)
N = 5000  # rows per chunk
out = []
# floor division gives one integer key per block of N rows
for n, g in df.groupby(np.arange(len(df.index)) // N):
    g = g.loc[g.index.repeat(g['dif'])].copy()
    g['date'] = g.beg + pd.to_timedelta(g.groupby(level=0).cumcount(), unit='H')
    s = g.groupby(['id', g['date'].dt.strftime('%m/%Y')]).size()
    out.append(s)
# add up the per-chunk counts per (id, month); Series.sum(level=...) was removed in pandas 2.0
df = pd.concat(out).groupby(level=[0, 1]).sum().reset_index(name='count')
print (df)
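The loop above processes the frame in blocks of N rows by grouping on a key built from the row position. A small sketch of how such a key behaves (toy data, N = 3 chosen only for readability): floor division `//` yields blocks of N rows, whereas true division `/` yields a distinct float per row, so every group would contain a single row.

```python
import numpy as np
import pandas as pd

rows = pd.DataFrame({"value": range(7)})
N = 3

floor_key = np.arange(len(rows)) // N   # 0 0 0 1 1 1 2
true_key = np.arange(len(rows)) / N     # 0.0 0.33 0.67 1.0 ... all distinct

chunk_sizes = [len(g) for _, g in rows.groupby(floor_key)]
single_rows = [len(g) for _, g in rows.groupby(true_key)]

print(chunk_sizes)   # [3, 3, 1]
print(single_rows)   # [1, 1, 1, 1, 1, 1, 1]
```

Both keys give the same final counts once the per-chunk results are summed; the difference is only in how much work each iteration does, which is the point of chunking.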
I hope the following code is useful to you:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id':['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2']
, 'beg':['2021-01-01 00:00:00',
'2021-02-03 00:00:00','2021-02-04 00:00:00','2021-02-05 00:00:00',
'2021-02-06 00:00:00','2021-03-05 00:00:00','2021-04-01 00:00:00'],
'end':['2021-01-02 00:00:00',
'2021-02-03 12:00:00','2021-02-04 10:00:00','2021-02-05 10:00:00',
'2021-02-06 10:00:00','2021-03-07 10:00:00','2021-05-08 00:00:00']})
df.beg = pd.to_datetime(df.beg)
df.end = pd.to_datetime(df.end)
df["difference"] = df.end - df.beg
print(df.difference/ np.timedelta64(1, 'h'))
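Note that this prints the total duration of each row in hours; it does not split a row that spans several months. As a rough sketch of what can be derived from it directly, the per-row hours can be summed per id (the variable names `leave` and `total_per_id` are mine, not from the answer):

```python
import pandas as pd

leave = pd.DataFrame({
    "id": ["x1", "x1", "x1", "x2", "x2", "x2", "x2"],
    "beg": pd.to_datetime([
        "2021-01-01 00:00:00", "2021-02-03 00:00:00", "2021-02-04 00:00:00",
        "2021-02-05 00:00:00", "2021-02-06 00:00:00", "2021-03-05 00:00:00",
        "2021-04-01 00:00:00"]),
    "end": pd.to_datetime([
        "2021-01-02 00:00:00", "2021-02-03 12:00:00", "2021-02-04 10:00:00",
        "2021-02-05 10:00:00", "2021-02-06 10:00:00", "2021-03-07 10:00:00",
        "2021-05-08 00:00:00"]),
})

# duration of each row in hours (float)
leave["hours"] = (leave["end"] - leave["beg"]) / pd.Timedelta(hours=1)

# total hours per id, ignoring month boundaries
total_per_id = leave.groupby("id")["hours"].sum()
print(total_per_id)
```

The totals (46 hours for x1, 966 for x2) agree with the sums of the monthly counts shown earlier, which is a cheap consistency check on any per-month solution.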