pandas calculate difference between two columns
I'm new to Stack Overflow.
I want to calculate, per id and month, the number of hours between two timestamps (beg and end). What is the best way to do this?
import pandas as pd
df = pd.DataFrame({'id': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
    'beg': ['2021-01-01 00:00:00',
            '2021-02-03 00:00:00', '2021-02-04 00:00:00', '2021-02-05 00:00:00',
            '2021-02-06 00:00:00', '2021-03-05 00:00:00', '2021-04-01 00:00:00'],
    'end': ['2021-01-02 00:00:00',
            '2021-02-03 12:00:00', '2021-02-04 10:00:00', '2021-02-05 10:00:00',
            '2021-02-06 10:00:00', '2021-03-07 10:00:00', '2021-05-08 00:00:00']})
Expected output:
x1 01/2021 24
x1 02/2021 22
x2 02/2021 20
x2 03/2021 58
x2 04/2021 720
x2 05/2021 192
Calculate the difference between the two columns, then group by id and month. Sum the differences within each group and convert the result to hours.
import numpy as np
# Assumes 'beg' and 'end' were already converted with pd.to_datetime
df.assign(diff=df[['beg', 'end']].diff(axis=1)['end']).groupby(['id', df['beg'].dt.strftime('%m/%Y')])[['diff']].sum()/np.timedelta64(1, 'h')
             diff
id beg
x1 01/2021   24.0
   02/2021   22.0
x2 02/2021   20.0
   03/2021   58.0
   04/2021  720.0
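The conversion to hours at the end of that one-liner may be easier to see in isolation. Below is a minimal sketch using two rows from the question's data; the variable names `delta` and `hours` are only for illustration.

```python
import numpy as np
import pandas as pd

# Two intervals taken from the question's data
beg = pd.to_datetime(pd.Series(['2021-02-03 00:00:00', '2021-03-05 00:00:00']))
end = pd.to_datetime(pd.Series(['2021-02-03 12:00:00', '2021-03-07 10:00:00']))

delta = end - beg                        # timedelta64[ns] Series
hours = delta / np.timedelta64(1, 'h')   # dividing by one hour gives float hours
print(hours.tolist())                    # [12.0, 58.0]
```

Dividing by `np.timedelta64(1, 'h')` keeps fractional hours; cast to int afterwards if only whole hours are wanted.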
First, we need a workaround to label each month properly:
# Convert your data to datetime
df[['beg','end']] = df[['beg','end']].astype('datetime64[ns]')
# Identify rows with distinct months
months_diff = df.beg.dt.month < df.end.dt.month
# Function to split the months for posterior time comparison
def deal_with_diff_months(row):
    actual_month = [row['id'], row['beg'], row['end'].floor('d')]
    next_month = [row['id'], row['end'].floor('d'), row['end']]
    return actual_month, next_month
# Create a new dataframe for split months
df_tmp = df[months_diff].apply(deal_with_diff_months, axis=1)
df_tmp = pd.DataFrame(df_tmp.explode().tolist(), columns=df.columns)
# Renew dataframe with split months
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df[~months_diff], df_tmp])
Now we can use the code chunk below as originally answered:
# Create a new column to group by month as well
df['month'] = df['beg'].dt.strftime('%m/%Y')
# Group by id and month, then calculate and sum the difference
result = df.groupby(['id','month']).apply(lambda x: (x['end'] - x['beg']).sum())
# Convert the difference to hours
result = (result.dt.total_seconds()/60/60).astype(int)
Output:
id  month
x1  01/2021     24
    02/2021     22
x2  02/2021     20
    03/2021     58
    04/2021    720
    05/2021      0
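The split above cuts each interval at the end date floored to the day, so the part attributed to the following month can come out as 0. As an alternative, here is a rough sketch that cuts every interval at the first instant of each month it crosses. The helper name `split_by_month` and the `pieces` frame are made up for this example, and it assumes the end timestamp is exclusive.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
    'beg': ['2021-01-01 00:00:00', '2021-02-03 00:00:00', '2021-02-04 00:00:00',
            '2021-02-05 00:00:00', '2021-02-06 00:00:00', '2021-03-05 00:00:00',
            '2021-04-01 00:00:00'],
    'end': ['2021-01-02 00:00:00', '2021-02-03 12:00:00', '2021-02-04 10:00:00',
            '2021-02-05 10:00:00', '2021-02-06 10:00:00', '2021-03-07 10:00:00',
            '2021-05-08 00:00:00'],
})
df['beg'] = pd.to_datetime(df['beg'])
df['end'] = pd.to_datetime(df['end'])

def split_by_month(row):
    """Hypothetical helper: yield (id, month label, hours) per month touched."""
    pieces = []
    start = row['beg']
    while start < row['end']:
        # First instant of the month after the one containing `start`
        next_month = (start.to_period('M') + 1).to_timestamp()
        stop = min(next_month, row['end'])
        hours = (stop - start) / pd.Timedelta(hours=1)
        pieces.append((row['id'], start.strftime('%m/%Y'), hours))
        start = stop
    return pieces

rows = [piece for _, row in df.iterrows() for piece in split_by_month(row)]
pieces = pd.DataFrame(rows, columns=['id', 'month', 'hours'])

# sort=False keeps the groups in order of first appearance
result = pieces.groupby(['id', 'month'], sort=False)['hours'].sum()
print(result)
```

With the question's data this gives 720 hours for x2 in 04/2021 and 168 hours in 05/2021 (1 May 00:00 to 8 May 00:00 is seven days), which is not the 192 shown in the expected output. If the end day should count in full, the end timestamps would need to be shifted first.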
You may try this:
df = pd.DataFrame(
{'id':['x1', 'x1', 'x1', 'x2', 'x2', 'x2', 'x2'],
'beg':['2021-01-01 00:00:00', '2021-02-03 00:00:00','2021-02-04 00:00:00','2021-02-05 00:00:00','2021-02-06 00:00:00','2021-03-05 00:00:00','2021-04-08 00:00:00'],
'end':['2021-01-02 00:00:00','2021-02-03 12:00:00','2021-02-04 10:00:00','2021-02-05 10:00:00','2021-02-06 10:00:00','2021-03-07 10:00:00','2021-05-08 00:00:00']})
df['beg'] = pd.to_datetime(df['beg'], format='%Y-%m-%d %H:%M:%S')
df['end'] = pd.to_datetime(df['end'], format='%Y-%m-%d %H:%M:%S')
hours_diff = []
for i in range(len(df)):
    diff = df['end'][i] - df['beg'][i]
    days, seconds = diff.days, diff.seconds
    hours = days * 24 + seconds // 3600
    hours_diff.append(hours)
df['hours_diff'] = hours_diff
print(df)
Output:
   id        beg                 end  hours_diff
0  x1 2021-01-01 2021-01-02 00:00:00          24
1  x1 2021-02-03 2021-02-03 12:00:00          12
2  x1 2021-02-04 2021-02-04 10:00:00          10
3  x2 2021-02-05 2021-02-05 10:00:00          10
4  x2 2021-02-06 2021-02-06 10:00:00          10
5  x2 2021-03-05 2021-03-07 10:00:00          58
6  x2 2021-04-08 2021-05-08 00:00:00         720
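The row-by-row loop can probably be replaced by a single column operation. A small sketch on three of the rows above, assuming whole hours are wanted as in the loop:

```python
import pandas as pd

df = pd.DataFrame({
    'beg': pd.to_datetime(['2021-02-03 00:00:00', '2021-03-05 00:00:00',
                           '2021-04-08 00:00:00']),
    'end': pd.to_datetime(['2021-02-03 12:00:00', '2021-03-07 10:00:00',
                           '2021-05-08 00:00:00']),
})

# Floor-divide the timedelta column by one hour to get whole hours per row
df['hours_diff'] = (df['end'] - df['beg']) // pd.Timedelta(hours=1)
print(df['hours_diff'].tolist())
```

This still gives one value per row; a groupby on id and month is needed afterwards to get the per-month totals asked for in the question.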