python pandas: diff between 2 dates in a groupby
Using Python 3.6 and Pandas 0.19.2:
I have a DataFrame containing parsed log files for transactions. Each line is timestamped, contains a transactionid, and represents either the beginning or the end of a transaction (so each transactionid has one START line and one END line).
Additional info can also be present in each END line.
I would like to extract the duration of each transaction by subtracting the start date from the end date, and keep the additional info.
Sample input:
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''transactionid;event;datetime;info
1;START;2017-04-01 00:00:00;
1;END;2017-04-01 00:00:02;foo1
2;START;2017-04-01 00:00:02;
3;START;2017-04-01 00:00:02;
2;END;2017-04-01 00:00:03;foo2
4;START;2017-04-01 00:00:03;
3;END;2017-04-01 00:00:03;foo3
4;END;2017-04-01 00:00:04;foo4'''), sep=';', parse_dates=['datetime'])
Which gives the following DataFrame:
transactionid event datetime info
0 1 START 2017-04-01 00:00:00 NaN
1 1 END 2017-04-01 00:00:02 foo1
2 2 START 2017-04-01 00:00:02 NaN
3 3 START 2017-04-01 00:00:02 NaN
4 2 END 2017-04-01 00:00:03 foo2
5 4 START 2017-04-01 00:00:03 NaN
6 3 END 2017-04-01 00:00:03 foo3
7 4 END 2017-04-01 00:00:04 foo4
Expected output:
A new DataFrame such as:
transactionid start_date end_date duration info
0 1 2017-04-01 00:00:00 2017-04-01 00:00:02 00:00:02 foo1
1 2 2017-04-01 00:00:02 2017-04-01 00:00:03 00:00:01 foo2
2 3 2017-04-01 00:00:02 2017-04-01 00:00:03 00:00:01 foo3
3 4 2017-04-01 00:00:03 2017-04-01 00:00:04 00:00:01 foo4
What I have tried:
Since two consecutive lines are not always related to the same transaction, I applied a .groupby(by='transactionid') to my DataFrame. I am now stuck trying to "flatten" each group according to my needs.
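One way to "flatten" each group without an aggregation dict is to split the START and END rows and align them on transactionid; a minimal sketch, reusing the sample data from the question:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''transactionid;event;datetime;info
1;START;2017-04-01 00:00:00;
1;END;2017-04-01 00:00:02;foo1
2;START;2017-04-01 00:00:02;
3;START;2017-04-01 00:00:02;
2;END;2017-04-01 00:00:03;foo2
4;START;2017-04-01 00:00:03;
3;END;2017-04-01 00:00:03;foo3
4;END;2017-04-01 00:00:04;foo4'''), sep=';', parse_dates=['datetime'])

# Split the log into START and END rows, indexed by transaction id.
starts = (df[df.event == 'START']
          .set_index('transactionid')['datetime']
          .rename('start_date'))
ends = (df[df.event == 'END']
        .set_index('transactionid')[['datetime', 'info']]
        .rename(columns={'datetime': 'end_date'}))

# Align the two halves on the shared index and compute the duration.
out = pd.concat([starts, ends], axis=1).reset_index()
out['duration'] = out['end_date'] - out['start_date']
out = out[['transactionid', 'start_date', 'end_date', 'duration', 'info']]
print(out)
```

This assumes every transactionid has exactly one START and one END line, as stated in the question; unmatched ids would produce NaT/NaN cells after the concat.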
Try this:
# datetime was already parsed via parse_dates, so this is only a safeguard
df.datetime = pd.to_datetime(df.datetime)

# map each output column to an aggregation over the source column
funcs = {
    'datetime': {
        'start_date': 'min',
        'end_date': 'max',
        'duration': lambda x: x.max() - x.min(),
    },
    'info': 'last',   # 'last' skips NaN, so it keeps the END row's info
}

df.groupby(by='transactionid')['datetime','info'].agg(funcs).reset_index()
Result:
In [103]: df.groupby(by='transactionid')['datetime','info'].agg(funcs).reset_index()
Out[103]:
transactionid start_date end_date duration last
0 1 2017-04-01 00:00:00 2017-04-01 00:00:02 00:00:02 foo1
1 2 2017-04-01 00:00:02 2017-04-01 00:00:03 00:00:01 foo2
2 3 2017-04-01 00:00:02 2017-04-01 00:00:03 00:00:01 foo3
3 4 2017-04-01 00:00:03 2017-04-01 00:00:04 00:00:01 foo4
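Two caveats about the answer above: the info column comes back labelled last (rename it to match the expected output), and the nested-renaming dict passed to .agg was deprecated in pandas 0.20 and removed in later releases. On a modern pandas (0.25+), the same result can be expressed with named aggregation; a sketch using the sample data:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''transactionid;event;datetime;info
1;START;2017-04-01 00:00:00;
1;END;2017-04-01 00:00:02;foo1
2;START;2017-04-01 00:00:02;
3;START;2017-04-01 00:00:02;
2;END;2017-04-01 00:00:03;foo2
4;START;2017-04-01 00:00:03;
3;END;2017-04-01 00:00:03;foo3
4;END;2017-04-01 00:00:04;foo4'''), sep=';', parse_dates=['datetime'])

# Named aggregation: output_column=(source_column, aggregation).
out = (df.groupby('transactionid')
         .agg(start_date=('datetime', 'min'),
              end_date=('datetime', 'max'),
              duration=('datetime', lambda x: x.max() - x.min()),
              info=('info', 'last'))   # 'last' skips NaN, picking the END row's info
         .reset_index())
print(out)
```

This relies on START always preceding END within a transaction (so min/max map to start/end), which holds for the log format described in the question.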