Python pandas: difference between 2 dates in a groupby

Question

Using Python 3.6 and Pandas 0.19.2:

I have a DataFrame containing parsed log files for transactions. Each line is timestamped, contains a transactionid, and represents either the beginning or the end of a transaction (so each transactionid has 1 line for start and 1 line for end).

Additional info can also be present in each end line.

I would like to extract the duration of each transaction by subtracting the start date from the end date, while keeping the additional info.

Sample input:

import pandas as pd
import io
df = pd.read_csv(io.StringIO('''transactionid;event;datetime;info
1;START;2017-04-01 00:00:00;
1;END;2017-04-01 00:00:02;foo1
2;START;2017-04-01 00:00:02;
3;START;2017-04-01 00:00:02;
2;END;2017-04-01 00:00:03;foo2
4;START;2017-04-01 00:00:03;
3;END;2017-04-01 00:00:03;foo3
4;END;2017-04-01 00:00:04;foo4'''), sep=';', parse_dates=['datetime'])

Which gives the following DataFrame:

   transactionid  event             datetime  info
0              1  START  2017-04-01 00:00:00   NaN
1              1    END  2017-04-01 00:00:02  foo1
2              2  START  2017-04-01 00:00:02   NaN
3              3  START  2017-04-01 00:00:02   NaN
4              2    END  2017-04-01 00:00:03  foo2
5              4  START  2017-04-01 00:00:03   NaN
6              3    END  2017-04-01 00:00:03  foo3
7              4    END  2017-04-01 00:00:04  foo4

Expected output:

A new DataFrame such as:

   transactionid           start_date             end_date  duration  info
0              1  2017-04-01 00:00:00  2017-04-01 00:00:02  00:00:02  foo1
1              2  2017-04-01 00:00:02  2017-04-01 00:00:03  00:00:01  foo2
2              3  2017-04-01 00:00:02  2017-04-01 00:00:03  00:00:01  foo3
3              4  2017-04-01 00:00:03  2017-04-01 00:00:04  00:00:01  foo4

What I have tried:

Since 2 consecutive lines are not always related to the same transaction, I applied a .groupby(by='transactionid') to my DataFrame. I am now stuck trying to "flatten" each group according to my needs.

Answer 1

Try this:

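# Make sure the column holds real timestamps (a no-op if parse_dates was used)
df.datetime = pd.to_datetime(df.datetime)

# Nested dict: column -> {output name: aggregation}.
# This renaming form works on pandas 0.19.2; it was deprecated in 0.20
# and removed in 1.0.
funcs = {
    'datetime':{
        'start_date':   'min',
        'end_date':     'max',
        'duration':     lambda x: x.max() - x.min(),
    },
    'info':             'last'   # last non-null value, i.e. the END line's info
}

df.groupby(by='transactionid')['datetime','info'].agg(funcs).reset_index()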

Result:

In [103]: df.groupby(by='transactionid')['datetime','info'].agg(funcs).reset_index()
Out[103]:
   transactionid          start_date            end_date  duration  last
0              1 2017-04-01 00:00:00 2017-04-01 00:00:02  00:00:02  foo1
1              2 2017-04-01 00:00:02 2017-04-01 00:00:03  00:00:01  foo2
2              3 2017-04-01 00:00:02 2017-04-01 00:00:03  00:00:01  foo3
3              4 2017-04-01 00:00:03 2017-04-01 00:00:04  00:00:01  foo4
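
For readers on a newer pandas, where the nested renaming dict above is no longer accepted, here is a minimal sketch of the same idea using named aggregation (available from pandas 0.25 onward). It is not part of the original answer: the variable name `result` and the final column ordering are my own choices, and like the answer it assumes each transaction's START timestamp is not later than its END timestamp, so `min` and `max` pick the right rows.

```python
import io

import pandas as pd

# Same sample data as in the question
df = pd.read_csv(io.StringIO('''transactionid;event;datetime;info
1;START;2017-04-01 00:00:00;
1;END;2017-04-01 00:00:02;foo1
2;START;2017-04-01 00:00:02;
3;START;2017-04-01 00:00:02;
2;END;2017-04-01 00:00:03;foo2
4;START;2017-04-01 00:00:03;
3;END;2017-04-01 00:00:03;foo3
4;END;2017-04-01 00:00:04;foo4'''), sep=';', parse_dates=['datetime'])

# Named aggregation: output_column=(input_column, aggregation)
result = (
    df.groupby('transactionid')
      .agg(
          start_date=('datetime', 'min'),   # earliest timestamp per transaction
          end_date=('datetime', 'max'),     # latest timestamp per transaction
          info=('info', 'last'),            # last non-null info per transaction
      )
      .reset_index()
)

# Duration as a Timedelta column, computed after the aggregation
result['duration'] = result['end_date'] - result['start_date']

# Reorder columns to match the layout requested in the question
result = result[['transactionid', 'start_date', 'end_date', 'duration', 'info']]
print(result)
```

Computing `duration` after the aggregation avoids the lambda and keeps the `info` column under its original name rather than `last`.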

Answer 1: 13, accepted, 2017-04-25 13:01:03
