Aggregating data based on rows in Python

Question

I have a dataset that looks like this:

      Date          | ID |  Task |   Description
2016-01-06 00:00:00 | 1  |  010  |   This is text
2016-01-06 00:10:00 | 1  |  020  |   This is text
2016-01-06 00:20:00 | 1  |  010  |   This is text
2016-01-06 01:00:00 | 1  |  020  |   This is text
2016-01-06 01:10:00 | 1  |  030  |   This is text
2016-02-06 00:00:00 | 2  |  010  |   This is text
2016-02-06 00:10:00 | 2  |  020  |   This is text
2016-02-06 00:20:00 | 2  |  010  |   This is text
2016-02-06 01:00:00 | 2  |  020  |   This is text
2016-02-06 01:01:00 | 2  |  030  |   This is text

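Task 020 usually happens after task 010, which means that when task 020 starts, task 010 has ended. The same applies to task 020: if it appears before any other task, the start of that task means 020 has stopped.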

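I need to group by Task to compute the average duration, the sum and the count for each type of Task within each ID, so I am looking for something like this: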

ID  | Task | Average | Sum | Count
1   |  010 |   25    | 50  |  2 
1   |  020 |   10    | 20  |  2
etc |  etc |   etc   | etc |  etc

There are more IDs, but I only care about tasks 010 and 020, so whatever numbers come back for them are acceptable.

Can someone help me do this in Python? It is far beyond my current skills.

I am using the Anaconda distribution.

Many thanks in advance.

Answer 1

I think a simple .groupby() is what you need. Your sample output does not show any complicated link between the timestamps and Task or ID.

counts = df.groupby(['ID','Task']).size().reset_index(name='count')

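This gives you the count for each unique ID/Task combination in your data. Summing or averaging works similarly, but you need a column to sum over.

See here for more details.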
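A self-contained sketch of the counting step, assuming the question's table is held in a DataFrame named df (rebuilt by hand here with only the ID and Task columns). size() returns a Series indexed by (ID, Task); reset_index(name='count') turns it into an ordinary table. The variable name counts is only illustrative.

```python
import pandas as pd

# Hand-built copy of the ID and Task columns from the question
df = pd.DataFrame({
    "ID": [1] * 5 + [2] * 5,
    "Task": ["010", "020", "010", "020", "030"] * 2,
})

# One row per (ID, Task) pair, with the number of rows in that group
counts = df.groupby(["ID", "Task"]).size().reset_index(name="count")
print(counts)
```

Keeping Task as a string preserves the leading zero in values such as 010.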

Answer 2

It seems you need groupby with agg, but the sample has no numeric column, so col was added:

print (df)
                  Date  ID Task   Description      col
0  2016-01-06 00:00:00   1  010  This is text        1
1  2016-01-06 00:10:00   1  020  This is text        2
2  2016-01-06 00:20:00   1  010  This is text        6
3  2016-01-06 01:00:00   1  020  This is text        1
4  2016-01-06 01:10:00   1  030  This is text        3
5  2016-02-06 00:00:00   2  010  This is text        1
6  2016-02-06 00:10:00   2  020  This is text        8
7  2016-02-06 00:20:00   2  010  This is text        9
8  2016-02-06 01:00:00   2  020  This is text        1

df = df.groupby(['ID','Task'])['col'].agg(['sum','size', 'mean']).reset_index()
print (df)
   ID Task  sum  size  mean
0   1  010    7     2   3.5
1   1  020    3     2   1.5
2   1  030    3     1   3.0
3   2  010   10     2   5.0
4   2  020    9     2   4.5

If you need to aggregate the datetimes, it is a bit more complicated, because timedeltas are needed:

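import numpy as np
import pandas as pd

# Express each date as seconds since the Unix epoch so it can be summed and averaged
df['Date'] = (pd.to_datetime(df['Date']) - pd.Timestamp(0)).dt.total_seconds()
df = (df.groupby(['ID','Task'])['Date']
        .agg(['sum','size', 'mean']).astype(np.int64).reset_index())
# Plain integers are interpreted as nanoseconds by to_timedelta
df['sum'] = pd.to_timedelta(df['sum'])
df['mean'] = pd.to_timedelta(df['mean'])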
print (df)
   ID Task             sum  size            mean
0   1  010 00:00:02.904078     2 00:00:01.452039
1   1  020 00:00:02.904081     2 00:00:01.452040
2   1  030 00:00:01.452042     1 00:00:01.452042
3   2  010 00:00:02.909434     2 00:00:01.454717
4   2  020 00:00:02.909437     2 00:00:01.454718

To find the differences in the Date column:

print (df.Date.dtypes)
object

#if dtype of column is not datetime, first convert
df.Date = pd.to_datetime(df.Date )
print (df.Date.diff())
0                NaT
1    0 days 00:10:00
2    0 days 00:10:00
3    0 days 00:40:00
4    0 days 00:10:00
5   30 days 22:50:00
6    0 days 00:10:00
7    0 days 00:10:00
8    0 days 00:40:00
9    0 days 00:01:00
Name: Date, dtype: timedelta64[ns]
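The row-to-row difference can be combined with a groupby to get the per-task durations the question asks for. The following is only a sketch under stated assumptions: a task is taken to last until the next row of the same ID starts, the last task of each ID has no known end and is dropped, and durations are reported in minutes. The column name minutes and the variable names next_start and result are illustrative, not from the original posts.

```python
import pandas as pd

# Hand-built copy of the question's data (Description omitted)
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-01-06 00:00:00", "2016-01-06 00:10:00", "2016-01-06 00:20:00",
        "2016-01-06 01:00:00", "2016-01-06 01:10:00",
        "2016-02-06 00:00:00", "2016-02-06 00:10:00", "2016-02-06 00:20:00",
        "2016-02-06 01:00:00", "2016-02-06 01:01:00",
    ]),
    "ID": [1] * 5 + [2] * 5,
    "Task": ["010", "020", "010", "020", "030"] * 2,
})

df = df.sort_values(["ID", "Date"])

# Start time of the next row within the same ID (NaT for the last row of each ID)
next_start = df.groupby("ID")["Date"].shift(-1)
df["minutes"] = (next_start - df["Date"]).dt.total_seconds() / 60

# Aggregate the durations per ID and Task, skipping rows with no known end
result = (
    df.dropna(subset=["minutes"])
      .groupby(["ID", "Task"])["minutes"]
      .agg(["mean", "sum", "count"])
      .rename(columns={"mean": "Average", "sum": "Sum", "count": "Count"})
      .reset_index()
)
print(result)
```

For ID 1 this yields an average of 25, a sum of 50 and a count of 2 for task 010, and 10, 20 and 2 for task 020, which matches the table in the question. Shifting within each ID, rather than calling diff() on the whole column, avoids the 30-day gap between the last row of ID 1 and the first row of ID 2.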

Answer 1 0 2017-04-03 11:36:06

Answer 2 0 2017-04-03 11:41:37
