
Aggregate Data Based on Rows in Python

I have a data set that looks like this:

      Date          | ID |  Task |   Description
2016-01-06 00:00:00 | 1  |  010  |   This is text
2016-01-06 00:10:00 | 1  |  020  |   This is text
2016-01-06 00:20:00 | 1  |  010  |   This is text
2016-01-06 01:00:00 | 1  |  020  |   This is text
2016-01-06 01:10:00 | 1  |  030  |   This is text
2016-02-06 00:00:00 | 2  |  010  |   This is text
2016-02-06 00:10:00 | 2  |  020  |   This is text
2016-02-06 00:20:00 | 2  |  010  |   This is text
2016-02-06 01:00:00 | 2  |  020  |   This is text
2016-02-06 01:01:00 | 2  |  030  |   This is text

Task 020 usually occurs after Task 010, which means that when Task 020 starts, Task 010 ends. The same applies to Task 020: if it comes before any other task, it has stopped when that task starts.

I need to group by Task and calculate the average duration, total sum and count of each type of task for each ID, so I am looking for something like this:

ID  | Task | Average | Sum | Count
1   |  010 |   25    | 50  |  2 
1   |  020 |   10    | 20  |  2
etc |  etc |   etc   | etc |  etc

There are more Task types, but I only care about 010 and 020, so whatever number is returned for the others is acceptable.

Can someone please help me do this in Python? This is well beyond my current skills.

I am using the Anaconda distribution.

Many thanks in advance.

I think a simple .groupby() is what you need. Your sample output doesn't show any complicated linking between timestamps and Task or ID.

counts = df.groupby(['ID', 'Task']).size().reset_index(name='count')

will give you the count of each unique ID/Task in your data. To do a sum or average, it's similar, but you need a column with something to sum.

See here for more details.
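For example, if there were a numeric column to aggregate (say a hypothetical duration_minutes column, which is not in the posted data), the sum and mean follow the same pattern:

# sketch only: 'duration_minutes' is a made-up numeric column for illustration
stats = df.groupby(['ID', 'Task'])['duration_minutes'].agg(['mean', 'sum', 'size'])
print (stats)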

It seems you need agg with groupby, but the sample has no numeric column, so column col was added:

print (df)
                  Date  ID Task   Description      col
0  2016-01-06 00:00:00   1  010  This is text        1
1  2016-01-06 00:10:00   1  020  This is text        2
2  2016-01-06 00:20:00   1  010  This is text        6
3  2016-01-06 01:00:00   1  020  This is text        1
4  2016-01-06 01:10:00   1  030  This is text        3
5  2016-02-06 00:00:00   2  010  This is text        1
6  2016-02-06 00:10:00   2  020  This is text        8
7  2016-02-06 00:20:00   2  010  This is text        9
8  2016-02-06 01:00:00   2  020  This is text        1

df = df.groupby(['ID','Task'])['col'].agg(['sum','size', 'mean']).reset_index()
print (df)
   ID Task  sum  size  mean
0   1  010    7     2   3.5
1   1  020    3     2   1.5
2   1  030    3     1   3.0
3   2  010   10     2   5.0
4   2  020    9     2   4.5
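If you want the column names from the question (Average, Sum, Count), the same result only needs a rename; based on the frame printed above, the values stay identical:

df = df.rename(columns={'sum': 'Sum', 'size': 'Count', 'mean': 'Average'})
print (df)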

If you need to aggregate the datetime column, it is a bit more complicated, because you need timedeltas:

import numpy as np

# convert Date to seconds since the Unix epoch so it can be aggregated numerically
df.Date = (pd.to_datetime(df.Date) - pd.Timestamp(0)).dt.total_seconds()
df = (df.groupby(['ID', 'Task'])['Date']
        .agg(['sum', 'size', 'mean'])
        .astype(np.int64)
        .reset_index())
# to_timedelta with no unit treats the integers as nanoseconds
df['sum'] = pd.to_timedelta(df['sum'])
df['mean'] = pd.to_timedelta(df['mean'])
print (df)
   ID Task             sum  size            mean
0   1  010 00:00:02.904078     2 00:00:01.452039
1   1  020 00:00:02.904081     2 00:00:01.452040
2   1  030 00:00:01.452042     1 00:00:01.452042
3   2  010 00:00:02.909434     2 00:00:01.454717
4   2  020 00:00:02.909437     2 00:00:01.454718
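Note that the sum and mean above are epoch seconds re-interpreted as nanoseconds by to_timedelta (it defaults to unit='ns' for plain integers), which is why they print as a few seconds. Passing unit='s' keeps them on the seconds scale; a minimal sketch of only the two conversion lines:

# assumption: 'sum' and 'mean' still hold the int64 epoch-second aggregates
df['sum'] = pd.to_timedelta(df['sum'], unit='s')
df['mean'] = pd.to_timedelta(df['mean'], unit='s')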

For finding the difference between consecutive values in the Date column:

print (df.Date.dtypes)
object

# if the dtype of the column is not datetime, convert it first
df.Date = pd.to_datetime(df.Date)
print (df.Date.diff())
0                NaT
1    0 days 00:10:00
2    0 days 00:10:00
3    0 days 00:40:00
4    0 days 00:10:00
5   30 days 22:50:00
6    0 days 00:10:00
7    0 days 00:10:00
8    0 days 00:40:00
9    0 days 00:01:00
Name: Date, dtype: timedelta64[ns]
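Putting the diff and the groupby together: one way to get the Average/Sum/Count the question asks for is to treat each row's duration as the time until the next row of the same ID (an assumption about the data, since the last task of each ID has no end time and drops out of the count); a minimal sketch:

df.Date = pd.to_datetime(df.Date)
df = df.sort_values(['ID', 'Date'])
# duration of a row = time until the next row within the same ID (NaT for the last row of each ID)
df['duration'] = df.groupby('ID')['Date'].shift(-1) - df['Date']
out = (df.groupby(['ID', 'Task'])['duration']
         .agg(['mean', 'sum', 'count'])
         .reset_index()
         .rename(columns={'mean': 'Average', 'sum': 'Sum', 'count': 'Count'}))
print (out)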
