
How to create multiple list aggregations using groupby on a pandas dataframe in Python?

Take the following example pandas DataFrame:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "start": ["jan1", "jan1", "jan4", "feb17", "jan4", "mar3"],
                   "end": ["jan3", "jan3", "jan21", "feb17", "jan21", "mar4"],
                   "duration": [2, 2, 17, 0, 17, 1],
                   "case_id": ["case1", "case43", "case6", "case1", "case22", "case69"]
                  })

I want to use a pandas groupby operation on the columns start, end, and duration to perform two list aggregations on the DataFrame:

  • a list of the id values for each group
  • a list of the case_id values for each group

My desired output would look like this:

start    end    duration    ids    cases
jan1     jan3   2           [1, 2] [case1, case43]
jan4     jan21  17          [3, 5] [case6, case22]
feb17    feb17  0           [4]    [case1]
mar3     mar4   1           [6]    [case69]

How to do this efficiently using pandas groupby?

I know that if I needed only one aggregation, I could do it like this:

df = df.groupby(['start', 'end', 'duration'])['id'].apply(list).to_frame()

How can I do this for multiple list aggregations? And if there are several options, which would be the least time-consuming? (The DataFrames I'm transforming are quite large.)

You will need to use pandas.groupby.agg and pass list as the aggregation for each column you want collected into lists.

To reduce the runtime, if your grouping columns are stored as categorical dtype, make sure you pass observed=True to groupby. This ensures that only groups that actually appear in the data are created, rather than every possible combination of category values (see the pandas groupby documentation on the observed parameter for details).

# Group on the key columns, collect 'id' and 'case_id' into lists,
# restore the keys as columns, and order the groups by their id lists.
res = (df.groupby(['start', 'end', 'duration'], observed=True)[['id', 'case_id']]
         .agg(list)
         .reset_index()
         .sort_values(by='id'))

Output:

res
Out[164]: 
   start    end  duration      id          case_id
1   jan1   jan3         2  [1, 2]  [case1, case43]
2   jan4  jan21        17  [3, 5]  [case6, case22]
0  feb17  feb17         0     [4]          [case1]
3   mar3   mar4         1     [6]         [case69]
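
If you also want the list columns named ids and cases as in the desired output, a named-aggregation variant (a minimal sketch, assuming pandas >= 0.25, which introduced named aggregation) does the renaming in the same step:

# Sketch: named aggregation, output column = (source column, aggregation function)
res = (df.groupby(['start', 'end', 'duration'], observed=True)
         .agg(ids=('id', list), cases=('case_id', list))
         .reset_index())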

Assuming that your number of unique categories is not too large and your dataset is not excessively big, this shouldn't be a problem. In general, processing strings takes much longer than processing numbers, so if this runs too slowly you could try converting your object columns to numeric or categorical columns and redoing the groupby (see the sketch below).
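
As a minimal sketch of that idea (assuming the string keys in start and end can simply be stored as categoricals, which are backed by integer codes internally), you could convert them before grouping:

# Sketch: convert the string key columns to categorical dtype (integer codes under the hood)
for col in ['start', 'end']:
    df[col] = df[col].astype('category')

res = (df.groupby(['start', 'end', 'duration'], observed=True)[['id', 'case_id']]
         .agg(list)
         .reset_index())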
