
How to create multiple list aggregations using groupby on a pandas dataframe in Python?

Take the following example pandas DataFrame:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "start": ["jan1", "jan1", "jan4", "feb17", "jan4", "mar3"],
                   "end": ["jan3", "jan3", "jan21", "feb17", "jan21", "mar4"],
                   "duration": [2, 2, 17, 0, 17, 1],
                   "case_id": ["case1", "case43", "case6", "case1", "case22", "case69"]
                  })

I want to use a pandas groupby operation on the columns start, end, and duration to perform two list aggregations on the DataFrame:

  • a list of the id values for each group
  • a list of the case_id values for each group

My desired output would look like this:

start    end    duration    ids    cases
jan1     jan3   2           [1, 2] [case1, case43]
jan4     jan21  17          [3, 5] [case6, case22]
feb17    feb17  0           [4]    [case1]
mar3     mar4   1           [6]    [case69]

How to do this efficiently using pandas groupby?

I know that if I needed only one aggregation, I could do it like this:

df = df.groupby(['start', 'end', 'duration'])['id'].apply(list).to_frame()

How can I do this for multiple list aggregations? And if there are several options, which would be the least time-consuming? (The DataFrames I'm transforming are quite large.)

You will need to use pandas.groupby.agg and pass list as the aggregation for each column you want collected into lists.

To reduce the runtime, if your grouping columns are stored as categorical dtype, make sure you pass observed=True to groupby. This ensures that only groups that actually appear in the data are created, rather than every possible combination of category values (see the pandas groupby documentation on the observed parameter for details).

# Group on the key columns, collect 'id' and 'case_id' into lists,
# restore the keys as columns, and order the groups by their id lists.
res = (df.groupby(['start', 'end', 'duration'], observed=True)[['id', 'case_id']]
         .agg(list)
         .reset_index()
         .sort_values(by='id'))

Output:

res
Out[164]: 
   start    end  duration      id          case_id
1   jan1   jan3         2  [1, 2]  [case1, case43]
2   jan4  jan21        17  [3, 5]  [case6, case22]
0  feb17  feb17         0     [4]          [case1]
3   mar3   mar4         1     [6]         [case69]
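
If you also want the list columns named ids and cases as in the desired output, a named-aggregation variant (a minimal sketch, assuming pandas >= 0.25, which introduced named aggregation) does the renaming in the same step:

# Sketch: named aggregation, output column = (source column, aggregation function)
res = (df.groupby(['start', 'end', 'duration'], observed=True)
         .agg(ids=('id', list), cases=('case_id', list))
         .reset_index())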

Assuming that your number of unique categories is not too large and your dataset is not excessively big, this shouldn't be a problem. In general, processing strings takes much longer than processing numbers, so if this runs too slowly you could try converting your object columns to numeric or categorical columns and redoing the groupby (see the sketch below).
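
As a minimal sketch of that idea (assuming the string keys in start and end can simply be stored as categoricals, which are backed by integer codes internally), you could convert them before grouping:

# Sketch: convert the string key columns to categorical dtype (integer codes under the hood)
for col in ['start', 'end']:
    df[col] = df[col].astype('category')

res = (df.groupby(['start', 'end', 'duration'], observed=True)[['id', 'case_id']]
         .agg(list)
         .reset_index())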
