Take the following example pandas DataFrame:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
"start": ["jan1", "jan1", "jan4", "feb17", "jan4", "mar3"],
"end": ["jan3", "jan3", "jan21", "feb17", "jan21", "mar4"],
"duration": [2, 2, 17, 0, 17, 1],
"case_id": ["case1", "case43", "case6", "case1", "case22", "case69"]
})
I want to use a pandas groupby operation on columns start
, end
and duration
to perform two list aggregations on the dataframe:
id
values for each groupcase_id
values for each groupMy desired output would look like this:
start end duration ids cases
jan1 jan3 2 [1, 2] [case1, case43]
jan4 jan21 17 [3, 5] [case6, case22]
feb17 feb17 0 [4] [case1]
mar3 mar4 1 [6] [case69]
How to do this efficiently using pandas groupby?
I know that if I would need only one aggregation I could do it like this:
df = df.groupby(['start', 'end', 'duration'])['id'].apply(list).to_frame()
How to do this for multiple list aggregations? And if there are multiple options, what would be the least time consuming? (the DataFrames I'm transforming are quite large)
You will need to use pandas.groupby.agg
, and specify the columns you want to return as list
.
To lessen the time needed, since you have categorical columns in your data, make sure you use the observed=True
option in your groupby command. This makes sure it only creates lines where an entry is present (more information on this here )
res = df.groupby(['start', 'end', 'duration'],observed=True)[['id','case_id']].agg(list).reset_index().sort_values(by='id')
Output :
res
Out[164]:
start end duration id case_id
1 jan1 jan3 2 [1, 2] [case1, case43]
2 jan4 jan21 17 [3, 5] [case6, case22]
0 feb17 feb17 0 [4] [case1]
3 mar3 mar4 1 [6] [case69]
Assuming that your unique categories are not too many and your dataset is not excessively large, this shouldn't be a problem. Generally, processing strings takes a lot longer than processing numbers, so if this takes too long to run, you could try converting your object
columns to numeric columns and re-doing your groupby
.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.