I have a dataframe with 10 columns:
id date value
1233 2014-10-3 1.123123
3412 2015-05-31 2.123123
3123 2015-05-31 5.6234234
3123 2013-03-21 5.6234222
3412 2014-11-21 4.776666
5121 2015-08-22 5.234234
I want to group by the id column and take the latest date. But I don't want to take the maximum of the value column; I want the value from the row that holds the maximum date.
df.groupby('id').max() doesn't work. How can I solve it?
Most importantly, I want to keep all the columns in my dataset.
You can use boolean indexing to select the max date in a group and return that row per group:
df.groupby('id').apply(lambda x: x.loc[x.date == x.date.max(),['date','value']])
Or use idxmax to select the index label of the row with the maximum date in each group:
df.groupby('id').apply(lambda x: x.loc[x.date.idxmax(),['date','value']]).reset_index()
Output:
id date value
0 1233 2014-10-03 1.123123
1 3123 2015-05-31 5.623423
2 3412 2015-05-31 2.123123
3 5121 2015-08-22 5.234234
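Since the question asks to keep every column, a variant that avoids listing columns inside apply may be useful. This is only a sketch: it rebuilds the sample data from the question (with the first date written zero-padded), assumes the date column has been converted to datetime, and the names latest_idx and latest are just illustrative.

```python
import pandas as pd

# Sample data from the question; dates are parsed so that max/idxmax compare real dates.
df = pd.DataFrame({
    "id": [1233, 3412, 3123, 3123, 3412, 5121],
    "date": ["2014-10-03", "2015-05-31", "2015-05-31",
             "2013-03-21", "2014-11-21", "2015-08-22"],
    "value": [1.123123, 2.123123, 5.6234234, 5.6234222, 4.776666, 5.234234],
})
df["date"] = pd.to_datetime(df["date"])

# For each id, idxmax gives the index label of the row with the latest date.
latest_idx = df.groupby("id")["date"].idxmax()

# Selecting those labels with loc returns the full rows, so all columns are kept.
latest = df.loc[latest_idx].reset_index(drop=True)
print(latest)
```

If several rows of one id share the same latest date, idxmax picks the first of them.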
Or you can simply use sort_values followed by first:
df.sort_values(['date', 'value'], ascending=[False, True]).groupby('id').first()
Out[480]:
date value
id
1233 2014-10-03 1.123123
3123 2015-05-31 5.623423
3412 2015-05-31 2.123123
5121 2015-08-22 5.234234
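One caveat that may matter here: GroupBy.first returns the first non-null entry of each column, so if the latest row has a missing value, the result can combine values from different rows. The sketch below uses small made-up data (not from the question) to show the difference from groupby(...).head(1), which returns whole rows unchanged.

```python
import pandas as pd

# Hypothetical data: the latest row for id 1 has a missing value.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "date": pd.to_datetime(["2015-01-01", "2015-06-01", "2015-03-01"]),
    "value": [1.0, float("nan"), 3.0],
})

ordered = df.sort_values("date", ascending=False)

# first() skips the NaN and takes value 1.0 from the older row of id 1.
via_first = ordered.groupby("id").first()

# head(1) keeps the first row of each group as-is, NaN included.
via_head = ordered.groupby("id").head(1)

print(via_first)
print(via_head)
```

With no missing values, both give the same rows; head(1) also keeps id as a regular column.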
You could sort by date, then keep only the first appearance of each id.
df = df.sort_values('date', ascending=False)
most_recent = df.drop_duplicates('id', keep='first')
most_recent
Out:
id date value
0 5121 2015-08-22 5.234234
1 3412 2015-05-31 2.123123
2 3123 2015-05-31 5.623423
4 1233 2014-10-3 1.123123
If you want to return the entire row containing the maximum date, you need to use idxmax:
result_row = df.loc[df['date'].idxmax()]
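Note that idxmax returns an index label, not a position, so loc is the matching accessor; iloc only happens to agree when the index is the default 0..n-1. A small sketch, assuming a datetime date column and hypothetical non-default index labels:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [3123, 3412, 5121],
        "date": pd.to_datetime(["2013-03-21", "2014-11-21", "2015-08-22"]),
        "value": [5.6234222, 4.776666, 5.234234],
    },
    index=[10, 20, 30],  # hypothetical labels, e.g. left over from filtering
)

label = df["date"].idxmax()   # index label of the row with the latest date
result_row = df.loc[label]    # label-based lookup; a Series with every column
print(result_row)
```

This returns a single row for the whole frame, not one row per id.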