
How do I return only the most recent rows in Pandas?

I am working with job applications from candidates. Some candidates submit multiple applications, and my goal is to reduce the data set to only the most recent application from each candidate.

My code is as follows:

import pandas as pd

data = {'application_date' : ["9/11/2020 10:30:31", "9/11/2020 11:07:59", "9/11/2020 11:09:02", "9/14/2020 13:14:31", "9/14/2020 13:15:15"],
        'candidate_id' : ["001", "002", "002", "002", "002"]
       }

df = pd.DataFrame(data)

df['application_date'] = pd.to_datetime(df['application_date'])

df['rank_application'] = df.groupby('candidate_id')['application_date'].rank(method='first')

This returns the following:

     application_date candidate_id  rank_application
0 2020-09-11 10:30:31          001               1.0
1 2020-09-11 11:07:59          002               1.0
2 2020-09-11 11:09:02          002               2.0
3 2020-09-14 13:14:31          002               3.0
4 2020-09-14 13:15:15          002               4.0

This is where I am stuck. From here I do not know how to reduce the df to only the most recent application per candidate_id. I was originally hoping to rank in descending order and then keep only the rows where rank_application = 1, but I can't figure out how.

Here's what you need:

import pandas as pd

data = {'application_date' : ["9/11/2020 10:30:31", "9/11/2020 11:07:59", "9/11/2020 11:09:02", "9/14/2020 13:14:31", "9/14/2020 13:15:15"],
        'candidate_id' : ["001", "002", "002", "002", "002"]
       }

df = pd.DataFrame(data)

df['application_date'] = pd.to_datetime(df['application_date'])

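result = df.loc[df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)]

print(result)

Result:

     application_date candidate_id
0 2020-09-11 10:30:31          001
4 2020-09-14 13:15:15          002

idxmax returns the index label of the latest application_date within each candidate_id group, and .loc[] uses those labels to select the matching rows. (.iloc[] would return the same rows here only because the DataFrame has the default 0..n-1 index, where labels and positions coincide.) The pd.to_datetime call makes application_date a datetime column, so that idxmax compares actual timestamps rather than raw strings.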
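
To illustrate why label-based selection matters with idxmax, here is a minimal sketch. The string index labels "a", "b", "c" are made up for the example and do not come from the question; the point is only that idxmax hands back index labels, which .loc understands regardless of the index type.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "application_date": pd.to_datetime(
            ["2020-09-11 10:30:31", "2020-09-11 11:07:59", "2020-09-14 13:15:15"]
        ),
        "candidate_id": ["001", "002", "002"],
    },
    index=["a", "b", "c"],  # hypothetical non-default index labels
)

# For each candidate, the index LABEL of the row with the latest date
latest_labels = df.groupby("candidate_id")["application_date"].idxmax()

# .loc selects by label, so this works for any index
latest = df.loc[latest_labels]
print(latest)
```

With an index like this, passing latest_labels to .iloc would fail, because .iloc expects integer positions.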

First, because this involves sorting and selecting on time data, convert the column to a pandas datetime with pd.to_datetime so that pandas can compare the values correctly.

Then, df['application_date'].agg(pd.Series.idxmax) gives the index of the maximum (latest) value in the whole column. However, because you want the latest time for each candidate_id, add a groupby so that the maximum is taken within each id:

df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)

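To select the full rows for those latest application dates, index df with the result. Since idxmax returns index labels, .loc is the safer choice (.iloc gives the same rows here only because of the default integer index):

df.loc[df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)]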
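
The steps above can be sketched end to end on the data from the question, printing the intermediate result so it is clear what the groupby produces. The variable names latest_idx and latest_rows are only illustrative, and .loc is used because idxmax returns index labels.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "application_date": [
            "9/11/2020 10:30:31",
            "9/11/2020 11:07:59",
            "9/11/2020 11:09:02",
            "9/14/2020 13:14:31",
            "9/14/2020 13:15:15",
        ],
        "candidate_id": ["001", "002", "002", "002", "002"],
    }
)

# Step 1: make the column a real datetime so comparisons are chronological
df["application_date"] = pd.to_datetime(df["application_date"])

# Step 2: per candidate, the index label of the latest application
latest_idx = df.groupby("candidate_id")["application_date"].idxmax()
print(latest_idx)

# Step 3: use those labels to pull out the full rows
latest_rows = df.loc[latest_idx]
print(latest_rows)
```

Here latest_idx is a Series indexed by candidate_id whose values are row labels of df, which is why it can be passed straight to .loc.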

I am a little late with this answer. I happened to stumble upon this post while I was searching for something similar.

This is what I usually do when I am trying to find the most recent record.

df['rank_application'] = df.groupby('candidate_id')['application_date'].rank(method='first', ascending=False)
df = df[df.rank_application == 1]

This follows the approach started in the question; the only changes are ranking in descending order and keeping the rows where the rank is 1.
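
For completeness, a runnable sketch of this rank-based approach on the question's data. Dropping the helper column at the end is an optional addition here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "application_date": [
            "9/11/2020 10:30:31",
            "9/11/2020 11:07:59",
            "9/11/2020 11:09:02",
            "9/14/2020 13:14:31",
            "9/14/2020 13:15:15",
        ],
        "candidate_id": ["001", "002", "002", "002", "002"],
    }
)
df["application_date"] = pd.to_datetime(df["application_date"])

# Rank within each candidate, newest first; method="first" breaks ties
# by order of appearance, so exactly one row per candidate gets rank 1
df["rank_application"] = df.groupby("candidate_id")["application_date"].rank(
    method="first", ascending=False
)

# Keep only the top-ranked (most recent) row per candidate,
# then drop the helper column (optional)
latest = df[df["rank_application"] == 1].drop(columns="rank_application")
print(latest)
```

One practical difference from the idxmax approach: the rank column makes it easy to keep the N most recent applications per candidate by filtering on rank_application <= N instead.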


 