How do I return only the most recent rows in Pandas?
I am working with job applications from candidates. Some candidates submit multiple applications, and my goal is to reduce the data set to only the most recent application from each candidate.
My code is as follows:
import pandas as pd
data = {'application_date' : ["9/11/2020 10:30:31", "9/11/2020 11:07:59", "9/11/2020 11:09:02", "9/14/2020 13:14:31", "9/14/2020 13:15:15"],
'candidate_id' : ["001", "002", "002", "002", "002"]
}
df = pd.DataFrame(data)
df['application_date'] = pd.to_datetime(df['application_date'])
df['rank_application'] = df.groupby('candidate_id')['application_date'].rank(method='first')
This returns the following:
application_date candidate_id rank_application
0 2020-09-11 10:30:31 001 1.0
1 2020-09-11 11:07:59 002 1.0
2 2020-09-11 11:09:02 002 2.0
3 2020-09-14 13:14:31 002 3.0
4 2020-09-14 13:15:15 002 4.0
This is where I am stuck. From here, I do not know how to reduce the df to only the most recent row per candidate_id. I was originally hoping to sort in descending order and then take the rows where rank_application = 1, but I can't figure out how.
Here's what you need:
import pandas as pd
data = {'application_date' : ["9/11/2020 10:30:31", "9/11/2020 11:07:59", "9/11/2020 11:09:02", "9/14/2020 13:14:31", "9/14/2020 13:15:15"],
'candidate_id' : ["001", "002", "002", "002", "002"]
}
df = pd.DataFrame(data)
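# Parse the strings into datetimes so that the maximum is chronological
df['application_date'] = pd.to_datetime(df['application_date'])
# idxmax returns the index label of the latest row per candidate, so select with .loc
result = df.loc[df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)]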
print(result)
Result:
application_date candidate_id
0 2020-09-11 10:30:31 001
4 2020-09-14 13:15:15 002
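The indexer takes the series of indices returned by idxmax and selects the matching rows. Because idxmax returns index labels rather than integer positions, .loc[] is the safer choice; .iloc[] gives the same rows only when the DataFrame has the default RangeIndex, as it does here. The pd.to_datetime call converts application_date from strings to a datetime dtype, which pd.Series.idxmax needs in order to find the chronologically latest value.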
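To illustrate the difference between label-based and position-based selection, here is a minimal sketch. It assumes a hypothetical DataFrame whose index labels (10, 20, 30) are not the default 0, 1, 2; those labels and the shortened data are illustrative only and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "application_date": pd.to_datetime(
            ["2020-09-11 10:30:31", "2020-09-11 11:07:59", "2020-09-14 13:15:15"]
        ),
        "candidate_id": ["001", "002", "002"],
    },
    index=[10, 20, 30],  # hypothetical non-default index labels
)

# idxmax returns the index *labels* of the latest row for each candidate
latest_labels = df.groupby("candidate_id")["application_date"].idxmax()

# .loc selects by label, so it works regardless of what the index looks like
result = df.loc[latest_labels]
print(result)

# df.iloc[latest_labels] would raise an IndexError here, because 10 and 30
# are not valid positions in a three-row frame
```

With the default index used in the question, labels and positions coincide, which is why both indexers return the same rows there.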
First, because this involves sorting and selecting on time data, you should convert the column to a pandas datetime with pd.to_datetime so that pandas compares the values chronologically.
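As a rough illustration of why the conversion matters, the sketch below compares the maximum of the raw strings with the maximum of the parsed datetimes. The two dates are hypothetical, chosen so that string order and time order disagree.

```python
import pandas as pd

# Hypothetical dates chosen so that string order and time order disagree
raw = pd.Series(["9/14/2020 13:15:15", "10/1/2020 08:00:00"])

# As strings, comparison is character by character, and "9" sorts after "1"
latest_as_string = raw.max()

# As datetimes, comparison is chronological
latest_as_datetime = pd.to_datetime(raw).max()

print(latest_as_string)    # the September date, which is not the latest
print(latest_as_datetime)  # the October date
```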
Then you can get the index of the maximum (latest) value in the column with df['application_date'].agg(pd.Series.idxmax). However, because you are looking for the latest time for each candidate ID, you need to add a groupby so that the maximum is taken per ID:
df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)
To select the full rows for those applications, index the DataFrame with the result. idxmax returns index labels, so use loc (iloc gives the same rows only with the default RangeIndex):
df.loc[df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)]
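A different technique that gives the same rows, and is not the one used in this answer, is to sort by date and keep the last row for each candidate. The sketch below uses the data from the question; the variable name latest is just for illustration.

```python
import pandas as pd

data = {
    "application_date": [
        "9/11/2020 10:30:31",
        "9/11/2020 11:07:59",
        "9/11/2020 11:09:02",
        "9/14/2020 13:14:31",
        "9/14/2020 13:15:15",
    ],
    "candidate_id": ["001", "002", "002", "002", "002"],
}
df = pd.DataFrame(data)
df["application_date"] = pd.to_datetime(df["application_date"])

# Sort oldest to newest, then keep only the last (newest) row per candidate
latest = (
    df.sort_values("application_date")
      .drop_duplicates(subset="candidate_id", keep="last")
)
print(latest)
```

This avoids the groupby entirely, at the cost of sorting the whole frame.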
I am a little late with this answer; I happened to stumble upon this post while searching for something similar.
This is what I usually do when I am trying to find the most recent record:
df['rank_application'] = df.groupby('candidate_id')['application_date'].rank(method='first', ascending=False)
df = df[df.rank_application == 1]
This follows the approach from the question, with the ranking done in descending order so that rank 1 is the most recent application.
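One detail worth knowing about the rank approach is how it treats identical timestamps. The sketch below assumes hypothetical data in which candidate 002 has two applications with the same timestamp, and compares method='first' with method='min'.

```python
import pandas as pd

# Hypothetical data: candidate 002 has two applications with the same timestamp
df = pd.DataFrame(
    {
        "application_date": pd.to_datetime(
            ["2020-09-11 10:30:31", "2020-09-14 13:15:15", "2020-09-14 13:15:15"]
        ),
        "candidate_id": ["001", "002", "002"],
    }
)

grouped = df.groupby("candidate_id")["application_date"]

# method='first' gives tied rows distinct ranks, so exactly one row per candidate is kept
rank_first = grouped.rank(method="first", ascending=False)
latest_one = df[rank_first == 1]

# method='min' gives tied rows the same rank, so every tied row is kept
rank_min = grouped.rank(method="min", ascending=False)
latest_with_ties = df[rank_min == 1]

print(len(latest_one))
print(len(latest_with_ties))
```

Which behavior is preferable depends on whether duplicate timestamps should be collapsed to one row or all kept for inspection.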