
How do I return only the most recent rows in Pandas?

I am working with job applications from candidates. Some candidates submit multiple applications, and my goal is to reduce the data set to only the most recent application from each candidate.

My code is as follows:

import pandas as pd

data = {'application_date' : ["9/11/2020 10:30:31", "9/11/2020 11:07:59", "9/11/2020 11:09:02", "9/14/2020 13:14:31", "9/14/2020 13:15:15"],
        'candidate_id' : ["001", "002", "002", "002", "002"]
       }

df = pd.DataFrame(data)

df['application_date'] = pd.to_datetime(df['application_date'])

df['rank_application'] = df.groupby('candidate_id')['application_date'].rank(method='first')

This returns the following:

     application_date candidate_id  rank_application
0 2020-09-11 10:30:31          001               1.0
1 2020-09-11 11:07:59          002               1.0
2 2020-09-11 11:09:02          002               2.0
3 2020-09-14 13:14:31          002               3.0
4 2020-09-14 13:15:15          002               4.0

This is where I am stuck. From here I do not know how to reduce the df to only the most recent row per candidate_id. I was originally hoping to order descending and then take the rows where rank_application == 1, but I can't figure out how.

Here's what you need:

import pandas as pd

data = {'application_date' : ["9/11/2020 10:30:31", "9/11/2020 11:07:59", "9/11/2020 11:09:02", "9/14/2020 13:14:31", "9/14/2020 13:15:15"],
        'candidate_id' : ["001", "002", "002", "002", "002"]
       }

df = pd.DataFrame(data)

df['application_date'] = pd.to_datetime(df['application_date'])

# idxmax gives the index label of the latest application for each candidate
result = df.loc[df.groupby('candidate_id')['application_date'].idxmax()]

print(result)

Result:

     application_date candidate_id
0 2020-09-11 10:30:31          001
4 2020-09-14 13:15:15          002

idxmax returns, for each candidate_id, the index label of the row with the latest application_date, and .loc[] takes that series of labels to get the appropriate rows. (.iloc[] gives the same result here only because the DataFrame has the default integer index, where labels and positions coincide.) The pd.to_datetime statement may be needed to force application_date into a suitable datetime format so that idxmax compares dates rather than strings.
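As an alternative to the idxmax lookup, the same result can be reached by sorting and dropping duplicates. This is only a sketch of a different technique, not the method used in the answer above; it reuses the sample data from the question, and the variable name latest is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "application_date": ["9/11/2020 10:30:31", "9/11/2020 11:07:59",
                         "9/11/2020 11:09:02", "9/14/2020 13:14:31",
                         "9/14/2020 13:15:15"],
    "candidate_id": ["001", "002", "002", "002", "002"],
})
df["application_date"] = pd.to_datetime(df["application_date"])

# Sort oldest to newest, then keep the last (newest) row per candidate.
latest = (
    df.sort_values("application_date")
      .drop_duplicates(subset="candidate_id", keep="last")
)
print(latest)
```

This keeps all columns of the winning row and does not depend on the index at all, which can be convenient when the index is not the default one.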

First, because this involves sorting and selecting on time data, you should convert your column to a pandas datetime with pd.to_datetime so that pandas can operate on it properly.

Then you can find the row holding the maximum (latest) value in the date series with df['application_date'].agg(pd.Series.idxmax). However, because you are looking for the latest time for each candidate_id, you need to add a groupby so that the maximum is selected per id.

df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)

If you want to select the matching rows, you can index them with loc. idxmax returns index labels, so loc is the safe choice; iloc gives the same result only with the default integer index.

df.loc[df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)]
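To see why the choice of indexer can matter, here is a small sketch with a non-default index. The index labels 10, 20, 30 and the shortened data are assumptions made up for illustration; they are not from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "application_date": pd.to_datetime(
            ["2020-09-11 10:30:31", "2020-09-11 11:07:59", "2020-09-14 13:15:15"]
        ),
        "candidate_id": ["001", "002", "002"],
    },
    index=[10, 20, 30],  # hypothetical non-default index labels
)

# idxmax returns index *labels* (10 and 30 here), not row positions.
labels = df.groupby("candidate_id")["application_date"].idxmax()

# Label-based lookup selects the intended rows.
latest = df.loc[labels]
print(latest)

# Position-based lookup fails here: 10 and 30 are not valid positions
# in a three-row DataFrame.
try:
    df.iloc[labels]
except IndexError as exc:
    print("iloc failed:", exc)
```

With the default 0..n-1 index the two indexers happen to agree, which is why the iloc form works on the question's sample data.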

I am a little late with this answer. I happened to stumble upon this post while I was searching for something similar.

This is what I usually do when I am trying to find the most recent record.

df['rank_application'] = df.groupby('candidate_id')['application_date'].rank(method='first', ascending=False)
df = df[df.rank_application == 1]

This follows the initial approach posted in the question.
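For completeness, here is a self-contained sketch of the rank-based filter run against the question's sample data. It is a variation that keeps the rank in a separate Series instead of adding a column; the names rank and latest are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "application_date": ["9/11/2020 10:30:31", "9/11/2020 11:07:59",
                         "9/11/2020 11:09:02", "9/14/2020 13:14:31",
                         "9/14/2020 13:15:15"],
    "candidate_id": ["001", "002", "002", "002", "002"],
})
df["application_date"] = pd.to_datetime(df["application_date"])

# Rank applications within each candidate, newest first.
# method="first" breaks ties by order of appearance, so exactly one row
# per candidate receives rank 1.
rank = df.groupby("candidate_id")["application_date"].rank(
    method="first", ascending=False
)

# Keep only the newest application per candidate.
latest = df[rank == 1]
print(latest)
```

Keeping the rank outside the DataFrame avoids leaving a helper column behind; assigning it to df['rank_application'] as in the answer works just as well if the rank is needed later.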
