How to return only the most recent row in Pandas?

Question

I am working with job applications from candidates. Some candidates submit multiple applications, and my goal is to reduce the data set to only the most recent application from each candidate.

My code is as follows:

import pandas as pd

data = {'application_date' : ["9/11/2020 10:30:31", "9/11/2020 11:07:59", "9/11/2020 11:09:02", "9/14/2020 13:14:31", "9/14/2020 13:15:15"],
        'candidate_id' : ["001", "002", "002", "002", "002"]
       }

df = pd.DataFrame(data)

df['application_date'] = pd.to_datetime(df['application_date'])

df['rank_application'] = df.groupby('candidate_id')['application_date'].rank(method='first')

This returns the following:

     application_date candidate_id  rank_application
0 2020-09-11 10:30:31          001               1.0
1 2020-09-11 11:07:59          002               1.0
2 2020-09-11 11:09:02          002               2.0
3 2020-09-14 13:14:31          002               3.0
4 2020-09-14 13:15:15          002               4.0

This is where I am stuck. From here I do not know how to reduce the df to only the most recent row per candidate_id. I was originally hoping to order descending and then take the rows where rank_application == 1, but I can't figure out how.

Answer 1

Here's what you need:

import pandas as pd

data = {'application_date' : ["9/11/2020 10:30:31", "9/11/2020 11:07:59", "9/11/2020 11:09:02", "9/14/2020 13:14:31", "9/14/2020 13:15:15"],
        'candidate_id' : ["001", "002", "002", "002", "002"]
       }

df = pd.DataFrame(data)

df['application_date'] = pd.to_datetime(df['application_date'])

result = df.iloc[df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)]

print(result)

Result:

     application_date candidate_id
0 2020-09-11 10:30:31          001
4 2020-09-14 13:15:15          002

.iloc[] takes a sequence of integer positions and returns the matching rows. Note that idxmax returns index labels, which coincide with positions here only because df has the default integer index; with any other index, .loc[] is the appropriate indexer. The pd.to_datetime statement is needed so that application_date is a real datetime column and pd.Series.idxmax compares timestamps rather than strings.
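To illustrate the point about labels versus positions, here is a minimal sketch of the same selection using .loc[]. The non-default index values (10 to 14) are hypothetical and were chosen only to show a case where labels and positions differ; they are not part of the original data.

```python
import pandas as pd

data = {
    "application_date": ["9/11/2020 10:30:31", "9/11/2020 11:07:59",
                         "9/11/2020 11:09:02", "9/14/2020 13:14:31",
                         "9/14/2020 13:15:15"],
    "candidate_id": ["001", "002", "002", "002", "002"],
}

# Hypothetical non-default index labels, used only for illustration.
df = pd.DataFrame(data, index=[10, 11, 12, 13, 14])
df["application_date"] = pd.to_datetime(df["application_date"])

# idxmax returns the index *label* of the latest date in each group.
latest_labels = df.groupby("candidate_id")["application_date"].idxmax()

# .loc selects by label, so this works whatever the index looks like.
result = df.loc[latest_labels]
print(result)
```

With this index, df.iloc[latest_labels] would raise an IndexError because 10 and 14 are not valid positions in a five-row frame.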

Answer 2

First, because this involves sorting and selecting on time data, you should convert your column to a pandas datetime type with pd.to_datetime so that pandas can operate on it correctly.

Then you can locate the row holding the latest value in the series with df['application_date'].agg(pd.Series.idxmax). However, because you are looking for the latest time for each candidate_id, you need to add a groupby so that the maximum is taken separately for each id.

df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)

If you want to select the corresponding rows, you can index them with iloc:

df.iloc[df.groupby('candidate_id')['application_date'].agg(pd.Series.idxmax)]
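As a small sketch of the intermediate step, the snippet below rebuilds the question's data and prints what the groupby plus idxmax expression returns before it is passed to an indexer. The variable name latest_idx is only an illustrative choice.

```python
import pandas as pd

data = {
    "application_date": ["9/11/2020 10:30:31", "9/11/2020 11:07:59",
                         "9/11/2020 11:09:02", "9/14/2020 13:14:31",
                         "9/14/2020 13:15:15"],
    "candidate_id": ["001", "002", "002", "002", "002"],
}
df = pd.DataFrame(data)
df["application_date"] = pd.to_datetime(df["application_date"])

# One entry per candidate_id: the index of that candidate's latest application.
latest_idx = df.groupby("candidate_id")["application_date"].agg(pd.Series.idxmax)
print(latest_idx)
```

The result is a Series keyed by candidate_id, here pointing at row 0 for candidate 001 and row 4 for candidate 002, which is why it can be fed straight into the row selection.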

Answer 3

I am a little late with this answer. I happened to stumble upon this post while I was searching for something similar.

This is what I usually do when I am trying to find the most recent record.

df['rank_application'] = df.groupby('candidate_id')['application_date'].rank(method='first', ascending=False)
df = df[df.rank_application == 1]

The initial approach posted in the question is what I follow.
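For completeness, here is a hedged end-to-end sketch of the rank-and-filter approach on the question's data. Dropping the helper column at the end is an optional extra step added here, not something the answer requires. Because method='first' breaks ties by order of appearance, exactly one row per candidate gets rank 1 even if two applications share a timestamp.

```python
import pandas as pd

data = {
    "application_date": ["9/11/2020 10:30:31", "9/11/2020 11:07:59",
                         "9/11/2020 11:09:02", "9/14/2020 13:14:31",
                         "9/14/2020 13:15:15"],
    "candidate_id": ["001", "002", "002", "002", "002"],
}
df = pd.DataFrame(data)
df["application_date"] = pd.to_datetime(df["application_date"])

# Rank within each candidate, newest application first (rank 1 = most recent).
df["rank_application"] = (
    df.groupby("candidate_id")["application_date"]
      .rank(method="first", ascending=False)
)

# Keep only the top-ranked row per candidate, then drop the helper column.
latest = df[df["rank_application"] == 1].drop(columns="rank_application")
print(latest)
```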
