Pandas fillna by most recent date
I have a dataframe of reports on a given subject. Each report has a score, and the subjects have scores for some dates but not others. I'd like to create a new dataframe that has only the most recent score for each subject. An MRE is below. The original dataframe looks like this:
Subject Rpt_date Score_1 Score_2
0 L. Skywalker 2020-12-01 9.0 NaN
1 L. Skywalker 2020-12-06 NaN 8.0
2 L. Skywalker 2021-01-11 7.0 NaN
3 H. Solo 2020-11-19 NaN 7.0
4 H. Solo 2020-12-15 NaN 5.0
5 H. Solo 2021-01-26 4.0 NaN
6 L. Organa 2020-11-20 6.0 NaN
7 L. Organa 2020-12-01 NaN 6.0
8 L. Organa 2020-12-19 NaN 7.0
9 D. Djarin 2020-12-10 NaN 10.0
10 D. Djarin 2020-12-12 10.0 NaN
11 D. Djarin 2021-01-03 NaN 10.0
And the desired output looks like this:
Subject Score_1 Score_2
0 L. Skywalker 7.0 8.0
1 H. Solo 4.0 5.0
2 L. Organa 6.0 7.0
3 D. Djarin 10.0 10.0
My MRE technically works, but it seems like a kludge and would be very slow on a large dataframe.
import pandas as pd
import numpy as np

def merge_latest(df, col):
    # Keep only the rows where this score column has a value.
    df_temp = df.dropna(subset=[col]).copy()
    # Latest report date per subject among the remaining rows.
    df_temp['Last_rpt'] = df_temp.groupby('Subject')['Rpt_date'].transform('max')
    df_temp = df_temp[df_temp['Rpt_date'] == df_temp['Last_rpt']]
    # Map each subject to its most recent non-null value of col.
    return dict(zip(df_temp['Subject'], df_temp[col]))

df_data = {'Subject':['L. Skywalker', 'L. Skywalker', 'L. Skywalker',
'H. Solo', 'H. Solo', 'H. Solo',
'L. Organa', 'L. Organa', 'L. Organa',
'D. Djarin', 'D. Djarin', 'D. Djarin'],
'Rpt_date':['12/1/2020', '12/6/2020', '1/11/2021',
'11/19/2020', '12/15/2020', '1/26/2021',
'11/20/2020', '12/1/2020', '12/19/2020',
'12/10/2020', '12/12/2020', '1/3/2021'],
'Score_1':[9, np.nan, 7,
np.nan, np.nan, 4,
6, np.nan, np.nan,
np.nan, 10, np.nan],
'Score_2':[np.nan, 8, np.nan,
7, 5, np.nan,
np.nan, 6, 7,
10, np.nan, 10]}
df = pd.DataFrame(data=df_data)
df['Rpt_date'] = pd.to_datetime(df['Rpt_date'])
print(df)
fin_df = pd.DataFrame()
fin_df['Subject'] = df['Subject'].unique()
for c in ['Score_1', 'Score_2']:
    merge_dict = merge_latest(df, c)
    fin_df[c] = fin_df['Subject'].map(merge_dict)
print(fin_df)
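First sort by report date, then use groupby.last(). By default, last() returns the last non-null value of each column within each group, so after sorting by date it picks up each subject's most recent score in every column. This solution is vectorized.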
df['Rpt_date'] = pd.to_datetime(df['Rpt_date'])
fin_df = df.sort_values('Rpt_date').groupby('Subject', as_index=False).last()
Output:
Subject Rpt_date Score_1 Score_2
0 D. Djarin 2021-01-03 10.0 10.0
1 H. Solo 2021-01-26 4.0 5.0
2 L. Organa 2020-12-19 6.0 7.0
3 L. Skywalker 2021-01-11 7.0 8.0
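The result above keeps Rpt_date and orders the rows alphabetically by subject, whereas the desired output in the question has only the score columns and lists subjects in their original order. A minimal sketch of one way to get that shape is below. It assumes the same column names as the question; the names subject_order and latest are illustrative only and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Same data as the question, with dates written in ISO format.
df = pd.DataFrame({
    "Subject": ["L. Skywalker"] * 3 + ["H. Solo"] * 3
               + ["L. Organa"] * 3 + ["D. Djarin"] * 3,
    "Rpt_date": ["2020-12-01", "2020-12-06", "2021-01-11",
                 "2020-11-19", "2020-12-15", "2021-01-26",
                 "2020-11-20", "2020-12-01", "2020-12-19",
                 "2020-12-10", "2020-12-12", "2021-01-03"],
    "Score_1": [9, np.nan, 7, np.nan, np.nan, 4,
                6, np.nan, np.nan, np.nan, 10, np.nan],
    "Score_2": [np.nan, 8, np.nan, 7, 5, np.nan,
                np.nan, 6, 7, 10, np.nan, 10],
})
df["Rpt_date"] = pd.to_datetime(df["Rpt_date"])

# Remember the order in which subjects first appear (hypothetical helper name).
subject_order = pd.Index(df["Subject"].unique(), name="Subject")

latest = (
    df.sort_values("Rpt_date")               # oldest report first
      .groupby("Subject")[["Score_1", "Score_2"]]
      .last()                                # last non-null value per column
      .reindex(subject_order)                # restore the original subject order
      .reset_index()
)
print(latest)
```

Selecting only the score columns before calling last() is what drops Rpt_date here. The date that last() would report is simply each subject's latest report date, which may not be the date on which a given score was recorded.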