Pandas DataFrame - How to get the most recent value of each column when grouping by another column

Question

Alright guys, I'm stumped. To be completely honest, I'm very new to manipulating dataframes using pandas.

Suppose I have the dataframe below, sorted in descending order so that the most recent entry is at the top (I've already accomplished that part in my program based on the data I have available).

We'll call it 'df_people' and it contains this data:

username    first   middle      last
jschmoe     joseph  NaN         schmoe
jdoe        jane    marie       doe
jschmoe     joseph  michael     schmoe
jdoe        jane    NaN         doe
tuser       test    NaN         user

I am trying to pare this down to show only the most recent valid data from each column for each 'username' (or, of course, leave 'NaN' if there are no valid entries).

Expected result:

username    first   middle  last
jschmoe     joseph  michael schmoe
jdoe        jane    marie   doe
tuser       test    NaN     user

In my actual dataframe I will have anywhere from 5 to 100 columns and easily over 100k rows whenever I need to run this report. While I don't expect anything to be super fast for what I'm trying to accomplish, I just wanted to give a sense of scale so you can understand how even small optimizations can make a big difference. Reliable results are always more important than having the report finish a few seconds faster! Right now I have no results... so anything is better than that...

I've tried a ton of different combinations of things by digging through this site and the pandas documentation, but I think my lack of knowledge of what pandas is capable of is severely limiting me here.

Any recommendations or ideas would be appreciated!

Answer 1

>>> df.groupby('username', as_index=False).first()
  username   first   middle    last
0     jdoe    jane    marie     doe
1  jschmoe  joseph  michael  schmoe
2    tuser    test      NaN    user
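
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data from the question (the construction of df_people is my own assumption about how the frame looks, with NaN for the missing middle names). It relies on GroupBy.first() returning the first non-null value in each column per group, which is why jschmoe ends up with 'michael' even though the top row has NaN there.

```python
import numpy as np
import pandas as pd

# Sample data from the question, most recent row first (assumed construction).
df_people = pd.DataFrame({
    "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
    "first": ["joseph", "jane", "joseph", "jane", "test"],
    "middle": [np.nan, "marie", "michael", np.nan, np.nan],
    "last": ["schmoe", "doe", "schmoe", "doe", "user"],
})

# first() takes the first non-null entry of every column within each group,
# so each column is resolved independently of the others.
latest = df_people.groupby("username", as_index=False).first()
print(latest)
```

Because the frame is already sorted newest-first, "first non-null" is the same as "most recent valid". Note that the groups come back sorted by username rather than in their original order.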

Answer 2

You can use drop_duplicates, which keeps the whole first row for each username (so a NaN in that row is not filled in from older rows),

df.drop_duplicates(subset='username')

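Or use groupby, which takes the first non-null value per column and, with sort=False, keeps the original order of the usernames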

df.groupby('username', sort=False).first().reset_index()

    username    first   middle  last
0   jschmoe     joseph  michael schmoe
1   jdoe        jane    marie   doe
2   tuser       test    NaN     user
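
To compare the two options side by side, here is a rough sketch on the sample data from the question (the df_people construction is an assumption on my part, with NaN for missing middle names). It is only meant to show where the results differ, not to be a benchmark.

```python
import numpy as np
import pandas as pd

# Sample data from the question, most recent row first (assumed construction).
df_people = pd.DataFrame({
    "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
    "first": ["joseph", "jane", "joseph", "jane", "test"],
    "middle": [np.nan, "marie", "michael", np.nan, np.nan],
    "last": ["schmoe", "doe", "schmoe", "doe", "user"],
})

# Option 1: keep the first whole row per username; NaN values stay as they are.
first_rows = df_people.drop_duplicates(subset="username")

# Option 2: first non-null value per column; sort=False keeps first-seen order.
first_valid = df_people.groupby("username", sort=False).first().reset_index()

print(first_rows)
print(first_valid)
```

On this data, the jschmoe row from drop_duplicates still has NaN in 'middle', while the groupby version picks up 'michael' from the older row. If you need the most recent valid value per column, the groupby version is the one that matches the expected result.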
