Pandas DataFrame - How to get most recent value for each column when grouped by another column
Alright guys, I'm stumped. To be completely honest, I'm very new to manipulating dataframes using pandas.
Suppose I have the dataframe below, sorted in descending order so that the most recent entry is at the top (I've already accomplished that part in my program based on the data I have available).
We'll call it 'df_people' and it contains this data:
username first middle last
jschmoe joseph NaN schmoe
jdoe jane marie doe
jschmoe joseph michael schmoe
jdoe jane NaN doe
tuser test NaN user
I am trying to pare this down to show only the most recent valid value in each column for each 'username' (or, of course, leave 'NaN' if there are no valid entries).
Expected result:
username first middle last
jschmoe joseph michael schmoe
jdoe jane marie doe
tuser test NaN user
In my actual dataframe I will have anywhere from 5-100 columns and easily over 100k rows whenever I need to run this report. While I don't expect anything to be super fast for what I'm trying to accomplish, I just wanted to give a sense of scale so you can understand how even small optimizations can make a big difference. Reliable results are always more important than having the report finish a few seconds faster! Right now I have no results...so anything is better than that...
I've tried out a ton of different combinations of things by digging through this site and the pandas documentation, but I think my lack of knowledge of everything pandas is capable of is severely limiting me here.
Any recommendations or ideas would be appreciated!
>>> df.groupby('username', as_index=False).first()
username first middle last
0 jdoe jane marie doe
1 jschmoe joseph michael schmoe
2 tuser test NaN user
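For anyone who wants to reproduce the answer above, here is a minimal sketch. It assumes the sample frame from the question is rebuilt by hand with real missing values (np.nan) rather than the literal string 'NaN'; the variable names are only illustrative. It relies on GroupBy.first() returning, per column, the first non-null value in each group.

```python
import numpy as np
import pandas as pd

# Sample data from the question, newest rows first.
# Assumption: missing middle names are real NaN values, not the string "NaN".
df = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "joseph", "jane", "test"],
        "middle": [np.nan, "marie", "michael", np.nan, np.nan],
        "last": ["schmoe", "doe", "schmoe", "doe", "user"],
    }
)

# For every username, take the first non-null value found in each column.
# Because the rows are already newest-first, that is the most recent valid value.
result = df.groupby("username", as_index=False).first()
print(result)
```

Note that with the default sort=True the groups come back ordered alphabetically by username, not in the original newest-first order.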
You can use drop_duplicates, which keeps the first (most recent) whole row for each username, so a NaN in that row is not filled in from older rows:
df.drop_duplicates(subset='username')
Or use groupby:
df.groupby('username', sort=False).first().reset_index()
username first middle last
0 jschmoe joseph michael schmoe
1 jdoe jane marie doe
2 tuser test NaN user
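The two suggestions in this answer are not equivalent on the question's sample data, which a quick side-by-side sketch can show. As before, this assumes the frame is rebuilt by hand with real NaN values; the names deduped and merged are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, newest rows first (NaN assumed to be real missing values).
df = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "joseph", "jane", "test"],
        "middle": [np.nan, "marie", "michael", np.nan, np.nan],
        "last": ["schmoe", "doe", "schmoe", "doe", "user"],
    }
)

# drop_duplicates keeps the first whole row per username, NaN included,
# so jschmoe's middle name stays missing.
deduped = df.drop_duplicates(subset="username")

# groupby(...).first() picks the first non-null value column by column;
# sort=False keeps the usernames in order of first appearance.
merged = df.groupby("username", sort=False).first().reset_index()

print(deduped)
print(merged)
```

Only the groupby version fills jschmoe's middle name with 'michael' from the older row, which is what the expected result in the question asks for.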