
Pandas DataFrame - How to get most recent value for each column when grouped by another column

Alright guys, I'm stumped. To be completely honest, I'm very new to manipulating dataframes using pandas.

Suppose I have the dataframe below, where the most recent entry is at the top, in descending order (I've already accomplished that part in my program based on the data I have available).

We'll call it 'df_people' and it contains this data:

username    first   middle      last
jschmoe     joseph  NaN         schmoe
jdoe        jane    marie       doe
jschmoe     joseph  michael     schmoe
jdoe        jane    NaN         doe
tuser       test    NaN         user
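
For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: representing the missing middle names as np.nan is an assumption about how the real data is stored.

```python
import numpy as np
import pandas as pd

# Sample data from the question, newest row first.
# np.nan marks a missing middle name (an assumption about the storage format).
df_people = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "joseph", "jane", "test"],
        "middle": [np.nan, "marie", "michael", np.nan, np.nan],
        "last": ["schmoe", "doe", "schmoe", "doe", "user"],
    }
)
print(df_people)
```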

I am trying to pare this down to show only the most recent valid data from each column for each 'username' (or, of course, leave 'NaN' if there are no valid entries).

Expected result:

username    first   middle  last
jschmoe     joseph  michael schmoe
jdoe        jane    marie   doe
tuser       test    NaN     user

In my actual dataframe I will have anywhere from 5 to 100 columns and easily over 100k rows whenever I need to run this report. While I don't expect anything to be super fast for what I'm trying to accomplish, I just wanted to give a sense of scale so you can understand how even small optimizations can make a big difference. Reliable results are always more important than having the report finish a few seconds faster! Right now I have no results, so anything is better than that.

I've tried a ton of different combinations of things by digging through this site and the pandas documentation, but I think my lack of knowledge of what pandas is capable of is severely limiting me here.

Any recommendations or ideas would be appreciated!

>>> df.groupby('username', as_index=False).first()
  username   first   middle    last
0     jdoe    jane    marie     doe
1  jschmoe  joseph  michael  schmoe
2    tuser    test      NaN    user
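
As far as I understand it, this works because GroupBy.first() returns the first non-null entry of each column within a group, not the first row. Since the rows are already ordered newest first, that is the most recent valid value per column. Here is a small self-contained sketch on the question's sample data (the np.nan placeholders are an assumption). Note that groupby sorts by the group key by default, so the rows come back in alphabetical order of username rather than in the original order.

```python
import numpy as np
import pandas as pd

# Question's sample data, newest row first (np.nan for missing is an assumption).
df = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "joseph", "jane", "test"],
        "middle": [np.nan, "marie", "michael", np.nan, np.nan],
        "last": ["schmoe", "doe", "schmoe", "doe", "user"],
    }
)

# first() takes the first non-null value per column in each group,
# so jschmoe's missing middle name is filled from the older row.
result = df.groupby("username", as_index=False).first()
print(result)
```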

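You can use drop_duplicates, which keeps the first (most recent) row for each username. Note that it keeps that row as-is, so a NaN in the most recent row is not filled in from older rows: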

df.drop_duplicates(subset='username')
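
A quick sketch of what drop_duplicates does with the question's sample data may help here (the np.nan placeholders are an assumption). It keeps the first whole row per username, so jschmoe's middle name stays missing, which differs from the expected result in the question.

```python
import numpy as np
import pandas as pd

# Question's sample data, newest row first (np.nan for missing is an assumption).
df = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "joseph", "jane", "test"],
        "middle": [np.nan, "marie", "michael", np.nan, np.nan],
        "last": ["schmoe", "doe", "schmoe", "doe", "user"],
    }
)

# Keeps the first occurrence of each username as a whole row;
# missing values in that row are left untouched.
deduped = df.drop_duplicates(subset="username")
print(deduped)
```

So this is probably only suitable when the most recent row for each user is known to be complete.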

Or use groupby:

df.groupby('username', sort=False).first().reset_index()

    username    first   middle  last
0   jschmoe     joseph  michael schmoe
1   jdoe        jane    marie   doe
2   tuser       test    NaN     user
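
To check the groupby version against the expected result from the question, a self-contained sketch might look like this (again assuming np.nan marks the missing values). Passing sort=False keeps the groups in order of first appearance, which is why the row order matches the expected table.

```python
import numpy as np
import pandas as pd

# Question's sample data, newest row first (np.nan for missing is an assumption).
df = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "joseph", "jane", "test"],
        "middle": [np.nan, "marie", "michael", np.nan, np.nan],
        "last": ["schmoe", "doe", "schmoe", "doe", "user"],
    }
)

# sort=False preserves the order in which usernames first appear;
# first() picks the first non-null value per column in each group.
latest = df.groupby("username", sort=False).first().reset_index()

# Expected result as given in the question.
expected = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "test"],
        "middle": ["michael", "marie", np.nan],
        "last": ["schmoe", "doe", "user"],
    }
)

# Compare after replacing missing values with a sentinel string,
# so that missing entries on both sides compare as equal.
matches = latest.fillna("<missing>").equals(expected.fillna("<missing>"))
print(latest)
print("Matches expected result:", matches)
```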


