Pandas DataFrame - How to get most recent value for each column when grouped by another column

Alright guys, I'm stumped. To be completely honest, I'm very new to manipulating dataframes using pandas.

Suppose I have the dataframe below where the most recent entry is at the top, in descending order (I've already accomplished that part in my program based off of the data I have available).

We'll call it 'df_people' and it contains this data:

username    first   middle      last
jschmoe     joseph  NaN         schmoe
jdoe        jane    marie       doe
jschmoe     joseph  michael     schmoe
jdoe        jane    NaN         doe
tuser       test    NaN         user

I am trying to pare this down to show only the most recent valid value in each column for each 'username' (or, of course, leave 'NaN' if there are no valid entries).

Expected result:

username    first   middle  last
jschmoe     joseph  michael schmoe
jdoe        jane    marie   doe
tuser       test    NaN     user

In my actual dataframe I will have anywhere from 5-100 columns and easily over 100k rows whenever I need to run this report. While I don't expect anything to be super fast for what I'm trying to accomplish, I just wanted to give a sense of scale so you can understand how even small optimizations can make a big difference. Reliable results are always more important than having the report finish a few seconds faster! Right now I have no results...so anything is better than that...

I've tried out a ton of different combinations of things by scraping through this site and the pandas documentation, but I think my lack of knowledge on what all pandas is capable of is severely limiting here.

Any recommendations or ideas would be appreciated!

>>> df.groupby('username', as_index=False).first()
  username   first   middle    last
0     jdoe    jane    marie     doe
1  jschmoe  joseph  michael  schmoe
2    tuser    test      NaN    user
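For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data from the question and runs the same call. It relies on GroupBy.first() returning the first non-null value in each column per group, which is the default behaviour as far as I know. The variable name `latest` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Sample data from the question, most recent entry first.
df = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "joseph", "jane", "test"],
        "middle": [np.nan, "marie", "michael", np.nan, np.nan],
        "last": ["schmoe", "doe", "schmoe", "doe", "user"],
    }
)

# first() picks the first non-null value per column within each group,
# so jschmoe's missing middle name is taken from the older row.
# as_index=False keeps 'username' as a regular column.
latest = df.groupby("username", as_index=False).first()
print(latest)
```

Note that groupby sorts the group keys by default, so the rows come back ordered by username rather than in the original order.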

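You can use drop_duplicates, but note that it keeps the entire first row for each username, so a NaN in that row is not filled in from older rows (jschmoe's middle name would stay NaN):

df.drop_duplicates(subset='username')

To get the first non-null value in each column, use groupby instead; sort=False keeps the usernames in their original order:

df.groupby('username', sort=False).first().reset_index()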

    username    first   middle  last
0   jschmoe     joseph  michael schmoe
1   jdoe        jane    marie   doe
2   tuser       test    NaN     user
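To see how drop_duplicates and groupby behave on the sample data, here is a small side-by-side sketch. The names `first_rows` and `first_valid` are made up for illustration, and the frame is rebuilt by hand from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question, most recent entry first.
df = pd.DataFrame(
    {
        "username": ["jschmoe", "jdoe", "jschmoe", "jdoe", "tuser"],
        "first": ["joseph", "jane", "joseph", "jane", "test"],
        "middle": [np.nan, "marie", "michael", np.nan, np.nan],
        "last": ["schmoe", "doe", "schmoe", "doe", "user"],
    }
)

# drop_duplicates keeps the whole first row per username, NaN included.
first_rows = df.drop_duplicates(subset="username")

# groupby(...).first() takes the first non-null value per column;
# sort=False keeps usernames in order of first appearance.
first_valid = df.groupby("username", sort=False).first().reset_index()

print(first_rows)
print(first_valid)
```

On this data the only difference should be jschmoe's middle name: it is missing in `first_rows` and 'michael' in `first_valid`. If every most recent row were complete, the two approaches would return the same values.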
