Get N Largest Date Pandas

I had posted this question earlier and need to expand on the application. I now need to get the N-th largest date for each Vendor:

#import pandas as pd
#df = pd.read_clipboard()
#df['Insert_Date'] = pd.to_datetime(df['Insert_Date'])

# used in example below 
#df2 = df.sort_values(['Vendor','Insert_Date']).drop_duplicates(['Vendor'],keep='last')

Vendor  Insert_Date Total 
Steph   2017-10-25  2
Matt    2017-10-31  13
Chris   2017-11-03  3
Steve   2017-10-23  11
Chris   2017-10-27  3
Steve   2017-11-01  11

If I needed to get the 2nd max date expected output would be:

Vendor  Insert_Date Total 
Steph   2017-10-25  2
Steve   2017-10-23  11
Matt    2017-10-31  13
Chris   2017-10-27  3

I can easily get the 2nd max date by using df2 from the example: df.loc[~df.index.isin(df2.index)]. But if I need to get the 50th max value, that is a lot of DataFrame building just to use isin()...

I have also tried df.groupby('Vendor')['Insert_Date'].nlargest(N_HERE), which gets me close, but I then need to extract the N-th value for each Vendor.

I have also tried filtering out the df by Vendor:

df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)

but if I try to get the second record with df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[2], it returns Timestamp('2017-11-03 00:00:00'). Instead I need to use df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[1:2]. Why must I use slicing here and not simply [2]?
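A sketch of why this happens: nlargest returns a Series that keeps the original row labels, and on a Series with an integer index, scalar access like s[2] is a *label* lookup while slices like s[1:2] are positional. A toy Series mimicking Chris's two rows illustrates the difference:

```python
import pandas as pd

# Toy Series mimicking nlargest(2) for Chris: the values keep their
# original row labels (2 and 4), not a fresh 0, 1, ... index.
s = pd.Series(
    pd.to_datetime(['2017-11-03', '2017-10-27']),
    index=[2, 4],
)

# With an integer index, s[2] looks up the *label* 2, which here is the
# first (largest) value -- not the item in position 2.
by_label = s[2]          # Timestamp('2017-11-03 00:00:00')

# Slices are positional, as is .iloc:
by_position = s.iloc[1]  # Timestamp('2017-10-27 00:00:00')
```

Using .iloc[1] (or the slice [1:2]) is the unambiguous way to ask for "the second row by position".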

In summary: how do I return the N-th largest date by Vendor?

I might've misunderstood your initial problem. You can sort on Insert_Date, then use groupby + apply in this manner:

n = 9
df.sort_values('Insert_Date')\
          .groupby('Vendor', as_index=False).apply(lambda x: x.iloc[-n])

For your example data, it seems n = 0 (i.e. x.iloc[0], the earliest date per vendor) happens to produce the expected output, since no vendor has more than two rows.

df.sort_values('Insert_Date')\
      .groupby('Vendor', as_index=False).apply(lambda x: x.iloc[0])

  Vendor Insert_Date  Total
0  Chris  2017-10-27      3
1   Matt  2017-10-31     13
2  Steph  2017-10-25      2
3  Steve  2017-10-23     11

Beware: this code will raise an IndexError if any Vendor group has fewer than n rows.
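If groups can be shorter than n, one defensive variant (my own sketch, not part of the answer above) clamps the position so short groups fall back to their earliest date instead of raising:

```python
import pandas as pd

df = pd.DataFrame({
    'Vendor': ['Steph', 'Matt', 'Chris', 'Steve', 'Chris', 'Steve'],
    'Insert_Date': pd.to_datetime(['2017-10-25', '2017-10-31', '2017-11-03',
                                   '2017-10-23', '2017-10-27', '2017-11-01']),
    'Total': [2, 13, 3, 11, 3, 11],
})

n = 2  # 1-based: n = 1 is the latest date, n = 2 the 2nd latest, ...

# After an ascending sort, iloc[len(x) - n] is the n-th largest; clamping
# at 0 keeps groups with fewer than n rows instead of raising IndexError.
result = (df.sort_values('Insert_Date')
            .groupby('Vendor', as_index=False)
            .apply(lambda x: x.iloc[max(len(x) - n, 0)]))
```

For the example data this reproduces the expected 2nd-max output, since every one-row vendor falls back to its only date.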

I would use head (you can pick the top n; here I am using 2) and then drop_duplicates by Vendor, keeping the last row.

df.sort_values('Insert_Date',ascending=False).groupby('Vendor').\
     head(2).drop_duplicates('Vendor',keep='last').sort_index()
Out[609]: 
  Vendor Insert_Date  Total
0  Steph  2017-10-25      2
1   Matt  2017-10-31     13
3  Steve  2017-10-23     11
4  Chris  2017-10-27      3
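An alternative sketch (my own, not from the answers above) uses GroupBy.nth, which selects rows positionally per group. Note the difference from head + drop_duplicates(keep='last'): vendors with fewer than n rows are dropped entirely rather than falling back to their earliest date.

```python
import pandas as pd

df = pd.DataFrame({
    'Vendor': ['Steph', 'Matt', 'Chris', 'Steve', 'Chris', 'Steve'],
    'Insert_Date': pd.to_datetime(['2017-10-25', '2017-10-31', '2017-11-03',
                                   '2017-10-23', '2017-10-27', '2017-11-01']),
    'Total': [2, 13, 3, 11, 3, 11],
})

n = 2
# After a descending sort, the row in position n - 1 of each group is the
# n-th latest; groups with fewer than n rows are silently skipped.
nth_latest = (df.sort_values('Insert_Date', ascending=False)
                .groupby('Vendor')
                .nth(n - 1))
```

Here only Chris and Steve survive, since Steph and Matt have a single row each.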

I like @COLDSPEED's answer as it's more direct. Here is one using nlargest, which involves the intermediate step of creating an nth_largest column:

n = 2
df['nth_largest'] = df.groupby('Vendor').Insert_Date.transform(lambda x: x.nlargest(n).min())
df.drop_duplicates(subset = ['Vendor', 'nth_largest']).drop('Insert_Date', axis = 1)


    Vendor  Total   nth_largest
0   Steph   2   2017-10-25
1   Matt    13  2017-10-31
2   Chris   3   2017-10-27
3   Steve   11  2017-10-23
