I had posted this question and need to expand on the application. I now need to get the Nth max date for each Vendor:
#import pandas as pd
#df = pd.read_clipboard()
#df['Insert_Date'] = pd.to_datetime(df['Insert_Date'])
# used in example below
#df2 = df.sort_values(['Vendor','Insert_Date']).drop_duplicates(['Vendor'],keep='last')
Vendor Insert_Date Total
Steph 2017-10-25 2
Matt 2017-10-31 13
Chris 2017-11-03 3
Steve 2017-10-23 11
Chris 2017-10-27 3
Steve 2017-11-01 11
If I needed to get the 2nd max date, the expected output would be:
Vendor Insert_Date Total
Steph 2017-10-25 2
Steve 2017-10-23 11
Matt 2017-10-31 13
Chris 2017-10-27 3
I can easily get the 2nd max date by using df2 from the example: df.loc[~df.index.isin(df2.index)]. But if I need to get the 50th max value, that is a lot of DataFrame building just to use isin()...
I have also tried df.groupby('Vendor')['Insert_Date'].nlargest(N_HERE), which gets me close, but I then need to get the Nth value for each Vendor.
I have also tried filtering the df by Vendor: df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2). But if I try to get the second record with df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[2], it returns Timestamp('2017-11-03 00:00:00'). Instead I need to use df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[1:2]. Why must I use list slicing here and not simply [2]?
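A minimal sketch of the behavior in question, assuming the sample data above is loaded into df: nlargest keeps the original row labels, so indexing the result with an integer is a label lookup, not a positional one.

```python
import pandas as pd

# Sample data from the question (assumed layout)
df = pd.DataFrame({
    'Vendor': ['Steph', 'Matt', 'Chris', 'Steve', 'Chris', 'Steve'],
    'Insert_Date': pd.to_datetime(['2017-10-25', '2017-10-31', '2017-11-03',
                                   '2017-10-23', '2017-10-27', '2017-11-01']),
    'Total': [2, 13, 3, 11, 3, 11],
})

s = df.loc[df['Vendor'] == 'Chris', 'Insert_Date'].nlargest(2)
# s keeps the original row labels (2 and 4), so s[2] is a *label*
# lookup that happens to hit the row labeled 2 (2017-11-03),
# not the second element.
print(s[2])       # label-based: Timestamp('2017-11-03 00:00:00')
print(s.iloc[1])  # position-based: Timestamp('2017-10-27 00:00:00')
```

So rather than the [1:2] slice, .iloc[1] gets the second element by position.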
In summary: how do I return the Nth largest date by Vendor?
I might've misunderstood your initial problem. You can sort on Insert_Date, and then use groupby + apply in this manner:
n = 9
df.sort_values('Insert_Date')\
.groupby('Vendor', as_index=False).apply(lambda x: x.iloc[-n])
For your example data, it seems n = 0 does the trick.
df.sort_values('Insert_Date')\
.groupby('Vendor', as_index=False).apply(lambda x: x.iloc[0])
Vendor Insert_Date Total
0 Chris 2017-10-27 3
1 Matt 2017-10-31 13
2 Steph 2017-10-25 2
3 Steve 2017-10-23 11
Beware, this code will throw errors if your Vendor groups have fewer than n rows.
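One way to guard against that (a sketch of my own, not part of the answer above): drop the undersized groups first with groupby(...).filter, then apply the same iloc[-n] pick.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Vendor': ['Steph', 'Matt', 'Chris', 'Steve', 'Chris', 'Steve'],
    'Insert_Date': pd.to_datetime(['2017-10-25', '2017-10-31', '2017-11-03',
                                   '2017-10-23', '2017-10-27', '2017-11-01']),
    'Total': [2, 13, 3, 11, 3, 11],
})

n = 2  # nth-largest date per Vendor (1-based)

# Keep only vendors with at least n rows, then pick the nth from the end
result = (df.groupby('Vendor').filter(lambda g: len(g) >= n)
            .sort_values('Insert_Date')
            .groupby('Vendor', as_index=False)
            .apply(lambda x: x.iloc[-n]))
```

With the sample data and n = 2, Matt and Steph have only one row each, so they are dropped rather than raising an IndexError.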
I would use head (you can pick the top n; here I am using 2) and then drop_duplicates, keeping the last.
df.sort_values('Insert_Date',ascending=False).groupby('Vendor').\
head(2).drop_duplicates('Vendor',keep='last').sort_index()
Out[609]:
Vendor Insert_Date Total
0 Steph 2017-10-25 2
1 Matt 2017-10-31 13
3 Steve 2017-10-23 11
4 Chris 2017-10-27 3
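A related sketch (my own generalization, not from the answer above): after sorting descending, groupby(...).nth(n - 1) picks the nth-largest row per group directly. Note that it silently omits vendors with fewer than n rows instead of falling back to their last date, so its output differs from head + drop_duplicates for Matt and Steph.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Vendor': ['Steph', 'Matt', 'Chris', 'Steve', 'Chris', 'Steve'],
    'Insert_Date': pd.to_datetime(['2017-10-25', '2017-10-31', '2017-11-03',
                                   '2017-10-23', '2017-10-27', '2017-11-01']),
    'Total': [2, 13, 3, 11, 3, 11],
})

n = 2
# nth(n - 1) takes the nth row of each sorted group; groups with
# fewer than n rows (Matt, Steph) are simply left out of the result.
result = (df.sort_values('Insert_Date', ascending=False)
            .groupby('Vendor')
            .nth(n - 1))
```

This only returns the Chris and Steve rows (2017-10-27 and 2017-10-23) for the sample data.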
I like @COLDSPEED's answer as it's more direct. Here is one using nlargest, which involves an intermediate step of creating an nth_largest column:
n = 2
df1['nth_largest'] = df1.groupby('Vendor').Insert_Date.transform(lambda x: x.nlargest(n).min())
df1.drop_duplicates(subset = ['Vendor', 'nth_largest']).drop('Insert_Date', axis = 1)
Vendor Total nth_largest
0 Steph 2 2017-10-25
1 Matt 13 2017-10-31
2 Chris 3 2017-10-27
3 Steve 11 2017-10-23