I have a dataframe that looks like this:
indx  user_id        type        date
0         123   A Level-1  2021-01-15
1         123   A Level-1  2021-01-10
2         123   A Level-2  2021-01-10
3         123   B Level-2  2021-01-11
4         123  not_ctrgzd  2021-01-10
5         124   A Level-2  2021-02-11
6         124   B Level-1  2021-01-21
7         124  B Level-1+  2021-02-11
8         125  not_ctrgzd  2021-01-31
9         126   A Level-1  2021-02-02
...
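(For reference, a minimal sketch that rebuilds the rows shown above, with dates parsed as datetimes; the trailing ... rows are omitted:)
import pandas as pd

# Reconstruct the sample frame shown above
df = pd.DataFrame({
    'indx':    range(10),
    'user_id': [123, 123, 123, 123, 123, 124, 124, 124, 125, 126],
    'type':    ['A Level-1', 'A Level-1', 'A Level-2', 'B Level-2', 'not_ctrgzd',
                'A Level-2', 'B Level-1', 'B Level-1+', 'not_ctrgzd', 'A Level-1'],
    'date':    pd.to_datetime(['2021-01-15', '2021-01-10', '2021-01-10', '2021-01-11',
                               '2021-01-10', '2021-02-11', '2021-01-21', '2021-02-11',
                               '2021-01-31', '2021-02-02']),
})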
What I need is to get the rows with the most recent dates for each unique type, i.e.
indx  user_id        type        date
0         123   A Level-1  2021-01-15
2         123   A Level-2  2021-01-10
3         123   B Level-2  2021-01-11
4         123  not_ctrgzd  2021-01-10
5         124   A Level-2  2021-02-11
6         124   B Level-1  2021-01-21
7         124  B Level-1+  2021-02-11
8         125  not_ctrgzd  2021-01-31
9         126   A Level-1  2021-02-02
And the following code block does that:
# Keep the rows whose date equals the per-(user_id, type) group maximum
idx = df.groupby(['user_id', 'type'])['date'].transform('max') == df['date']
df[idx]
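(An equivalent way to express the same filter, assuming date has been parsed with pd.to_datetime; note that idxmax keeps only one row per group when dates tie, while the mask above keeps all tied rows:)
# Pick, per (user_id, type) group, the index label of the latest date
df.loc[df.groupby(['user_id', 'type'])['date'].idxmax()]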
Now, what I can't do is get the rows with the max type value for each type (A, B, and so on), so that in the end the dataframe looks like this:
indx  user_id        type        date
2         123   A Level-2  2021-01-10
3         123   B Level-2  2021-01-11
4         123  not_ctrgzd  2021-01-10
5         124   A Level-2  2021-02-11
7         124  B Level-1+  2021-02-11
8         125  not_ctrgzd  2021-01-31
9         126   A Level-1  2021-02-02
This is because B Level-1+ is greater than B Level-1, A Level-2 is greater than A Level-1, and so on. Please note that some rows have uncategorized types (not_ctrgzd), which should be included in the final dataframe no matter what. Please don't hesitate to correct any parts that don't look reasonable to you, like the title :). Thanks!
Exactly your approach: just derive the value you are grouping by. (Note: the test data here uses shortened labels such as A1 and B2; an adaptation for the original A Level-1 format follows the output below.)
import numpy as np
# Derive the grouping key: strip the level digit from A1/A2/B1/B2
# so all levels of a letter fall into one group; leave other types untouched
idx = df.groupby(['user_id',
                  np.where(df.type.str.match(r"[AB][12]"),
                           df.type.str.replace(r"([AB])[12]", r"\1-", regex=True),
                           df.type)])['date'].transform('max') == df['date']
df[idx]
   indx  user_id        type       date
0     0      123          A1 2021-01-15
2     3      123          B2 2021-01-11
3     4      123  not_ctrgzd 2021-01-10
4     5      124          A2 2021-02-11
6     7      124          B1 2021-02-11
7     8      125  not_ctrgzd 2021-01-31
8     9      126          A1 2021-02-02
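The same trick adapted to the original A Level-1 style labels, as referenced above; a minimal sketch that assumes every categorized type is a single letter followed by " Level-", and keeps this answer's most-recent-date-per-group logic:
# Hypothetical adaptation: collapse "A Level-1", "A Level-2", ... to just "A"
base = np.where(df.type.str.match(r"[AB] Level-"),
                df.type.str.replace(r"([AB]) Level-.*", r"\1", regex=True),
                df.type)
idx = df.groupby(['user_id', base])['date'].transform('max') == df['date']
df[idx]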
You could do it this way with pd.CategoricalDtype (shown first against shortened labels like A1/B2, then adapted to the original A Level-1 format):
# Create an ordered category for the level part of type
catTypeDtype = pd.CategoricalDtype(['1', '1+', '2'], ordered=True)
# Split type into two helper columns so we can sort on the level as a category
df[['t1', 't2']] = df['type'].str.extract('(?P<t1>[AB]|(?:.*))(?P<t2>.*)')
# Change dtype from string to ordered categorical
df['t2'] = df['t2'].astype(catTypeDtype)
# Sort on the categorical level and date, highest/latest first
dfs = df.sort_values(['t2', 'date'], ascending=[False, False])
# Group by user and base type, take the first (i.e. highest) record after sorting
df_out = dfs.groupby(['user_id', 't1'], group_keys=False, as_index=False).first()\
            .drop(['t1', 't2'], axis=1)
df_out
Output:
   user_id  indx        type        date
0      123     2          A2  2021-01-10
1      123     3          B2  2021-01-11
2      123     4  not_ctrgzd  2021-01-10
3      124     5          A2  2021-02-11
4      124     6          B2  2021-01-21
5      125     8  not_ctrgzd  2021-01-31
6      126     9          A1  2021-02-02
For the original A Level-1 style labels, only the extraction pattern needs to change, so the " Level-" prefix is skipped before the level is captured:
catTypeDtype = pd.CategoricalDtype(['1', '1+', '2'], ordered=True)
# Same split as before, but skip the " Level-" prefix before capturing the level
df[['t1', 't2']] = df['type'].str.extract(r'(?P<t1>[AB]|(?:.*))(?:\sLevel-)?(?P<t2>.*)')
df['t2'] = df['t2'].astype(catTypeDtype)
dfs = df.sort_values(['t2', 'date'], ascending=[False, False])
df_out = dfs.groupby(['user_id', 't1'], group_keys=False, as_index=False).first()\
            .drop(['t1', 't2'], axis=1)
df_out
Output:
   user_id  indx        type        date
0      123     2   A Level-2  2021-01-10
1      123     3   B Level-2  2021-01-11
2      123     4  not_ctrgzd  2021-01-10
3      124     5   A Level-2  2021-02-11
4      124     7  B Level-1+  2021-02-11
5      125     8  not_ctrgzd  2021-01-31
6      126     9   A Level-1  2021-02-02
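A quick sanity check that the ordered dtype ranks the levels as intended (a hypothetical snippet, reusing the catTypeDtype defined above):
levels = pd.Series(['1', '1+', '2'], dtype=catTypeDtype)
print(levels.max())  # '2' -- highest under the declared order '1' < '1+' < '2'
print(levels.sort_values(ascending=False).tolist())  # ['2', '1+', '1']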