I have a dataframe that looks like this:
indx  user_id        type        date
0         123   A Level-1  2021-01-15
1         123   A Level-1  2021-01-10
2         123   A Level-2  2021-01-10
3         123   B Level-2  2021-01-11
4         123  not_ctrgzd  2021-01-10
5         124   A Level-2  2021-02-11
6         124   B Level-1  2021-01-21
7         124  B Level-1+  2021-02-11
8         125  not_ctrgzd  2021-01-31
9         126   A Level-1  2021-02-02
...
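(For reference, a minimal sketch that rebuilds the rows shown above, with dates parsed as datetimes; the trailing ... rows are omitted:)
import pandas as pd

# Reconstruct the sample frame shown above
df = pd.DataFrame({
    'indx':    range(10),
    'user_id': [123, 123, 123, 123, 123, 124, 124, 124, 125, 126],
    'type':    ['A Level-1', 'A Level-1', 'A Level-2', 'B Level-2', 'not_ctrgzd',
                'A Level-2', 'B Level-1', 'B Level-1+', 'not_ctrgzd', 'A Level-1'],
    'date':    pd.to_datetime(['2021-01-15', '2021-01-10', '2021-01-10', '2021-01-11',
                               '2021-01-10', '2021-02-11', '2021-01-21', '2021-02-11',
                               '2021-01-31', '2021-02-02']),
})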
What I need is to get the rows with the most recent dates for each unique type, i.e.
indx  user_id        type        date
0         123   A Level-1  2021-01-15
2         123   A Level-2  2021-01-10
3         123   B Level-2  2021-01-11
4         123  not_ctrgzd  2021-01-10
5         124   A Level-2  2021-02-11
6         124   B Level-1  2021-01-21
7         124  B Level-1+  2021-02-11
8         125  not_ctrgzd  2021-01-31
9         126   A Level-1  2021-02-02
And the following code block does that:
# Keep the rows whose date equals the per-(user_id, type) group maximum
idx = df.groupby(['user_id', 'type'])['date'].transform('max') == df['date']
df[idx]
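(An equivalent way to express the same filter, assuming date has been parsed with pd.to_datetime; note that idxmax keeps only one row per group when dates tie, while the mask above keeps all tied rows:)
# Pick, per (user_id, type) group, the index label of the latest date
df.loc[df.groupby(['user_id', 'type'])['date'].idxmax()]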
Now, what I can't do is get the rows with the max type value for each type (A, B, and so on), so that in the end the dataframe looks like this:
indx  user_id        type        date
2         123   A Level-2  2021-01-10
3         123   B Level-2  2021-01-11
4         123  not_ctrgzd  2021-01-10
5         124   A Level-2  2021-02-11
7         124  B Level-1+  2021-02-11
8         125  not_ctrgzd  2021-01-31
9         126   A Level-1  2021-02-02
This is because B Level-1+ is greater than B Level-1, A Level-2 is greater than A Level-1, and so on. Please note that some rows have uncategorized types (not_ctrgzd), which should be included in the final dataframe no matter what. Please don't hesitate to correct any parts that don't look reasonable to you, like the title :). Thanks!
Exactly your approach: just derive the value you are grouping by. (Note: the test data here uses shortened labels such as A1 and B2; an adaptation for the original A Level-1 format follows the output below.)
import numpy as np
# Derive the grouping key: strip the level digit from A1/A2/B1/B2
# so all levels of a letter fall into one group; leave other types untouched
idx = df.groupby(['user_id',
                  np.where(df.type.str.match(r"[AB][12]"),
                           df.type.str.replace(r"([AB])[12]", r"\1-", regex=True),
                           df.type)])['date'].transform('max') == df['date']
df[idx]
   indx  user_id        type       date
0     0      123          A1 2021-01-15
2     3      123          B2 2021-01-11
3     4      123  not_ctrgzd 2021-01-10
4     5      124          A2 2021-02-11
6     7      124          B1 2021-02-11
7     8      125  not_ctrgzd 2021-01-31
8     9      126          A1 2021-02-02
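The same trick adapted to the original A Level-1 style labels, as referenced above; a minimal sketch that assumes every categorized type is a single letter followed by " Level-", and keeps this answer's most-recent-date-per-group logic:
# Hypothetical adaptation: collapse "A Level-1", "A Level-2", ... to just "A"
base = np.where(df.type.str.match(r"[AB] Level-"),
                df.type.str.replace(r"([AB]) Level-.*", r"\1", regex=True),
                df.type)
idx = df.groupby(['user_id', base])['date'].transform('max') == df['date']
df[idx]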
You could do it this way with pd.CategoricalDtype (shown first against shortened labels like A1/B2, then adapted to the original A Level-1 format):
# Create an ordered category for the level part of type
catTypeDtype = pd.CategoricalDtype(['1', '1+', '2'], ordered=True)
# Split type into two helper columns so we can sort on the level as a category
df[['t1', 't2']] = df['type'].str.extract('(?P<t1>[AB]|(?:.*))(?P<t2>.*)')
# Change dtype from string to ordered categorical
df['t2'] = df['t2'].astype(catTypeDtype)
# Sort on the categorical level and date, highest/latest first
dfs = df.sort_values(['t2', 'date'], ascending=[False, False])
# Group by user and base type, take the first (i.e. highest) record after sorting
df_out = dfs.groupby(['user_id', 't1'], group_keys=False, as_index=False).first()\
            .drop(['t1', 't2'], axis=1)
df_out
Output:
   user_id  indx        type        date
0      123     2          A2  2021-01-10
1      123     3          B2  2021-01-11
2      123     4  not_ctrgzd  2021-01-10
3      124     5          A2  2021-02-11
4      124     6          B2  2021-01-21
5      125     8  not_ctrgzd  2021-01-31
6      126     9          A1  2021-02-02
For the original A Level-1 style labels, only the extraction pattern needs to change, so the " Level-" prefix is skipped before the level is captured:
catTypeDtype = pd.CategoricalDtype(['1', '1+', '2'], ordered=True)
# Same split as before, but skip the " Level-" prefix before capturing the level
df[['t1', 't2']] = df['type'].str.extract(r'(?P<t1>[AB]|(?:.*))(?:\sLevel-)?(?P<t2>.*)')
df['t2'] = df['t2'].astype(catTypeDtype)
dfs = df.sort_values(['t2', 'date'], ascending=[False, False])
df_out = dfs.groupby(['user_id', 't1'], group_keys=False, as_index=False).first()\
            .drop(['t1', 't2'], axis=1)
df_out
Output:
   user_id  indx        type        date
0      123     2   A Level-2  2021-01-10
1      123     3   B Level-2  2021-01-11
2      123     4  not_ctrgzd  2021-01-10
3      124     5   A Level-2  2021-02-11
4      124     7  B Level-1+  2021-02-11
5      125     8  not_ctrgzd  2021-01-31
6      126     9   A Level-1  2021-02-02
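A quick sanity check that the ordered dtype ranks the levels as intended (a hypothetical snippet, reusing the catTypeDtype defined above):
levels = pd.Series(['1', '1+', '2'], dtype=catTypeDtype)
print(levels.max())  # '2' -- highest under the declared order '1' < '1+' < '2'
print(levels.sort_values(ascending=False).tolist())  # ['2', '1+', '1']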