I have a dataframe that looks like this:
user_id  product_id  created_at
1        100         2019-04-21 20:20:00
1        100         2019-04-23 00:10:00
1        200         2019-05-24 10:00:00
1        200         2020-06-24 10:10:24
2        100         2019-01-22 21:10:00
2        200         2019-04-25 20:23:30
2        300         2021-01-21 10:20:00
3        400         2019-12-21 10:20:00
3        400         2021-04-21 10:20:00
I'm trying to find the most bought product_id for each user_id. I know I can use groupby along with value_counts or pd.Series.mode to get this. Something like:
df.groupby('user_id')['product_id'].apply(lambda x: x.value_counts().index[0]).reset_index()
This works fine until I reach a user_id that has two or more product_ids with the same count (like in the sample dataframe). When this happens, I'd ideally want to select the product_id with the most recent created_at. So, for example, for user_id 1 I'd want to select product_id 200, as it was bought more recently. What would be the most efficient way to achieve this?
You can calculate both count and max on the dates, then sort on these values and drop duplicates (or use groupby().head()):
s = df.groupby(['user_id','product_id'])['created_at'].agg(['count','max'])
s.sort_values(['count','max'], ascending=False).groupby('user_id').head(1)
Output:
                    count                 max
user_id product_id
3       400             2 2021-04-21 10:20:00
1       200             2 2020-06-24 10:10:24
2       300             1 2021-01-21 10:20:00
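The drop-duplicates variant mentioned above can be sketched as a self-contained script (sample data rebuilt from the question; the variable names are illustrative):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "product_id": [100, 100, 200, 200, 100, 200, 300, 400, 400],
    "created_at": pd.to_datetime([
        "2019-04-21 20:20:00", "2019-04-23 00:10:00",
        "2019-05-24 10:00:00", "2020-06-24 10:10:24",
        "2019-01-22 21:10:00", "2019-04-25 20:23:30",
        "2021-01-21 10:20:00", "2019-12-21 10:20:00",
        "2021-04-21 10:20:00",
    ]),
})

# Purchase count and latest purchase date per (user, product) pair
s = (df.groupby(["user_id", "product_id"])["created_at"]
       .agg(["count", "max"])
       .reset_index())

# Best candidate first: highest count, then most recent date;
# keep one row per user, then restore user order for readability
result = (s.sort_values(["count", "max"], ascending=False)
           .drop_duplicates("user_id")
           .sort_values("user_id"))

print(result[["user_id", "product_id"]])
```

For user 1 this picks product 200: both 100 and 200 were bought twice, but 200's latest purchase is more recent, so it sorts first.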
(
    df.sort_values("created_at", ascending=False)
      .groupby(["user_id", "product_id"], sort=False, as_index=False)
      .count()  # "created_at" now holds each pair's purchase count;
                # rows are ordered by each pair's latest purchase
      .sort_values("created_at", ascending=False, kind="mergesort")  # most bought first; stable, so ties keep recency order
      .groupby("user_id", as_index=False)
      .head(1)
)
#    user_id  product_id  created_at
# 0        3         400           2
# 2        1         200           2
# 1        2         300           1
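A runnable version of this pipeline, with the sample frame rebuilt from the question. Note the stable sort on the count column: without it, head(1) would pick a user's most recently bought product even when another product was bought more often, since groupby(sort=False) orders groups by first appearance in the date-sorted frame.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "product_id": [100, 100, 200, 200, 100, 200, 300, 400, 400],
    "created_at": pd.to_datetime([
        "2019-04-21 20:20:00", "2019-04-23 00:10:00",
        "2019-05-24 10:00:00", "2020-06-24 10:10:24",
        "2019-01-22 21:10:00", "2019-04-25 20:23:30",
        "2021-01-21 10:20:00", "2019-12-21 10:20:00",
        "2021-04-21 10:20:00",
    ]),
})

out = (
    df.sort_values("created_at", ascending=False)          # newest purchases first
      .groupby(["user_id", "product_id"], sort=False, as_index=False)
      .count()                                             # "created_at" becomes the purchase count
      .sort_values("created_at", ascending=False, kind="mergesort")  # stable: ties keep recency order
      .groupby("user_id", as_index=False)
      .head(1)                                             # first remaining row per user
)
print(out)
```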