How to group a pandas DataFrame by some condition
I have a pandas DataFrame like the following:
buyer_id item_id order_id date
139 57 387 2015-12-28
140 9 388 2015-12-28
140 57 389 2015-12-28
36 9 390 2015-12-28
64 49 404 2015-12-29
146 49 405 2015-12-29
81 49 406 2015-12-29
140 80 407 2015-12-30
139 81 408 2015-12-30
The DataFrame above has many rows. I am trying to find out whether introducing new dishes drives my users to come back. item_id is mapped to a dish name. I want to see whether a specific user orders a different dish on a different day. For example, buyer_id 140 ordered two dishes (item_id 9 and 57) on 28 Dec and a different dish (item_id 80) on 30 Dec, so I want to flag this user as 1.
This is how I am doing it in Python:
item_wise_order.groupby(['date','buyer_id'])['item_id'].apply(lambda x:
x.tolist())
It gives me the following output:
date buyer_id
2015-12-28 139 [57]
140 [9,57]
36 [9]
2015-12-29 64 [49]
146 [49]
81 [49]
2015-12-30 140 [80]
139 [81]
Desired output:
buyer_id item_id order_id date flag
139 57 387 2015-12-28 1
140 9 388 2015-12-28 1
140 57 389 2015-12-28 1
36 9 390 2015-12-28 0
64 49 404 2015-12-29 0
146 49 405 2015-12-29 0
81 49 406 2015-12-29 0
140 80 407 2015-12-30 1
139 81 408 2015-12-30 1
Similar to Anton's answer, but using apply:
users = df.groupby('buyer_id').apply(lambda r: r['item_id'].unique().shape[0] > 1 and
r['date'].unique().shape[0] > 1 )*1
df.set_index('buyer_id', inplace=True)
df['good_user'] = users
Result:
item_id order_id date good_user
buyer_id
139 57 387 2015-12-28 1
140 9 388 2015-12-28 1
140 57 389 2015-12-28 1
36 9 390 2015-12-28 0
64 49 404 2015-12-29 0
146 49 405 2015-12-29 0
81 49 406 2015-12-29 0
140 80 407 2015-12-30 1
139 81 408 2015-12-30 1
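If keeping the original row index matters, the same rule (more than one distinct item and more than one distinct date per buyer) can probably also be expressed with transform. This is only a sketch: the DataFrame is rebuilt from the sample in the question, and the column name flag follows the desired output rather than this answer's good_user.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139],
    "item_id": [57, 9, 57, 9, 49, 49, 49, 80, 81],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408],
    "date": ["2015-12-28"] * 4 + ["2015-12-29"] * 3 + ["2015-12-30"] * 2,
})

by_buyer = df.groupby("buyer_id")

# Distinct item and date counts per buyer, broadcast back to every row.
n_items = by_buyer["item_id"].transform("nunique")
n_dates = by_buyer["date"].transform("nunique")

# 1 when the buyer has several distinct items AND several distinct dates.
df["flag"] = ((n_items > 1) & (n_dates > 1)).astype(int)
print(df)
```

Because transform returns a result aligned with the original rows, there is no need to set buyer_id as the index first.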
EDIT, because I thought of another case: suppose the data shows a buyer buying the same two (or more) items on two different days. Should this user be flagged as 1 or 0? Effectively, he/she does not choose anything different on the second date. Take buyer 81 in the following table: they buy only items 49 and 50 on both dates.
buyer_id item_id order_id date
139 57 387 2015-12-28
140 9 388 2015-12-28
140 57 389 2015-12-28
36 9 390 2015-12-28
64 49 404 2015-12-29
146 49 405 2015-12-29
81 49 406 2015-12-29
140 80 407 2015-12-30
139 81 408 2015-12-30
81 50 406 2015-12-29
81 49 999 2015-12-30
81 50 999 2015-12-30
To accommodate this, here's what I came up with (kind of ugly, but it should work):
# this function is applied to each buyer's rows
def find_good_buyers(buyer):
    # group the buyer's purchases by date
    buyer_dates = buyer.groupby('date')
    # a string representing the unique items purchased on each date;
    # items are sorted so the same set of items always yields the same string
    items_on_date = buyer_dates.agg(
        {'item_id': lambda x: '-'.join(sorted(x.unique()))})
    # if there is more than one combination of item_id, the buyer has
    # purchased different things on different dates, so flag the buyer as 1
    good_buyer = (len(items_on_date.groupby('item_id').groups) > 1) * 1
    return good_buyer

# item_id must be a string for '-'.join to work
df['item_id'] = df['item_id'].astype(str)
buyers = df.groupby('buyer_id')
good_buyer = buyers.apply(find_good_buyers)
df.set_index('buyer_id', inplace=True)
df['good_buyer'] = good_buyer
df.reset_index(inplace=True)
This sets buyer 81 to 0: once you group by date, both purchase dates have the same "49-50" combination of items, so the number of combinations is 1 and the buyer is flagged 0.
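The "combination of items per date" idea might also be written without converting item_id to a string, by collecting each date's items into a frozenset. This is a sketch under assumptions rather than the method above: the data is the extended sample with buyer 81, and the names baskets and distinct_baskets are made up for illustration.

```python
import pandas as pd

# Extended sample from the edit above, including buyer 81's repeated orders.
df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139, 81, 81, 81],
    "item_id": [57, 9, 57, 9, 49, 49, 49, 80, 81, 50, 49, 50],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408, 406, 999, 999],
    "date": ["2015-12-28"] * 4 + ["2015-12-29"] * 3 + ["2015-12-30"] * 2
            + ["2015-12-29", "2015-12-30", "2015-12-30"],
})

# One frozenset of items per (buyer, date). Frozensets ignore order, so
# identical combinations compare equal.
baskets = df.groupby(["buyer_id", "date"])["item_id"].agg(frozenset)

# Number of distinct combinations each buyer has across their dates.
distinct_baskets = baskets.groupby(level="buyer_id").agg(lambda s: len(set(s)))

# More than one distinct combination means the buyer changed what they ordered.
df["good_buyer"] = df["buyer_id"].map((distinct_baskets > 1).astype(int))
print(df)
```

As with the string version, buyer 81 ends up with a single distinct combination ({49, 50}) and is flagged 0.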
You could group by buyer_id, then aggregate the date and item_id columns with np.unique. For buyers with several dates and several item_ids you get np.ndarray values. Testing each cell with isinstance against np.ndarray gives a boolean DataFrame, which you can use to select the buyers of interest from the aggregated DataFrame. By filtering the original DataFrame with the obtained buyers, you can fill the flag rows with loc:
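import numpy as np

# unique dates and items per buyer
df_agg = df.groupby('buyer_id')[['date', 'item_id']].agg(np.unique)
# True where a buyer has several unique values (stored as an array)
df_agg = df_agg.applymap(lambda x: isinstance(x, np.ndarray))
# buyers with several dates and several items
buyers = df_agg[(df_agg['date']) & (df_agg['item_id'])].index
mask = df['buyer_id'].isin(buyers)
df['flag'] = 0
df.loc[mask, 'flag'] = 1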
In [124]: df
Out[124]:
buyer_id item_id order_id date flag
0 139 57 387 2015-12-28 1
1 140 9 388 2015-12-28 1
2 140 57 389 2015-12-28 1
3 36 9 390 2015-12-28 0
4 64 49 404 2015-12-29 0
5 146 49 405 2015-12-29 0
6 81 49 406 2015-12-29 0
7 140 80 407 2015-12-30 1
8 139 81 408 2015-12-30 1
Output from the first and second steps:
In [146]: df.groupby('buyer_id')[['date', 'item_id']].agg(np.unique)
Out[146]:
date item_id
buyer_id
36 2015-12-28 9
64 2015-12-29 49
81 2015-12-29 49
139 [2015-12-28, 2015-12-30] [57, 81]
140 [2015-12-28, 2015-12-30] [9, 57, 80]
146 2015-12-29 49
In [148]: df_agg.applymap(lambda x: isinstance(x, np.ndarray))
Out[148]:
date item_id
buyer_id
36 False False
64 False False
81 False False
139 True True
140 True True
146 False False
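The isinstance check relies on single-valued groups being stored as scalars and multi-valued groups as arrays, which I would not assume holds across pandas versions. A hedged alternative that keeps the same steps (aggregate per buyer, select buyers, fill flag with loc) is to count distinct values with nunique. This is a sketch on the question's sample data, and the name counts is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139],
    "item_id": [57, 9, 57, 9, 49, 49, 49, 80, 81],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408],
    "date": ["2015-12-28"] * 4 + ["2015-12-29"] * 3 + ["2015-12-30"] * 2,
})

# Distinct dates and items per buyer, always plain integer counts.
counts = df.groupby("buyer_id")[["date", "item_id"]].nunique()

# Buyers with more than one date and more than one item.
buyers = counts[(counts["date"] > 1) & (counts["item_id"] > 1)].index

df["flag"] = 0
df.loc[df["buyer_id"].isin(buyers), "flag"] = 1
print(df)
```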