
How to group by a pandas DataFrame on some condition

I have a pandas dataframe like the following:

buyer_id item_id order_id        date
   139      57      387     2015-12-28
   140       9      388     2015-12-28
   140      57      389     2015-12-28
   36        9      390     2015-12-28
   64       49      404     2015-12-29
   146      49      405     2015-12-29
   81       49      406     2015-12-29
   140      80      407     2015-12-30
   139      81      408     2015-12-30
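
For anyone who wants to reproduce this, the sample rows above could be built roughly as follows. This is only a sketch: the name `df` is what the answers use (the question's own snippet calls the same frame `item_wise_order`), and keeping the dates as plain strings is an assumption made for simplicity.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139],
    "item_id": [57, 9, 57, 9, 49, 49, 49, 80, 81],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408],
    "date": [
        "2015-12-28", "2015-12-28", "2015-12-28", "2015-12-28",
        "2015-12-29", "2015-12-29", "2015-12-29",
        "2015-12-30", "2015-12-30",
    ],
})

# The question's snippet refers to the same frame under this name.
item_wise_order = df
print(df)
```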

The real dataframe has many more rows. What I am trying to find out is whether introducing new dishes drives my users to come back. item_id is mapped to a dish name. What I want to see is whether a specific user orders different dishes on different days. E.g. buyer_id 140 ordered two dishes (item_id 9 and 57) on 28th Dec, and the same buyer ordered a different dish (item_id = 80) on 30th Dec, so I want to flag this user as 1.

This is how I am doing it in Python:

item_wise_order.groupby(['date', 'buyer_id'])['item_id'].apply(lambda x: x.tolist())

It gives me the following output:

date        buyer_id
2015-12-28  139                 [57]
            140                 [9,57]     
            36                  [9]
2015-12-29  64                  [49]
            146                 [49]
            81                  [49]
2015-12-30  140                 [80]
            139                 [81]

Desired output:

 buyer_id item_id order_id        date    flag
   139      57      387     2015-12-28     1
   140       9      388     2015-12-28     1
   140      57      389     2015-12-28     1
   36        9      390     2015-12-28     0
   64       49      404     2015-12-29     0 
   146      49      405     2015-12-29     0
   81       49      406     2015-12-29     0
   140      80      407     2015-12-30     1
   139      81      408     2015-12-30     1 

Similar to Anton's answer, but using apply:

users = df.groupby('buyer_id').apply(lambda r: r['item_id'].unique().shape[0] > 1 and 
                                               r['date'].unique().shape[0] > 1 )*1
df.set_index('buyer_id', inplace=True)
df['good_user'] = users

Result:

          item_id  order_id        date  good_user
buyer_id
139            57       387  2015-12-28          1
140             9       388  2015-12-28          1
140            57       389  2015-12-28          1
36              9       390  2015-12-28          0
64             49       404  2015-12-29          0
146            49       405  2015-12-29          0
81             49       406  2015-12-29          0
140            80       407  2015-12-30          1
139            81       408  2015-12-30          1
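
The same rule (more than one distinct item and more than one distinct date per buyer) can also be written with `transform`, which keeps the original row index instead of moving `buyer_id` into it. This is a sketch of an alternative rather than the code above; the column name `good_user` is reused from this answer and the sample frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139],
    "item_id": [57, 9, 57, 9, 49, 49, 49, 80, 81],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408],
    "date": [
        "2015-12-28", "2015-12-28", "2015-12-28", "2015-12-28",
        "2015-12-29", "2015-12-29", "2015-12-29",
        "2015-12-30", "2015-12-30",
    ],
})

per_buyer = df.groupby("buyer_id")
# transform("nunique") broadcasts each buyer's distinct count back to its rows.
n_items = per_buyer["item_id"].transform("nunique")
n_dates = per_buyer["date"].transform("nunique")

# Flag rows of buyers with several distinct items AND several distinct dates.
df["good_user"] = ((n_items > 1) & (n_dates > 1)).astype(int)
print(df)
```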

EDIT because I thought of another case: suppose the data shows a buyer buying the same two (or more) goods on two different days. Should this user be flagged as 1 or 0? Effectively, he/she does not choose anything different on the second date. Take buyer 81 in the following table: they buy only items 49 and 50 on both dates.

    buyer_id   item_id order_id    date
         139        57      387    2015-12-28
         140         9      388    2015-12-28
         140        57      389    2015-12-28
          36         9      390    2015-12-28
          64        49      404    2015-12-29
         146        49      405    2015-12-29
          81        49      406    2015-12-29
         140        80      407    2015-12-30
         139        81      408    2015-12-30
          81        50      406    2015-12-29
          81        49      999    2015-12-30
          81        50      999    2015-12-30

To accommodate this, here's what I came up with (kind of ugly, but it should work):

# this function is applied to all buyers
def find_good_buyers(buyer):
    # which dates the buyer has made a purchase
    buyer_dates = buyer.groupby('date')
    # a string representing the unique items purchased at each date
    # (sorted, so the order of the rows does not change the string)
    items_on_date = buyer_dates.agg({'item_id': lambda x: '-'.join(sorted(x.unique()))})
    # if there is more than 1 combination of item_id, then it means that
    # the buyer has purchased different things in different dates
    # so this buyer must be flagged to 1
    good_buyer = (len(items_on_date.groupby('item_id').groups) > 1) * 1
    return good_buyer


# item_id must be str so that '-'.join works (astype('S') yields bytes on Python 3)
df['item_id'] = df['item_id'].astype(str)
buyers = df.groupby('buyer_id') 

good_buyer = buyers.apply(find_good_buyers)
df.set_index('buyer_id', inplace=True)
df['good_buyer'] = good_buyer
df.reset_index(inplace=True)

This works for buyer 81, setting it to 0: once you group by date, both dates on which a purchase was made have the same "49-50" combination of items, so the number of combinations is 1 and the buyer is flagged 0.
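
The same "one combination of items per date" idea can be sketched without casting item_id to strings, by comparing sets of items instead of joined strings. This is an assumption-laden variant, not the code above: the names `baskets` and `n_baskets` are made up for illustration, and the frame is the extended table with buyer 81.

```python
import pandas as pd

df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139, 81, 81, 81],
    "item_id": [57, 9, 57, 9, 49, 49, 49, 80, 81, 50, 49, 50],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408, 406, 999, 999],
    "date": [
        "2015-12-28", "2015-12-28", "2015-12-28", "2015-12-28",
        "2015-12-29", "2015-12-29", "2015-12-29",
        "2015-12-30", "2015-12-30",
        "2015-12-29", "2015-12-30", "2015-12-30",
    ],
})

# One frozenset of item_ids per (buyer, date); sets ignore row order.
baskets = df.groupby(["buyer_id", "date"])["item_id"].apply(frozenset)

# Number of distinct baskets each buyer has across their dates.
n_baskets = baskets.groupby(level="buyer_id").agg(lambda s: len(set(s)))

# A buyer is flagged 1 only if at least two dates have different baskets.
df["good_buyer"] = df["buyer_id"].map((n_baskets > 1).astype(int))
print(df)
```

Buyer 81 has the basket {49, 50} on both dates, so only one distinct basket is counted and the flag stays 0.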

You could group by buyer_id, then aggregate the date and item_id columns with np.unique. Buyers with several dates and several item_ids then get np.ndarray values in those columns, while buyers with a single value get a scalar. You can find those rows with an isinstance check against np.ndarray, which gives a boolean dataframe that you can use to select the buyers of interest. By filtering the original dataframe with the obtained buyers you can then fill the flag column with loc:

import numpy as np

df_agg = df.groupby('buyer_id')[['date', 'item_id']].agg(np.unique)
df_agg = df_agg.applymap(lambda x: isinstance(x, np.ndarray))

buyers = df_agg[(df_agg['date']) & (df_agg['item_id'])].index
mask = df['buyer_id'].isin(buyers)

df['flag'] = 0
df.loc[mask, 'flag'] = 1

In [124]: df
Out[124]: 
   buyer_id  item_id  order_id        date  flag
0       139       57       387  2015-12-28     1
1       140        9       388  2015-12-28     1
2       140       57       389  2015-12-28     1
3        36        9       390  2015-12-28     0
4        64       49       404  2015-12-29     0
5       146       49       405  2015-12-29     0
6        81       49       406  2015-12-29     0
7       140       80       407  2015-12-30     1
8       139       81       408  2015-12-30     1

Output from the first and second steps:

In [146]: df.groupby('buyer_id')[['date', 'item_id']].agg(np.unique)
Out[146]: 
                              date      item_id
buyer_id                                       
36                      2015-12-28            9
64                      2015-12-29           49
81                      2015-12-29           49
139       [2015-12-28, 2015-12-30]     [57, 81]
140       [2015-12-28, 2015-12-30]  [9, 57, 80]
146                     2015-12-29           49

In [148]: df_agg.applymap(lambda x: isinstance(x, np.ndarray))
Out[148]: 
           date item_id
buyer_id               
36        False   False
64        False   False
81        False   False
139        True    True
140        True    True
146       False   False
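
The isinstance check depends on np.unique giving a scalar for single-value groups and an array otherwise, as in the output above. Whether `agg(np.unique)` behaves that way may vary with the pandas version (an assumption worth checking on your install). A sketch of the same selection that counts distinct values with `nunique` instead, keeping the `buyers` and `flag` names from this answer:

```python
import pandas as pd

df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139],
    "item_id": [57, 9, 57, 9, 49, 49, 49, 80, 81],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408],
    "date": [
        "2015-12-28", "2015-12-28", "2015-12-28", "2015-12-28",
        "2015-12-29", "2015-12-29", "2015-12-29",
        "2015-12-30", "2015-12-30",
    ],
})

# Distinct dates and items per buyer, as plain integer counts.
counts = df.groupby("buyer_id")[["date", "item_id"]].nunique()

# Buyers with more than one date and more than one item.
buyers = counts[(counts["date"] > 1) & (counts["item_id"] > 1)].index

df["flag"] = df["buyer_id"].isin(buyers).astype(int)
print(df)
```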

