How to group a pandas dataframe by some condition

I have a pandas dataframe like the following:

buyer_id item_id order_id        date
   139      57      387     2015-12-28
   140       9      388     2015-12-28
   140      57      389     2015-12-28
   36        9      390     2015-12-28
   64       49      404     2015-12-29
   146      49      405     2015-12-29
   81       49      406     2015-12-29
   140      80      407     2015-12-30
   139      81      408     2015-12-30

The real dataframe has many more rows. I am trying to find out whether introducing new dishes drives my users to come back. Each item_id maps to a dish name. I want to see whether a given user orders a different dish on a different day. For example, buyer_id 140 ordered two dishes (item_id 9 and 57) on 28 Dec and a different dish (item_id 80) on 30 Dec, so I want to flag this user as 1.

This is how I am doing it in Python:

item_wise_order.groupby(['date', 'buyer_id'])['item_id'].apply(lambda x: x.tolist())

It gives me the following output:

date        buyer_id
2015-12-28  139                 [57]
            140                 [9,57]     
            36                  [9]
2015-12-29  64                  [49]
            146                 [49]
            81                  [49]
2015-12-30  140                 [80]
            139                 [81]

Desired output

 buyer_id item_id order_id        date    flag
   139      57      387     2015-12-28     1
   140       9      388     2015-12-28     1
   140      57      389     2015-12-28     1
   36        9      390     2015-12-28     0
   64       49      404     2015-12-29     0 
   146      49      405     2015-12-29     0
   81       49      406     2015-12-29     0
   140      80      407     2015-12-30     1
   139      81      408     2015-12-30     1 

Similar to Anton's answer, but using apply

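# per buyer: 1 if they bought more than one distinct item on more than one distinct date
users = df.groupby('buyer_id').apply(lambda r: r['item_id'].unique().shape[0] > 1 and
                                               r['date'].unique().shape[0] > 1) * 1
# index by buyer_id so the per-buyer flags align when assigned as a column
df.set_index('buyer_id', inplace=True)
df['good_user'] = users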

result:

          item_id  order_id        date  good_user
buyer_id
139            57       387  2015-12-28          1
140             9       388  2015-12-28          1
140            57       389  2015-12-28          1
36              9       390  2015-12-28          0
64             49       404  2015-12-29          0
146            49       405  2015-12-29          0
81             49       406  2015-12-29          0
140            80       407  2015-12-30          1
139            81       408  2015-12-30          1
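
As a possible alternative, the same flag can be computed without changing the index by using transform, which returns one value per original row. This is only a sketch: it rebuilds the question's sample data inline, and the names by_buyer, many_items and many_dates are introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139],
    "item_id":  [57, 9, 57, 9, 49, 49, 49, 80, 81],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408],
    "date": ["2015-12-28"] * 4 + ["2015-12-29"] * 3 + ["2015-12-30"] * 2,
})

by_buyer = df.groupby("buyer_id")
# transform("nunique") broadcasts each buyer's distinct count back to every row
many_items = by_buyer["item_id"].transform("nunique") > 1
many_dates = by_buyer["date"].transform("nunique") > 1
df["good_user"] = (many_items & many_dates).astype(int)

print(df)
```

Because the result is already aligned with the original rows, buyer_id stays a regular column and no set_index call is needed.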

EDIT, because I thought of another case: suppose a buyer buys the same two (or more) items on two different days. Should this user be flagged 1 or 0? Effectively, they did not choose anything different on the second date. Take buyer 81 in the following table: they buy only items 49 and 50 on both dates.

    buyer_id   item_id order_id    date
         139        57      387    2015-12-28
         140         9      388    2015-12-28
         140        57      389    2015-12-28
          36         9      390    2015-12-28
          64        49      404    2015-12-29
         146        49      405    2015-12-29
          81        49      406    2015-12-29
         140        80      407    2015-12-30
         139        81      408    2015-12-30
          81        50      406    2015-12-29
          81        49      999    2015-12-30
          81        50      999    2015-12-30

To accommodate this, here's what I came up with (kind of ugly, but it should work):

# this function is applied to each buyer's rows
def find_good_buyers(buyer):
    # group the buyer's purchases by date
    buyer_dates = buyer.groupby('date')
    # a string representing the unique items purchased on each date
    # (sorted so the same set of items always gives the same string)
    items_on_date = buyer_dates.agg({'item_id': lambda x: '-'.join(sorted(x.unique()))})
    # if there is more than one combination of item_id, the buyer
    # purchased different things on different dates and is flagged 1
    good_buyer = (len(items_on_date.groupby('item_id').groups) > 1) * 1
    return good_buyer


# item_id must be a string for '-'.join; astype('S') would give bytes on Python 3
df['item_id'] = df['item_id'].astype(str)
buyers = df.groupby('buyer_id')

good_buyer = buyers.apply(find_good_buyers)
df.set_index('buyer_id', inplace=True)
df['good_buyer'] = good_buyer
df.reset_index(inplace=True)

This works on buyer 81, setting the flag to 0: once you group by date, both purchase dates have the same "49-50" combination of items, so the number of combinations is 1 and the buyer is flagged 0.
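
The same idea (compare the set of items bought on each date) might also be sketched without string joining, by storing each day's purchases as a frozenset. This is an illustrative variant, not the code above; the names baskets and distinct_baskets are made up here, and the data is the extended table with buyer 81.

```python
import pandas as pd

# Extended sample data from the EDIT, including the extra rows for buyer 81
df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139, 81, 81, 81],
    "item_id":  [57, 9, 57, 9, 49, 49, 49, 80, 81, 50, 49, 50],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408, 406, 999, 999],
    "date": ["2015-12-28"] * 4 + ["2015-12-29"] * 3 + ["2015-12-30"] * 2
            + ["2015-12-29", "2015-12-30", "2015-12-30"],
})

# One frozenset of items per (buyer, date); frozensets ignore order and duplicates
baskets = df.groupby(["buyer_id", "date"])["item_id"].agg(lambda s: frozenset(s))

# Number of different baskets each buyer has across their purchase dates
distinct_baskets = baskets.groupby(level="buyer_id").agg(lambda s: len(set(s)))

# Flag buyers with more than one distinct basket, mapped back onto the rows
df["good_buyer"] = df["buyer_id"].map(distinct_baskets > 1).astype(int)

print(df)
```

Since item_id is never converted to a string, the column keeps its original integer dtype.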

You could group by buyer_id, then aggregate the date and item_id columns with np.unique. For buyers with several dates or item_ids you get an np.ndarray; for the others you get a single value. Checking each cell with isinstance(x, np.ndarray) gives a boolean dataframe, from which you can select the buyers of interest. By filtering the original dataframe on those buyers, you can fill in the flag column with loc:

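import numpy as np

# unique dates and items per buyer: an array if several, a single value otherwise
df_agg = df.groupby('buyer_id')[['date', 'item_id']].agg(np.unique)
# True where the cell holds several values
# (applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
df_agg = df_agg.applymap(lambda x: isinstance(x, np.ndarray))

# buyers with several dates and several items
buyers = df_agg[(df_agg['date']) & (df_agg['item_id'])].index
mask = df['buyer_id'].isin(buyers)

df['flag'] = 0
df.loc[mask, 'flag'] = 1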

In [124]: df
Out[124]: 
   buyer_id  item_id  order_id        date  flag
0       139       57       387  2015-12-28     1
1       140        9       388  2015-12-28     1
2       140       57       389  2015-12-28     1
3        36        9       390  2015-12-28     0
4        64       49       404  2015-12-29     0
5       146       49       405  2015-12-29     0
6        81       49       406  2015-12-29     0
7       140       80       407  2015-12-30     1
8       139       81       408  2015-12-30     1

Output from first and second steps:

In [146]: df.groupby('buyer_id')[['date', 'item_id']].agg(np.unique)
Out[146]: 
                              date      item_id
buyer_id                                       
36                      2015-12-28            9
64                      2015-12-29           49
81                      2015-12-29           49
139       [2015-12-28, 2015-12-30]     [57, 81]
140       [2015-12-28, 2015-12-30]  [9, 57, 80]
146                     2015-12-29           49

In [148]: df_agg.applymap(lambda x: isinstance(x, np.ndarray))
Out[148]: 
           date item_id
buyer_id               
36        False   False
64        False   False
81        False   False
139        True    True
140        True    True
146       False   False
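
The isinstance check depends on np.unique returning a single value for one-element groups inside agg, which may not hold on every pandas version. A hedged sketch of the same two steps using distinct counts instead, with the question's sample data rebuilt inline (the name counts is introduced here for illustration):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "buyer_id": [139, 140, 140, 36, 64, 146, 81, 140, 139],
    "item_id":  [57, 9, 57, 9, 49, 49, 49, 80, 81],
    "order_id": [387, 388, 389, 390, 404, 405, 406, 407, 408],
    "date": ["2015-12-28"] * 4 + ["2015-12-29"] * 3 + ["2015-12-30"] * 2,
})

# Step 1: number of distinct dates and items per buyer
counts = df.groupby("buyer_id")[["date", "item_id"]].nunique()

# Step 2: keep buyers with more than one distinct value in both columns
buyers = counts.index[(counts > 1).all(axis=1)]

# Fill the flag on the original rows
df["flag"] = df["buyer_id"].isin(buyers).astype(int)

print(counts)
print(df)
```

Here counts plays the role of the aggregated table above, holding integers rather than arrays, so the comparison with 1 replaces the isinstance test.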


 