Pandas Column based on values in other columns

Basically, I would like to fill in the column Discount_Sub_Dpt with 'Yes' or 'No' depending on whether there is a Discount for that Sub_Dpt in that week, EXCLUDING the product on the row itself. For instance, I don't want any of the A rows to consider whether A is discounted that week, only whether the other products in that sub-department are (in most cases there is more than one other product).

I have tried using groupby with Sub_Dpt and Week to no avail.

Does anyone know how to solve this issue?

The yellow column is the desired outcome from the code.

CSV example

Here is some of the code I have tried. I am creating the column first and then updating the values (it could all be wrong). I intentionally named the data frame df1:

df1['Discount_SubDpt'] = np.where(df1['Discount'] == 'Yes', 'Yes', 'No')

for x in df1['Sub_Dpt'].unique():
    # weeks in which this sub-department has at least one discounted product
    yes_weeks = df1.loc[(df1['Discount_SubDpt'] == 'Yes') & (df1['Sub_Dpt'] == x), 'Week'].unique()
    # note: this still counts the row's own product, which is what I want to exclude
    df1.loc[df1['Week'].isin(yes_weeks) & (df1['Sub_Dpt'] == x), 'Discount_SubDpt'] = 'Yes'

Okay, this might not scale well, but should be easy to read.

import pandas as pd

df1 = pd.DataFrame(data=[
    ['A', 1, 'Toys', 'Yes'],
    ['A', 2, 'Toys', 'No'],
    ['A', 3, 'Toys', 'No'],
    ['A', 4, 'Toys', 'Yes'],
    ['B', 1, 'Toys', 'No'],
    ['B', 2, 'Toys', 'Yes'],
    ['B', 3, 'Toys', 'No'],
    ['B', 4, 'Toys', 'Yes'],
    ['C', 1, 'Candy', 'No'],
    ['C', 2, 'Candy', 'No'],
    ['C', 3, 'Candy', 'Yes'],
    ['C', 4, 'Candy', 'Yes'],
    ['D', 1, 'Candy', 'No'],
    ['D', 2, 'Candy', 'No'],
    ['D', 3, 'Candy', 'No'],
    ['D', 4, 'Candy', 'No'],
], columns=['Product', 'Week', 'Sub_Dpt', 'Discount'])

# Index by product, week and sub-department for the per-row lookup below
df2 = df1.set_index(['Product', 'Week', 'Sub_Dpt'])
products = df1.Product.unique()
# 'Yes' if any other product in the same week and sub-department has a discount
df1['Discount_SubDpt'] = df1.apply(lambda x: 'Yes' if 'Yes' in df2.loc[(list(products[products != x['Product']]), x['Week'], x['Sub_Dpt']), 'Discount'].tolist() else 'No', axis=1)

The first step creates a MultiIndex DataFrame.

Next, we get the list of all products.

Then, for each row, we look up the rows with the same week and sub-department, leaving out the row's own product.

If any of those rows has a discount, we set 'Yes'; otherwise 'No'.

Edit 1:

If you don't want to create another DataFrame (this saves memory, but will be a bit slower):

df1['Discount_SubDpt'] = df1.apply(lambda x: 'Yes' if 'Yes' in df1.loc[(df1['Product'] != x['Product']) & (df1['Week'] == x['Week']) & (df1['Sub_Dpt'] == x['Sub_Dpt']), 'Discount'].tolist() else 'No', axis=1)

Ok, the following is a bit crazy, but it works pretty nicely, so listen up.

First, we are going to build a NetworkX graph as follows (df here is the data frame from the question, which is named df1 there).

import networkx as nx
import numpy as np
import pandas as pd
G = nx.Graph()
Prods = df.Product.unique()
G.add_nodes_from(Prods)

We now add edges between our nodes (which are all of the products) whenever they belong to the same sub_dpt. In this case, since A and B share a department, and so do C and D, we add edges AB and CD. If we had A, B and C in the same department, we would add AB, AC and BC. Confusing, I know, but just trust me on this one.

G.add_edges_from([('A','B'),('C','D')])
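If you would rather not type the edges by hand, they can probably be derived from the data. The following is only a sketch of one way to do it, using itertools.combinations per sub-department. The small frame and the extra product 'E' are made up for illustration and are not part of the question's data.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Hypothetical stand-in for the question's data; 'E' is an invented third Toys product
df = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Sub_Dpt': ['Toys', 'Toys', 'Candy', 'Candy', 'Toys'],
})

G = nx.Graph()
G.add_nodes_from(df['Product'].unique().tolist())

# For every sub-department, connect each pair of distinct products in it
for _, prods in df.groupby('Sub_Dpt')['Product']:
    G.add_edges_from(combinations(prods.unique().tolist(), 2))

# Normalise the edge tuples so that the printed order is stable
edges = sorted(tuple(sorted(e)) for e in G.edges())
print(edges)
```

With three Toys products this adds AB, AE and BE, which is the "all pairs within a department" rule described above.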

Now comes the fun part. We need to convert your Discount column from Yes/No to 1/0.

df['Disc2']=np.nan
df.loc[df['Discount']=='Yes','Disc2']=1
df.loc[df['Discount']=='No','Disc2']=0

Now we pivot the data

tab = df.pivot(index = 'Week',columns='Product',values = 'Disc2')

And now, we do this

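# Multiply the week-by-product discount table by the product adjacency matrix:
# each cell becomes the number of discounted products sharing the sub_dpt
adj = nx.adjacency_matrix(G, nodelist=list(Prods)).toarray()
tab = pd.DataFrame(np.dot(tab[Prods], adj), columns=Prods, index=tab.index.to_numpy())
df = df.merge(tab.unstack().reset_index(), left_on=['Product','Week'], right_on=['level_0','level_1'])
df['Discount_Sub_Dpt'] = np.where(df[0] > 0, 'Yes', 'No')
print(df[['Product','Week','Sub_Dpt','Discount','Discount_Sub_Dpt']])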
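To see why the matrix product does the job, here is a small worked sketch with plain NumPy arrays. The 0/1 table is typed in by hand from the sample data used in the other answers (A and B in Toys, C and D in Candy), and the adjacency matrix is written out for the edges AB and CD, so treat the numbers as an illustration rather than as output of the code above.

```python
import numpy as np

# Rows are weeks 1..4, columns are products A, B, C, D;
# 1 means the product is discounted that week (taken from the sample data)
disc = np.array([
    [1, 0, 0, 0],  # week 1
    [0, 1, 0, 0],  # week 2
    [0, 0, 1, 0],  # week 3
    [1, 1, 1, 0],  # week 4
])

# Adjacency matrix for the edges AB and CD; the zero diagonal is what
# keeps a product from counting its own discount
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])

# others[w, p] = number of discounted products that share a sub_dpt with p in week w
others = disc @ adj
flags = np.where(others > 0, 'Yes', 'No')
print(others)
print(flags)
```

In week 4, for example, the row comes out as 1, 1, 0, 1: A and B each see the other's discount, D sees C's, and C sees nothing because D is never discounted.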

You may ask, why go through this trouble? Well, two reasons. First, it's far more stable. The other answers can't handle all possible cases of your problem. Second, it's much faster than the other solutions. I hope this helped!

You can perform a GroupBy to map ('Week', 'Sub_Dpt') to lists of 'Product' only when Discount is "Yes".

Then use a list comprehension to check if any are on Discount apart from the product in question. Finally, map a Boolean series result to "Yes" / "No".

Data from @SahilPuri.

# GroupBy only when Discount == Yes
g = df1[df1['Discount'] == 'Yes'].groupby(['Week', 'Sub_Dpt'])['Product'].unique()

# calculate index by row
idx = df1.set_index(['Week', 'Sub_Dpt']).index

# construct list of Booleans according to criteria
L = [any(x for x in g.get(i, []) if x!=j) for i, j in zip(idx, df1['Product'])]

# map Boolean to strings
df1['Discount_SubDpt'] = pd.Series(L).map({True: 'Yes', False: 'No'})

print(df1)

   Product  Week Sub_Dpt Discount Discount_SubDpt
0        A     1    Toys      Yes              No
1        A     2    Toys       No             Yes
2        A     3    Toys       No              No
3        A     4    Toys      Yes             Yes
4        B     1    Toys       No             Yes
5        B     2    Toys      Yes              No
6        B     3    Toys       No              No
7        B     4    Toys      Yes             Yes
8        C     1   Candy       No              No
9        C     2   Candy       No              No
10       C     3   Candy      Yes              No
11       C     4   Candy      Yes              No
12       D     1   Candy       No              No
13       D     2   Candy       No              No
14       D     3   Candy       No             Yes
15       D     4   Candy       No             Yes

It's late, but here's a go. I used the sample df in the comments above.

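import numpy as np
import pandas as pd

# 1 where the row's own product is discounted, else 0
df1['dis'] = df1['Discount'].apply(lambda x: 1 if x == "Yes" else 0)
# number of discounted products per sub-department and week
df2 = df1.groupby(['Sub_Dpt', 'Week'], as_index=False)['dis'].sum()
df3 = pd.merge(df1, df2, on=['Sub_Dpt', 'Week'])
# another product is discounted when the group total exceeds the row's own flag
df3['Discount_Sub_Dpt'] = np.where(df3['dis_x'] < df3['dis_y'], 'Yes', 'No')
df3.sort_values(by=['Product'], inplace=True)
df3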
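The same count-and-compare idea can probably be written without the merge by using groupby with transform. This is only a sketch, and it swaps in a different pandas call than the answer above uses. It rebuilds the sample data so it runs on its own, and it assumes each product appears at most once per week and sub-department.

```python
import numpy as np
import pandas as pd

# Sample data from the answers above, rebuilt so the sketch is self-contained
df1 = pd.DataFrame({
    'Product': list('AAAABBBBCCCCDDDD'),
    'Week': [1, 2, 3, 4] * 4,
    'Sub_Dpt': ['Toys'] * 8 + ['Candy'] * 8,
    'Discount': ['Yes', 'No', 'No', 'Yes',
                 'No', 'Yes', 'No', 'Yes',
                 'No', 'No', 'Yes', 'Yes',
                 'No', 'No', 'No', 'No'],
})

# 1 where the row's own product is discounted, else 0
own = (df1['Discount'] == 'Yes').astype(int)

# Number of discounted products in the row's sub-department and week,
# broadcast back to every row of the group
total = own.groupby([df1['Sub_Dpt'], df1['Week']]).transform('sum')

# Some other product is discounted when the group total exceeds the row's own flag
df1['Discount_Sub_Dpt'] = np.where(total > own, 'Yes', 'No')
print(df1)
```

Because transform returns a result aligned with the original rows, the row order of df1 is kept and no suffixed columns such as dis_x and dis_y appear.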
