Replacing values in a data frame based on a condition in other columns

I have a pandas data frame like this one:

dx1      dx2    dx3     dx4     dxpoa1  dxpoa2  dxpoa3  dxpoa4
25041   40391                   Y       E       
25041   40391   25081           N       W       U       
25041   40391   42822   99681   1       N       Y       Y 

There are two sets of columns: dx and dxpoa. Depending on the value in dxpoa, I have to either keep the value in dx or discard it. For each value in dx there is a value in the corresponding dxpoa column of that row. For example: if dxpoa is one of 'Y', 'W', '1' or 'E', keep the dx value in that row; otherwise discard it or fill it with 0. In the first row dxpoa1 is 'Y', so dx1 stays as it is. In the second row dxpoa1 is 'N', so dx1 in that row becomes 0.

Given a dataframe built like so:

import pandas as pd
import numpy as np
df = pd.DataFrame({'dx1':[25041,25041,25041],
                   'dx2':[40391,40391,40391],
                   'dx3':[np.nan,25081,42822],
                   'dx4':[np.nan,np.nan,99681],
                   'dxpoa1':['Y','N','1'],
                   'dxpoa2':['E','W','N'],
                   'dxpoa3':[np.nan,'U','Y'],
                   'dxpoa4':[np.nan,np.nan,'Y']})

Which gives:

    dx1     dx2     dx3     dx4    dxpoa1   dxpoa2  dxpoa3  dxpoa4
0   25041   40391   NaN     NaN     Y       E       NaN     NaN
1   25041   40391   25081   NaN     N       W       U       NaN
2   25041   40391   42822   99681   1       N       Y       Y

Define a function that implements your substitution rules. It replaces the target column with zero when the value in the reference column is not 'Y', 'W', '1' or 'E', as I understood from your description:

def subfunc(row, col_reference=None, col_target=None):
    # Zero the target value unless the reference value is one of the kept codes.
    if row[col_reference] not in ['Y', 'W', '1', 'E']:
        row[col_target] = 0
    return row

Then iterate over the column names applying subfunc over each row:

for colname in df.columns:
    if 'dxpoa' in colname:
        colid = colname.split('dxpoa')[1]
        df = df.apply(subfunc,axis=1,col_reference=colname,col_target='dx'+colid)

This results in the dataframe:

    dx1     dx2     dx3     dx4     dxpoa1  dxpoa2  dxpoa3  dxpoa4
0   25041   40391   0       0       Y       E       NaN     NaN
1   0       40391   0       0       N       W       U       NaN
2   25041   0       42822   99681   1       N       Y       Y
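
For what it's worth, the same pairing of each dxpoaN column with its dxN column can probably be done without a row-wise apply, which tends to be slow on large frames. The following is only a sketch under the assumption that every dxpoaN column has a matching dxN column; the name keep_codes is introduced here for illustration and is not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dx1': [25041, 25041, 25041],
                   'dx2': [40391, 40391, 40391],
                   'dx3': [np.nan, 25081, 42822],
                   'dx4': [np.nan, np.nan, 99681],
                   'dxpoa1': ['Y', 'N', '1'],
                   'dxpoa2': ['E', 'W', 'N'],
                   'dxpoa3': [np.nan, 'U', 'Y'],
                   'dxpoa4': [np.nan, np.nan, 'Y']})

# Hypothetical name: the dxpoa codes for which the dx value is kept.
keep_codes = ['Y', 'W', '1', 'E']

# Iterate over a copy of the column names, one dxpoa column at a time.
for colname in list(df.columns):
    if colname.startswith('dxpoa'):
        colid = colname[len('dxpoa'):]   # e.g. '1' for 'dxpoa1'
        target = 'dx' + colid            # assumed to exist in df
        # Series.where keeps values where the mask is True and uses 0 elsewhere.
        df[target] = df[target].where(df[colname].isin(keep_codes), 0)
```

This should give the same dx values as the table above, but it works on whole columns at a time instead of calling a Python function once per row.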

Here's a vectorized way of looking at it (using @vmg's handy starting frame):

>>> N = len(df.columns)
>>> keep = df.iloc[:,-N//2:].isin(["Y", "W", "1", "E"]).values
>>> df.iloc[:,:N//2] = df.iloc[:,:N//2].where(keep, 0)
>>> df
     dx1    dx2    dx3    dx4 dxpoa1 dxpoa2 dxpoa3 dxpoa4
0  25041  40391      0      0      Y      E    NaN    NaN
1      0  40391      0      0      N      W      U    NaN
2  25041      0  42822  99681      1      N      Y      Y

This builds a boolean array for the last N//2 columns, with True where the value is in the list and False where it is not (note that I'm assuming 1 is the string "1" and not the integer 1):

>>> df.iloc[:,-N//2:]
  dxpoa1 dxpoa2 dxpoa3 dxpoa4
0      Y      E    NaN    NaN
1      N      W      U    NaN
2      1      N      Y      Y
>>> df.iloc[:,-N//2:].isin(["Y", "W", "1", "E"])
  dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   True   True  False  False
1  False   True  False  False
2   True  False   True   True
>>> df.iloc[:,-N//2:].isin(["Y", "W", "1", "E"]).values
array([[ True,  True, False, False],
       [False,  True, False, False],
       [ True, False,  True,  True]], dtype=bool)
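
On the string "1" assumption: if the dxpoa columns might hold the integer 1 in some rows, isin with the string "1" will not match it. A small check along these lines illustrates the difference; the Series codes is made-up data for illustration, and converting with astype(str) is just one possible way to normalise, not something the question specifies:

```python
import pandas as pd

# Made-up sample mixing the string '1' with the integer 1.
codes = pd.Series(['1', 1, 'Y', 'N'], dtype=object)

# The integer 1 is not equal to the string '1', so it is not matched.
mask = codes.isin(['Y', 'W', '1', 'E'])

# One option: convert everything to strings before testing membership.
# (NaN would become the string 'nan', which is not in the list either.)
mask_str = codes.astype(str).isin(['Y', 'W', '1', 'E'])
```

Here mask is False for the integer entry while mask_str is True for it, so it is worth checking the column dtypes before relying on the match.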

Then we can use where to set the value of the first N//2 columns, keeping the values where keep is True and otherwise replacing them with 0.
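
One caveat with the positional slicing is that it relies on the dx columns forming the first half of the frame and the dxpoa columns the second half, in the same order. If that can't be guaranteed, a variant that selects the columns by name might be safer. This is a sketch only; poa_cols and dx_cols are names made up here, and it assumes each dxpoaN column has a dxN partner:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dx1': [25041, 25041, 25041],
                   'dx2': [40391, 40391, 40391],
                   'dx3': [np.nan, 25081, 42822],
                   'dx4': [np.nan, np.nan, 99681],
                   'dxpoa1': ['Y', 'N', '1'],
                   'dxpoa2': ['E', 'W', 'N'],
                   'dxpoa3': [np.nan, 'U', 'Y'],
                   'dxpoa4': [np.nan, np.nan, 'Y']})

# Hypothetical helper lists: pair columns by their numeric suffix.
poa_cols = [c for c in df.columns if c.startswith('dxpoa')]
dx_cols = ['dx' + c[len('dxpoa'):] for c in poa_cols]

# Boolean array, True where the dxpoa code means "keep the dx value".
keep = df[poa_cols].isin(['Y', 'W', '1', 'E']).to_numpy()

# Keep dx values where keep is True, otherwise replace them with 0.
df[dx_cols] = df[dx_cols].where(keep, 0)
```

The logic is the same as above; only the column selection differs, so extra or reordered columns in the frame should not affect the result.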
