
pandas: create new columns based on the values of other columns in the same row

I have the following df:

days    days_1    days_2    period    percent_1   percent_2    amount
3       5         4         1         0.2         0.1         100
2       1         3         4         0.3         0.1         500
9       8         10        6         0.4         0.2         600
10      7         8         11        0.5         0.3         700
10      5         6         7         0.7         0.4         800        

I am trying to create two new columns, amount_missed and days_missed, based on the values of the other columns in the same row. The code looks like this:

# init the two columns  
df['amount_missed'] = 0.0
df['days_missed'] = 0
# iter through each row to get values for the new columns 
# based on the other columns in the df
for row in df.itertuples():
    if getattr(row, 'days') < getattr(row, 'days_1'):
        df.loc[getattr(row, 'Index'), 'amount_missed'] = 0
        df.loc[getattr(row, 'Index'), 'days_missed'] = 0
    elif getattr(row, 'days_2') < getattr(row, 'days') < getattr(row, 'period') \
      or getattr(row, 'days') > getattr(row, 'period'):    
        missed_percent = getattr(row, 'percent_2')
        df.loc[getattr(row, 'Index'), 'amount_missed'] = getattr(row, 'amount') \
                                                      * (missed_percent / 100)
        df.loc[getattr(row, 'Index'), 'days_missed'] = getattr(row, 'days') \
                                                     - getattr(row, 'days_2')
    else:
        df.loc[getattr(row, 'Index'), 'amount_missed'] = 0
        df.loc[getattr(row, 'Index'), 'days_missed'] = 0

I am wondering if there is a more concise and efficient way to do this in pandas/numpy.

UPDATE: the resulting df looks like this:

{'amount': {0: 100, 1: 500, 2: 600, 3: 700, 4: 800},
 'amount_missed': {0: 0.0, 1: 0.0, 2: 1.2, 3: 2.1, 4: 3.2},
 'days': {0: 3, 1: 2, 2: 9, 3: 10, 4: 10},
 'days_1': {0: 5, 1: 1, 2: 8, 3: 7, 4: 5},
 'days_2': {0: 4, 1: 3, 2: 10, 3: 8, 4: 6},
 'days_missed': {0: 0, 1: 0, 2: -1, 3: 2, 4: 4},
 'percent_1': {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.5, 4: 0.7},
 'percent_2': {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4},
 'period': {0: 1, 1: 4, 2: 6, 3: 11, 4: 7}}

I could not format the df properly on Stack Overflow, so I had to dump it with to_dict.
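For anyone reproducing this, the dictionary above should load straight back into a DataFrame through the pd.DataFrame constructor, since a dict of {column: {index: value}} is one of the inputs it accepts. A minimal sketch; the name expected is chosen here for illustration and is not part of the original post:

```python
import pandas as pd

# The to_dict() output from above, keyed as {column: {row index: value}}.
expected = pd.DataFrame({
    'amount': {0: 100, 1: 500, 2: 600, 3: 700, 4: 800},
    'amount_missed': {0: 0.0, 1: 0.0, 2: 1.2, 3: 2.1, 4: 3.2},
    'days': {0: 3, 1: 2, 2: 9, 3: 10, 4: 10},
    'days_1': {0: 5, 1: 1, 2: 8, 3: 7, 4: 5},
    'days_2': {0: 4, 1: 3, 2: 10, 3: 8, 4: 6},
    'days_missed': {0: 0, 1: 0, 2: -1, 3: 2, 4: 4},
    'percent_1': {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.5, 4: 0.7},
    'percent_2': {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4},
    'period': {0: 1, 1: 4, 2: 6, 3: 11, 4: 7},
})

# Outer keys become columns, inner keys become the row index.
print(expected)
```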

UPDATE 2: following DYZ's and Anton's answers, suppose there is one more case to consider for each row, which makes the original code look like this:

for row in df.itertuples():
    if getattr(row, 'days') < getattr(row, 'days_1'):
        df.loc[getattr(row, 'Index'), 'amount_missed'] = 0
        df.loc[getattr(row, 'Index'), 'days_missed'] = 0
    elif getattr(row, 'days_1') < getattr(row, 'days') < getattr(row, 'days_2'):
        missed_percent = getattr(row,'percent_1') - getattr(row,'percent_2')
        df.loc[getattr(row, 'Index'), 'amount_missed'] = getattr(row, 'amount') * (missed_percent / 100)
        df.loc[getattr(row, 'Index'), 'days_missed'] = getattr(row, 'days') - getattr(row, 'days_1')    
    elif getattr(row, 'days_2') < getattr(row, 'days') < getattr(row, 'period') \
      or getattr(row, 'days') > getattr(row, 'period'):    
        missed_percent = getattr(row, 'percent_2')
        df.loc[getattr(row, 'Index'), 'amount_missed'] = getattr(row, 'amount') \
                                                  * (missed_percent / 100)
        df.loc[getattr(row, 'Index'), 'days_missed'] = getattr(row, 'days') \
                                                 - getattr(row, 'days_2')
    else:
        df.loc[getattr(row, 'Index'), 'amount_missed'] = 0
        df.loc[getattr(row, 'Index'), 'days_missed'] = 0    

Using the answer suggested below, can I make it look like the following?

cond1 = df['days_2'] < df['days']
cond2 = df['days'] < df['period']
cond3 = df['days'] > df['period']
cond4 = df['days'] >= df['days_1'] # The negation of df['days'] < df['days_1']
cond5 = df['days'] < df['days_2']
cond6 = df['days'] > df['days_1']

mask = ((cond1 & cond2) | cond3) & cond4
mask2 = cond5 & cond6

df['amount_missed'] = np.where(mask, df['amount'] * df['percent_2'] / 100, 0.0)
df['amount_missed'] = np.where(mask2, df['amount'] * (df['percent_1'] - df['percent_2']) / 100, 0.0)

df['days_missed'] = np.where(mask, df['days'] - df['days_2'], 0)
df['days_missed'] = np.where(mask2, df['days'] -df['days_1'], 0)

Here's a direct translation of your code into idiomatic pandas. In general, avoid looping over the rows of a DataFrame when a vectorized expression can do the job.

# These rows are affected by the calculations
affected = (
    ((df['days_2'] < df['days']) & (df['days'] < df['period']))
    | (df['days'] > df['period'])
) & (df['days'] >= df['days_1'])  # The negation of df['days'] < df['days_1']

# Explicitly insert non-zero calculated fields
df.loc[affected, 'amount_missed'] = df['amount'] * df['percent_2'] / 100
df.loc[affected, 'days_missed'] = df['days'] - df['days_2']

# Insert the missing zeros
df.fillna(0, inplace=True)
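One detail worth checking with the approach above: if amount_missed and days_missed do not exist yet, assigning through df.loc with a boolean mask creates them with NaN in the unaffected rows, so days_missed ends up as a float column even after the zeros are filled in. The sketch below rebuilds the sample frame, limits fillna to the two new columns (an assumption made here so that NaN values elsewhere in a larger frame would be left alone), and casts days_missed back to integers. The name new_cols is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'days': [3, 2, 9, 10, 10],
    'days_1': [5, 1, 8, 7, 5],
    'days_2': [4, 3, 10, 8, 6],
    'period': [1, 4, 6, 11, 7],
    'percent_1': [0.2, 0.3, 0.4, 0.5, 0.7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount': [100, 500, 600, 700, 800],
})

# Rows that reach the second branch of the original loop.
affected = (
    ((df['days_2'] < df['days']) & (df['days'] < df['period']))
    | (df['days'] > df['period'])
) & (df['days'] >= df['days_1'])

# The columns do not exist yet, so unaffected rows start out as NaN.
df.loc[affected, 'amount_missed'] = df['amount'] * df['percent_2'] / 100
df.loc[affected, 'days_missed'] = df['days'] - df['days_2']

# Fill only the new columns, then restore an integer dtype for days_missed.
new_cols = ['amount_missed', 'days_missed']
df[new_cols] = df[new_cols].fillna(0)
df['days_missed'] = df['days_missed'].astype(int)

print(df[new_cols])
```

If the two columns were already initialised to 0.0 and 0 as in the question, no NaN values appear and the fillna step has nothing to do.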

Modified version (Anton vbr):

import pandas as pd
import numpy as np
import io

data = '''\
days    days_1    days_2    period    percent_1   percent_2    amount
3       5         4         1         0.2         0.1         100
2       1         3         4         0.3         0.1         500
9       8         10        6         0.4         0.2         600
10      7         8         11        0.5         0.3         700
10      5         6         7         0.7         0.4         800'''

df = pd.read_csv(io.StringIO(data), sep=r'\s+')

cond1 = df['days_2'] < df['days']
cond2 = df['days'] < df['period']
cond3 = df['days'] > df['period']
cond4 = df['days'] >= df['days_1'] # The negation of df['days'] < df['days_1']

mask = ((cond1 & cond2) | cond3) & cond4

df['amount_missed'] = np.where(mask, df['amount'] * df['percent_2'] / 100, 0.0)
df['days_missed'] = np.where(mask, df['days'] - df['days_2'], 0)
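Regarding UPDATE 2: calling np.where twice on the same column, as in the attempt in the question, lets the second call overwrite everything the first one wrote, so rows matched only by mask fall back to 0. One possible way to handle several mutually exclusive branches is np.select, which takes a list of conditions and a parallel list of values and uses the first condition that is true for each row, much like an if/elif chain. The following is a sketch under two assumptions: the second branch is meant to write to amount_missed (as the np.where attempt does), and the names before_first, between and after_second are made up here for readability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'days': [3, 2, 9, 10, 10],
    'days_1': [5, 1, 8, 7, 5],
    'days_2': [4, 3, 10, 8, 6],
    'period': [1, 4, 6, 11, 7],
    'percent_1': [0.2, 0.3, 0.4, 0.5, 0.7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount': [100, 500, 600, 700, 800],
})

# Conditions listed in the same order as the if/elif branches of the loop;
# np.select picks the first one that holds for each row.
before_first = df['days'] < df['days_1']
between = (df['days_1'] < df['days']) & (df['days'] < df['days_2'])
after_second = (
    ((df['days_2'] < df['days']) & (df['days'] < df['period']))
    | (df['days'] > df['period'])
)
conditions = [before_first, between, after_second]

# One value per condition, in matching order.
amount_choices = [
    0.0,
    df['amount'] * (df['percent_1'] - df['percent_2']) / 100,
    df['amount'] * df['percent_2'] / 100,
]
days_choices = [
    0,
    df['days'] - df['days_1'],
    df['days'] - df['days_2'],
]

# Rows matching none of the conditions get the default (the else branch).
df['amount_missed'] = np.select(conditions, amount_choices, default=0.0)
df['days_missed'] = np.select(conditions, days_choices, default=0)

print(df[['amount_missed', 'days_missed']])
```

Because the conditions are checked in order, there is no need to add the negation of the earlier branches to the later ones, which is what cond4 does in the single-mask version.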
