pandas creates new columns based on the values of the other columns in the same row
I have the following df,
days days_1 days_2 period percent_1 percent_2 amount
3 5 4 1 0.2 0.1 100
2 1 3 4 0.3 0.1 500
9 8 10 6 0.4 0.2 600
10 7 8 11 0.5 0.3 700
10 5 6 7 0.7 0.4 800
I am trying to create two new columns, amount_missed and days_missed, based on the values of the other columns in the same row, with the following code:
# init the two columns
df['amount_missed'] = 0.0
df['days_missed'] = 0
# iter through each row to get values for the new columns
# based on the other columns in the df
for row in df.itertuples():
    if getattr(row, 'days') < getattr(row, 'days_1'):
        df.loc[getattr(row, 'Index'), 'amount_missed'] = 0
        df.loc[getattr(row, 'Index'), 'days_missed'] = 0
    elif getattr(row, 'days_2') < getattr(row, 'days') < getattr(row, 'period') \
            or getattr(row, 'days') > getattr(row, 'period'):
        missed_percent = getattr(row, 'percent_2')
        df.loc[getattr(row, 'Index'), 'amount_missed'] = getattr(row, 'amount') \
            * (missed_percent / 100)
        df.loc[getattr(row, 'Index'), 'days_missed'] = getattr(row, 'days') \
            - getattr(row, 'days_2')
    else:
        df.loc[getattr(row, 'Index'), 'amount_missed'] = 0
        df.loc[getattr(row, 'Index'), 'days_missed'] = 0
I would like to know whether there is a more concise and efficient way to do this in pandas / numpy.
UPDATE: the resulting df looks like,
{'amount': {0: 100, 1: 500, 2: 600, 3: 700, 4: 800},
'amount_missed': {0: 0.0, 1: 0.0, 2: 1.2, 3: 2.1, 4: 3.2},
'days': {0: 3, 1: 2, 2: 9, 3: 10, 4: 10},
'days_1': {0: 5, 1: 1, 2: 8, 3: 7, 4: 5},
'days_2': {0: 4, 1: 3, 2: 10, 3: 8, 4: 6},
'days_missed': {0: 0, 1: 0, 2: -1, 3: 2, 4: 4},
'percent_1': {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.5, 4: 0.7},
'percent_2': {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4},
'period': {0: 1, 1: 4, 2: 6, 3: 11, 4: 7}}
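I could not format the df properly on stackoverflow, so I had to use to_dict.
UPDATE 2, based on the answers from DYZ and Anton: if there is one more case to consider for each row, the original code would look like this,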
for row in df.itertuples():
    if getattr(row, 'days') < getattr(row, 'days_1'):
        df.loc[getattr(row, 'Index'), 'amount_missed'] = 0
        df.loc[getattr(row, 'Index'), 'days_missed'] = 0
    elif getattr(row, 'days_1') < getattr(row, 'days') < getattr(row, 'days_2'):
        missed_percent = getattr(row, 'percent_1') - getattr(row, 'percent_2')
        df.loc[getattr(row, 'Index'), 'amount_missed'] = getattr(row, 'amount') * (missed_percent / 100)
        df.loc[getattr(row, 'Index'), 'days_missed'] = getattr(row, 'days') - getattr(row, 'days_1')
    elif getattr(row, 'days_2') < getattr(row, 'days') < getattr(row, 'period') \
            or getattr(row, 'days') > getattr(row, 'period'):
        missed_percent = getattr(row, 'percent_2')
        df.loc[getattr(row, 'Index'), 'amount_missed'] = getattr(row, 'amount') \
            * (missed_percent / 100)
        df.loc[getattr(row, 'Index'), 'days_missed'] = getattr(row, 'days') \
            - getattr(row, 'days_2')
    else:
        df.loc[getattr(row, 'Index'), 'amount_missed'] = 0
        df.loc[getattr(row, 'Index'), 'days_missed'] = 0
Using the answers suggested below, could I make it look like the following?
cond1 = df['days_2'] < df['days']
cond2 = df['days'] < df['period']
cond3 = df['days'] > df['period']
cond4 = df['days'] >= df['days_1'] # The negation of df['days'] < df['days_1']
cond5 = df['days'] < df['days_2']
cond6 = df['days'] > df['days_1']
mask = ((cond1 & cond2) | cond3) & cond4
mask2 = cond5 & cond6
df['amount_missed'] = np.where(mask, df['amount'] * df['percent_2'] / 100, 0.0)
df['amount_missed'] = np.where(mask2, df['amount'] * (df['percent_1'] - df['percent_2']) / 100, 0.0)
df['days_missed'] = np.where(mask, df['days'] - df['days_2'], 0)
df['days_missed'] = np.where(mask2, df['days'] -df['days_1'], 0)
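One thing to watch in the attempt above: each column is assigned twice, so the second np.where call replaces the result of the first rather than combining with it. A possible way to keep all branches, offered only as a sketch, is np.select, which takes the first condition that is True for each row and so mirrors the if / elif order of the loop. The condition names below are made up for illustration, and the sketch assumes the middle branch is meant to write to amount_missed, as the np.where attempt does.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'days':      [3, 2, 9, 10, 10],
    'days_1':    [5, 1, 8, 7, 5],
    'days_2':    [4, 3, 10, 8, 6],
    'period':    [1, 4, 6, 11, 7],
    'percent_1': [0.2, 0.3, 0.4, 0.5, 0.7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount':    [100, 500, 600, 700, 800],
})

# Conditions in the same order as the if / elif branches of the loop.
# np.select uses the first condition that is True for each row.
before_days_1 = df['days'] < df['days_1']
between_1_and_2 = (df['days_1'] < df['days']) & (df['days'] < df['days_2'])
after_days_2 = (((df['days_2'] < df['days']) & (df['days'] < df['period']))
                | (df['days'] > df['period']))
conditions = [before_days_1, between_1_and_2, after_days_2]

# One choice per condition; rows matching no condition get the default.
df['amount_missed'] = np.select(
    conditions,
    [0.0,
     df['amount'] * (df['percent_1'] - df['percent_2']) / 100,
     df['amount'] * df['percent_2'] / 100],
    default=0.0,
)
df['days_missed'] = np.select(
    conditions,
    [0, df['days'] - df['days_1'], df['days'] - df['days_2']],
    default=0,
)
```

With the sample data, rows 1 and 2 fall into the new middle branch, so their values differ from the two-branch result shown earlier in the question.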
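Here is a direct translation of your code into proper Pandas. As a rule, avoid looping over the rows of a dataframe when a vectorized expression will do.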
# These rows are affected by the calculations
affected = (
    (((df['days_2'] < df['days']) & (df['days'] < df['period']))
     | (df['days'] > df['period']))
    & (df['days'] >= df['days_1'])  # The negation of df['days'] < df['days_1']
)
# Explicitly insert non-zero calculated fields
df.loc[affected, 'amount_missed'] = df['amount'] * df['percent_2'] / 100
df.loc[affected, 'days_missed'] = df['days'] - df['days_2']
# Insert the missing zeros
df.fillna(0, inplace=True)
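A side note on the last step: df.fillna(0, inplace=True) fills every NaN in the frame, not only those in the two new columns. If the frame may hold genuine gaps elsewhere, the fill can be limited to the new columns. The sketch below assumes that situation; the column named other is hypothetical and exists only to show that its NaN survives.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'days':      [3, 2, 9, 10, 10],
    'days_1':    [5, 1, 8, 7, 5],
    'days_2':    [4, 3, 10, 8, 6],
    'period':    [1, 4, 6, 11, 7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount':    [100, 500, 600, 700, 800],
    'other':     [np.nan, 1.0, 2.0, 3.0, 4.0],  # hypothetical column with a real gap
})

# Same row selection as in the answer.
affected = (
    (((df['days_2'] < df['days']) & (df['days'] < df['period']))
     | (df['days'] > df['period']))
    & (df['days'] >= df['days_1'])
)

# Unaffected rows are left as NaN by these assignments.
df.loc[affected, 'amount_missed'] = df['amount'] * df['percent_2'] / 100
df.loc[affected, 'days_missed'] = df['days'] - df['days_2']

# Fill zeros only in the two new columns, leaving 'other' untouched.
new_cols = ['amount_missed', 'days_missed']
df[new_cols] = df[new_cols].fillna(0)

# The NaN placeholders made days_missed float; cast back to integers.
df['days_missed'] = df['days_missed'].astype(int)
```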
Modified version (Anton vbr):
import pandas as pd
import numpy as np
import io
data = '''\
days days_1 days_2 period percent_1 percent_2 amount
3 5 4 1 0.2 0.1 100
2 1 3 4 0.3 0.1 500
9 8 10 6 0.4 0.2 600
10 7 8 11 0.5 0.3 700
10 5 6 7 0.7 0.4 800'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
cond1 = df['days_2'] < df['days']
cond2 = df['days'] < df['period']
cond3 = df['days'] > df['period']
cond4 = df['days'] >= df['days_1'] # The negation of df['days'] < df['days_1']
mask = ((cond1 & cond2) | cond3) & cond4
df['amount_missed'] = np.where(mask, df['amount'] * df['percent_2'] / 100, 0.0)
df['days_missed'] = np.where(mask, df['days'] - df['days_2'], 0)
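As a quick sanity check, the np.where version can be compared with the result the question posted via to_dict. The sketch below wraps the same logic in a helper (the name add_missed_columns is made up for illustration) and compares with np.allclose, because the loop and the vectorized form divide by 100 at different points and may differ in the last floating-point digit.

```python
import numpy as np
import pandas as pd

def add_missed_columns(df):
    """Vectorized form of the question's first loop (helper name is hypothetical)."""
    cond1 = df['days_2'] < df['days']
    cond2 = df['days'] < df['period']
    cond3 = df['days'] > df['period']
    cond4 = df['days'] >= df['days_1']  # negation of days < days_1
    mask = ((cond1 & cond2) | cond3) & cond4

    out = df.copy()  # leave the caller's frame untouched
    out['amount_missed'] = np.where(mask, out['amount'] * out['percent_2'] / 100, 0.0)
    out['days_missed'] = np.where(mask, out['days'] - out['days_2'], 0)
    return out

# Input columns taken from the question's sample data.
df = pd.DataFrame({
    'days':      [3, 2, 9, 10, 10],
    'days_1':    [5, 1, 8, 7, 5],
    'days_2':    [4, 3, 10, 8, 6],
    'period':    [1, 4, 6, 11, 7],
    'percent_1': [0.2, 0.3, 0.4, 0.5, 0.7],
    'percent_2': [0.1, 0.1, 0.2, 0.3, 0.4],
    'amount':    [100, 500, 600, 700, 800],
})
result = add_missed_columns(df)

# Values of the two new columns as listed in the question's to_dict output.
expected_amount_missed = [0.0, 0.0, 1.2, 2.1, 3.2]
expected_days_missed = [0, 0, -1, 2, 4]

matches = (np.allclose(result['amount_missed'], expected_amount_missed)
           and result['days_missed'].tolist() == expected_days_missed)
print(matches)
```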