Replacing values in specific DataFrame columns using Python
This is the code that computes the weight of evidence (WoE):
import numpy as np
import pandas as pd

def Weight_of_evidance(df, the_categroical_name, My_target):
    # good is zero, bad is one
    # Keep only the categorical column and the target, side by side.
    df = pd.concat([df[the_categroical_name], My_target], axis = 1)
    # Per category: number of observations and mean of the target.
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    # Drop the duplicated category column produced by the second groupby.
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'Number_of_observation', 'Probation_good_taxPayer']
    df['prop_Number_of_observation'] = df['Number_of_observation'] / df['Number_of_observation'].sum()
    df['N_good'] = df['Probation_good_taxPayer'] * df['Number_of_observation']
    df['n_bad'] = (1 - df['Probation_good_taxPayer']) * df['Number_of_observation']
    df['prop_n_good'] = df['N_good'] / df['N_good'].sum()
    df['prop_of_bad'] = df['n_bad'] / df['n_bad'].sum()
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_of_bad'])
    df['PD'] = df['N_good'] / (df['n_bad'] + df['N_good'])
    df = df.sort_values(['WoE'])
    df = df.reset_index(drop = True)
    #df['diff_Probation_good_taxPayer'] = df['Probation_good_taxPayer'].diff().abs()
    #df['diff_WoE'] = df['WoE'].diff().abs()
    # Information value: summed over categories, then repeated on every row.
    df['IV'] = (df['prop_n_good'] - df['prop_of_bad']) * df['WoE']
    df['IV'] = df['IV'].sum()
    return df
df_BUSINESS_CATEGORY = Weight_of_evidance(df_input, 'BUSINESS_CATEGORY', df_Label)
# We execute the function we defined with the necessary arguments: a dataframe, a string, and a dataframe.
# We store the result in a dataframe.
df_BUSINESS_CATEGORY
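To make the WoE and IV columns concrete, here is a minimal sketch of the same arithmetic on a tiny made-up dataset. The `toy` frame, the `target` column name, and the intermediate variable names are assumptions for illustration only; they are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column and a binary target.
toy = pd.DataFrame({
    "BUSINESS_CATEGORY": ["A", "A", "A", "A", "C", "C", "C", "C"],
    "target":            [1,   1,   1,   0,   1,   0,   0,   0],
})

# Per-category observation count and mean of the target.
stats = toy.groupby("BUSINESS_CATEGORY")["target"].agg(n="count", rate="mean")

# Number of ones and zeros in each category.
ones = stats["rate"] * stats["n"]
zeros = (1 - stats["rate"]) * stats["n"]

# Share of all ones / all zeros that falls into each category.
prop_ones = ones / ones.sum()
prop_zeros = zeros / zeros.sum()

# WoE per category and the information value summed over categories.
stats["WoE"] = np.log(prop_ones / prop_zeros)
iv = ((prop_ones - prop_zeros) * stats["WoE"]).sum()

print(stats)
print("IV:", iv)
```

A category that contains only ones or only zeros would give a zero proportion and therefore an infinite WoE, so real data may need smoothing or merged categories before this step.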
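So now I want to replace every value in BUSINESS_CATEGORY with its value from the WoE column, for example A becomes -0.978021, and so on. At the moment I am doing this with a loop, as in the code below: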
def flag_df_ISIC_4_ARAB(df_input):
    if (df_input['BUSINESS_CATEGORY'] == 'A'):
        return '-0.978021'
    elif (df_input['BUSINESS_CATEGORY'] == 'اB'):
        return '-0.977854'
    elif (df_input['BUSINESS_CATEGORY'] == 'C'):
        return '0.082918'
    elif (df_input['BUSINESS_CATEGORY'] == 'D'):
        return '0.772306'
    elif (df_input['BUSINESS_CATEGORY'] == 'H'):
        return '-0.176700'
    elif (df_input['BUSINESS_CATEGORY'] == 'أخرى'):
        return '0.955446'
    else:
        return '0'

df_input['BUSINESS_CATEGORY'] = df_input.apply(flag_df_ISIC_4_ARAB, axis = 1).astype(str)
Is there another way to substitute the WoE values without using a for loop?
First create a dictionary and pass it to Series.map, then replace the unmatched values with '0':
d = {'A':'-0.978021','اB':'-0.977854', 'C':'0.082918',
'D':'0.772306', 'H': '-0.176700', 'أخرى': '0.955446'}
df_input['BUSINESS_CATEGORY'] = df_input['BUSINESS_CATEGORY'].map(d).fillna('0')
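If the WoE table is already available as a DataFrame, the lookup does not have to be typed by hand. The following is a sketch under assumptions: the two small frames are made-up stand-ins for `df_BUSINESS_CATEGORY` and `df_input`, and `woe_lookup` is a name introduced here. Unlike the dictionary above, it keeps the WoE values as floats instead of strings.

```python
import pandas as pd

# Hypothetical stand-in for the WoE table returned by Weight_of_evidance.
df_BUSINESS_CATEGORY = pd.DataFrame({
    "BUSINESS_CATEGORY": ["A", "C", "D"],
    "WoE": [-0.978021, 0.082918, 0.772306],
})

# Hypothetical input data; "Z" has no entry in the WoE table.
df_input = pd.DataFrame({"BUSINESS_CATEGORY": ["A", "D", "Z", "C"]})

# A Series indexed by category can be passed to Series.map as the lookup.
woe_lookup = df_BUSINESS_CATEGORY.set_index("BUSINESS_CATEGORY")["WoE"]

# Map each category to its WoE; categories without a match become 0.0.
df_input["BUSINESS_CATEGORY"] = (
    df_input["BUSINESS_CATEGORY"].map(woe_lookup).fillna(0.0)
)

print(df_input)
```

Building the lookup from the table keeps the mapping in sync when the WoE values are recomputed. If string values are required, as in the original code, `.astype(str)` can be appended.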