
Changing weights to my index if column is missing

I have a pandas DataFrame with different countries (rows) and 4 indicators (columns): A, B, C and D. Each indicator has a specific weight that I use to calculate the weighted sum, let's say: Weight_A = 0.2, Weight_B = 0.2, Weight_C = 0.4, Weight_D = 0.2

This is the formula for my weighted sum:

df['W_Sum'] = Weight_A*df['A'] + Weight_B*df['B'] + Weight_C*df['C'] + Weight_D*df['D']

However, if a value is NaN for a country (suppose D in this case), I need to change my weighted sum to a plain average of the remaining indicators:

df['W_Sum'] = 0.33*df['A'] + 0.33*df['B'] + 0.33*df['C'] 

If two are missing, then:

df['W_Sum'] = 0.5*df['A'] + 0.5*df['B']

Is there a way to automate this process, given that I don't know in advance which column is going to have a missing value for each country?

Thanks!

You can use np.where for this:

wa = 0.2*df.A + 0.4*df.B + 0.2*df.C
df['new_col'] = np.where(df.isna().any(axis=1), df.mean(axis=1), wa)

df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6], 'C':[7,8,np.nan]})

   A  B    C  
0  1  4  7.0      
1  2  5  8.0      
2  3  6  NaN      

wa = 0.2*df.A + 0.4*df.B + 0.2*df.C
df['new_col'] = np.where(df.isna().any(axis=1), df.mean(axis=1), wa)

   A  B    C  new_col
0  1  4  7.0      3.2
1  2  5  8.0      4.0
2  3  6  NaN      4.5

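np.where selects between the row mean and the weighted average depending on the result of the condition has_nans. (Note that df already contains new_col at this point, so the mean column shown here includes it.)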

df.assign(has_nans = df.isna().any(axis=1), mean=df.mean(axis=1), weighted_av = wa)

   A  B    C  new_col  has_nans  mean  weighted_av
0  1  4  7.0      3.2     False  3.80          3.2
1  2  5  8.0      4.0     False  4.75          4.0
2  3  6  NaN      4.5      True  4.50          NaN
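To tie this back to the four indicators in the question, here is a minimal sketch using the question's weights (0.2, 0.2, 0.4, 0.2). The country labels and the sample values are made up for illustration, and keeping the weights in a pandas Series keyed by column name is my own assumption, not something from the answer above.

```python
import numpy as np
import pandas as pd

# Weights from the question, keyed by column name so they align with the
# DataFrame columns regardless of column order.
weights = pd.Series({"A": 0.2, "B": 0.2, "C": 0.4, "D": 0.2})

# Made-up sample data; the index labels are hypothetical.
df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0],
        "B": [4.0, 5.0, 6.0],
        "C": [7.0, 8.0, np.nan],
        "D": [1.0, np.nan, np.nan],
    },
    index=["country_1", "country_2", "country_3"],
)

indicators = df[weights.index]  # only the indicator columns

# Weighted sum; skipna=False makes the result NaN if any indicator is missing.
weighted = indicators.mul(weights, axis=1).sum(axis=1, skipna=False)

# Plain average of the available indicators (mean skips NaN by default).
plain_mean = indicators.mean(axis=1)

# Rows with a missing indicator get the plain average, the rest the weighted sum.
df["W_Sum"] = np.where(indicators.isna().any(axis=1), plain_mean, weighted)
print(df)
```

Selecting `indicators` explicitly also keeps any extra columns (such as a previously computed result) out of the row mean.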

I was about to write basically the same answer as yatu, but tried to make it a little more efficient.

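import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,np.nan],
                   'D':[1, np.nan, np.nan]})
# one weight per column, in column order
weights = np.array([0.2,0.4,0.2,0.2])

# rows with any NaN get the plain mean, complete rows get the weighted sum
df["w_avg"]= np.where(df.isnull().any(axis=1),
                      df.mean(axis=1),
                      np.dot(df.values, weights))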

The weighted sum is passed straight to np.where, given that there is no point storing something you are not going to use again.
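One caveat: np.where evaluates both of its value arguments for every row before selecting, so the dot product above is still computed for the rows that contain NaN. If you really want to skip that work, a possible sketch (my own variation, not part of the answer above) is to fill everything with the NaN-skipping mean first and then overwrite only the complete rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [4, 5, 6],
                   "C": [7, 8, np.nan],
                   "D": [1, np.nan, np.nan]})
weights = np.array([0.2, 0.4, 0.2, 0.2])

values = df.to_numpy(dtype=float)

# True for rows without any missing value.
complete = ~np.isnan(values).any(axis=1)

# Start from the plain average of the available values in each row
# (assumes no row is entirely NaN).
w_avg = np.nanmean(values, axis=1)

# Overwrite only the complete rows with the weighted sum.
w_avg[complete] = values[complete] @ weights

df["w_avg"] = w_avg
print(df)
```

For a small frame the difference is unlikely to matter; this is mostly useful when most rows are incomplete.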

On a dummy DataFrame, using np.dot instead of calculating wa manually is better both in terms of speed and in terms of generalization to any number of columns:

n = 5000
df = pd.DataFrame({"A":np.random.rand(n),
                   "B": np.random.rand(n),
                   "C":np.random.rand(n),
                   "D":np.random.rand(n)})

%%timeit
wa = 0.2*df.A + 0.4*df.B + 0.2*df.C + 0.2* df.D
735 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%%timeit
wa = np.dot(df.values, weights)
18.9 µs ± 732 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
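The %%timeit lines above only work in IPython/Jupyter. If you want to repeat the comparison in a plain script, something along these lines should do, using the standard timeit module. The helper names manual and dotted and the repetition count are my own choices, and the absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the run is reproducible
n = 5000
df = pd.DataFrame(rng.random((n, 4)), columns=list("ABCD"))
weights = np.array([0.2, 0.4, 0.2, 0.2])


def manual():
    # column-by-column weighted sum, as in the first answer
    return 0.2 * df.A + 0.4 * df.B + 0.2 * df.C + 0.2 * df.D


def dotted():
    # single matrix-vector product over the underlying NumPy array
    return np.dot(df.values, weights)


# Both approaches should give the same numbers up to floating-point error.
same = np.allclose(manual().to_numpy(), dotted())

t_manual = timeit.timeit(manual, number=200)
t_dot = timeit.timeit(dotted, number=200)
print(f"results match: {same}")
print(f"manual: {t_manual / 200 * 1e6:.1f} µs per call")
print(f"np.dot: {t_dot / 200 * 1e6:.1f} µs per call")
```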
