get list of percentages to equal 100

Question

I have data that shows what weight of an ETF is held in each country. The issue is that the data source has minor discrepancies in the weighting. For example, for ETF VTI the sum of all percentages (USA + Canada) is 1.0206, which means the total is approximately 102%.

The small discrepancies are a cosmetic/visual problem when I plot or display the data, because the graphs show totals that are either greater or less than 100%.

This is what the data looks like:

import pandas as pd

d = {'Name': ['US', 'US', 'US', 'CA'], 'Weight': [1, 1, 1.0197, 0.0009], 'ETF': ['SPY', 'IVV', 'VTI', 'VTI']}
df = pd.DataFrame(data=d)
df
    Name   Weight     ETF
0     US     1        SPY
1     US     1        IVV
2     US     1.0197   VTI
3     CA     0.0009   VTI

I have written some code (below) that tries to fix this, but I ran into another problem. The code computes the difference between the actual total and 100%, then spreads that difference equally across all the values for that ETF. The problem is that when the difference has to be subtracted, I end up with small but nonetheless negative values, which is not desirable.

def re_weight(df):
    # Shift every weight of an ETF by the same amount so that the ETF totals 1
    for etf in df['ETF'].unique():
        mask = df['ETF'] == etf
        n = mask.sum()                       # number of rows for this ETF
        total = float(df.loc[mask, 'Weight'].sum())
        filler = (1 - total) / n             # equal share of the discrepancy
        df.loc[mask, 'Weight'] = df.loc[mask, 'Weight'] + filler
    return df

countries = pd.read_csv('output\\countries.csv')

countries[['Weight','ETF']] = re_weight(countries[['Weight','ETF']])

This is the output of the above code. Every ETF now totals 1, but in certain places I am stuck with negative percentage values.

df = pd.DataFrame(data=d)
df
    Name   Weight     ETF
0     US     1        SPY
1     US     1        IVV
2     US     1.0094   VTI
3     CA    -0.0094   VTI

How can I adjust the percentages so that they always total 100% and there are no negative values?

Answer 1

You can use groupby.transform here to get the "incorrect" sum next to each row, and then divide each weight by that sum to correct it, as suggested by @ThierrLathuille in the comments:

print(df)
            Name  Weight  ETF
0  United States  1.0000  SPY
1  United States  1.0000  IVV
2  United States  1.0197  VTI
3         Canada  0.0009  VTI

Apply the logic explained above:

df['weight_recalc'] = df['Weight'] / df.groupby(['ETF']).Weight.transform('sum')
print(df)
            Name  Weight  ETF  weight_recalc
0  United States  1.0000  SPY       1.000000
1  United States  1.0000  IVV       1.000000
2  United States  1.0197  VTI       0.999118
3         Canada  0.0009  VTI       0.000882

Check that the recalculated weights now sum to 1 for each ETF:

print(df.groupby('ETF').weight_recalc.sum())
ETF
IVV    1.0
SPY    1.0
VTI    1.0
Name: weight_recalc, dtype: float64
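
If you need this in more than one place, the same idea can be wrapped in a small function. This is only a sketch: the name normalize_weights and its parameters are my own and not part of any library, and it assumes that all weights are non-negative and that no ETF has a total of zero.

```python
import pandas as pd


def normalize_weights(df, group_col="ETF", weight_col="Weight"):
    """Return a copy of df in which weights sum to 1 within each group.

    Hypothetical helper: assumes non-negative weights and a non-zero
    total per group.
    """
    out = df.copy()
    # Per-group total, broadcast back onto every row of that group
    totals = out.groupby(group_col)[weight_col].transform("sum")
    # Dividing by a positive total rescales without changing any sign
    out[weight_col] = out[weight_col] / totals
    return out


df = pd.DataFrame({
    "Name": ["US", "US", "US", "CA"],
    "Weight": [1, 1, 1.0197, 0.0009],
    "ETF": ["SPY", "IVV", "VTI", "VTI"],
})
fixed = normalize_weights(df)
print(fixed)
print(fixed.groupby("ETF")["Weight"].sum())
```

Returning a copy leaves the original weights untouched, which is handy if you want to compare the raw and corrected values later.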

Answer 2

You don't need to add or subtract anything, because doing so changes the proportions between the values.

Let's imagine that you have 3 data points:

US     40%
Canada 50%
Japan  30%

As you can see, the total is 40 + 50 + 30 = 120%.

And the proportions between the values are:

US / Canada = 40/50 = 0.8
US / Japan = 40/30 = 1.33333
Canada / Japan = 50/30 = 1.66666

Now, if we take 120 - 100 = 20 and subtract 1/3 of it from each data point, we get:

US     33.33333
Canada 43.33333
Japan  23.33333

And the proportions are now:

US / Canada = 33.33333/43.33333 = 0.769
US / Japan = 33.33333/23.33333 = 1.428
Canada / Japan = 43.33333/23.33333 = 1.857

See? The proportions have changed, and each one by a different amount.
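
Here is a quick way to check the subtraction example in plain Python. It is a sketch only: shift_to_total is a name I made up for the "add or subtract an equal share" approach, and the second call uses the VTI weights from the question.

```python
def shift_to_total(values, target=100.0):
    """Add the same constant to every value so that the total hits target."""
    filler = (target - sum(values)) / len(values)
    return [v + filler for v in values]


shifted = shift_to_total([40, 50, 30])
# Before the shift, US / Canada was 40 / 50 = 0.8
print(shifted[0] / shifted[1])

# VTI weights from the question, expressed as fractions of 1
vti = shift_to_total([1.0197, 0.0009], target=1.0)
print(vti)  # the second (Canada) entry comes out negative
```

The negative value appears whenever the equal share being subtracted is larger than the smallest weight.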

So, to keep them right, you just have to rescale your data.

1) Sum all the values:

30+40+50 = 120

2) Divide 100 by the resulting sum: 100/120 = 0.83333333

3) Multiply every value by the previous result (0.8333333 in this case).

In this example, we get:

US     33.33333
Canada 41.66666
Japan  25

You can check for yourself: the proportions have not changed in this case, and the sum is now equal to 100 (up to rounding).
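
The three steps can be checked in plain Python. Again just a sketch: scale_to_total is a hypothetical name, and it assumes the values are non-negative with a non-zero sum.

```python
def scale_to_total(values, target=100.0):
    """Multiply every value by the same factor so that the total hits target."""
    factor = target / sum(values)   # steps 1 and 2: sum, then target / sum
    return [v * factor for v in values]   # step 3: multiply each value


scaled = scale_to_total([40, 50, 30])
print(scaled)
# The ratio is the same as before scaling: 40 / 50 = 0.8
print(scaled[0] / scaled[1])

# VTI weights from the question, expressed as fractions of 1
vti = scale_to_total([1.0197, 0.0009], target=1.0)
print(vti)
```

Because every value is multiplied by the same positive factor, a non-negative input can never turn negative.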

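In code (I don't have much experience with pandas, so treat this as a sketch):

# Sum of the weights within each ETF, aligned to the original rows
s = df.groupby('ETF')['Weight'].transform('sum')
# Rescale so each ETF totals 100 (use 1 instead of 100 to keep fractions)
df['Weight'] = df['Weight'] * 100 / s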

2 answers

solution1
2 ACCEPTED 2019-03-30 18:44:10

solution2
1 2019-03-30 18:43:33
