I have data that shows what weight of an ETF is held in each country. The issue is that the data source has minor discrepancies in the weighting. For example, for ETF VTI the sum of all weights (USA + Canada) is 1.0206, which means the total is approximately 102%.
These small discrepancies become a cosmetic problem when I plot or display the data, because the graphs show totals that are either greater or less than 100%.
This is what the data looks like:
import pandas as pd
d = {'Name': ['US', 'US', 'US', 'CA'], 'Weight': [1, 1, 1.0197, 0.0009], 'ETF': ['SPY', 'IVV', 'VTI', 'VTI']}
df = pd.DataFrame(data=d)
df
Name Weight ETF
0 US 1 SPY
1 US 1 IVV
2 US 1.0197 VTI
3 CA 0.0009 VTI
I wrote the code below to try to fix this, but I ran into another problem. The code takes the difference between the actual total and 100% and spreads it evenly across all rows of the ETF, adding or subtracting the same amount from each weight. The problem is that when an amount has to be subtracted, I end up with small but nonetheless negative values, which is not desirable.
def re_weight(df):
    etfs = df['ETF'].unique()
    for etf in etfs:
        l = (df[df['ETF']==etf].shape)[0]
        total = float(df[df['ETF']==etf]['Weight'].sum())
        diff = 1-total
        filler = diff/l
        df.loc[df['ETF']==etf, 'Weight'] = df[df['ETF']==etf]['Weight']+filler
    return df
countries = pd.read_csv('output\\countries.csv')
countries[['Weight','ETF']] = re_weight(countries[['Weight','ETF']])
This is the output of the above code. Every ETF now totals 1, but in some rows I am stuck with negative percentage values.
df = pd.DataFrame(data=d)
df
Name Weight ETF
0 US 1 SPY
1 US 1 IVV
2 US 1.0094 VTI
3 CA -0.0094 VTI
How can I format the percentages so that they always total to 100% and that there are no negative values?
You can use groupby.transform here to get the "incorrect" sum next to each row and then divide each weight by that sum to make the correction, as suggested by @ThierrLathuille in the comments:
print(df)
Name Weight ETF
0 United States 1.0000 SPY
1 United States 1.0000 IVV
2 United States 1.0197 VTI
3 Canada 0.0009 VTI
Apply logic explained above
df['weight_recalc'] = df['Weight'] / df.groupby(['ETF']).Weight.transform('sum')
print(df)
Name Weight ETF weight_recalc
0 United States 1.0000 SPY 1.000000
1 United States 1.0000 IVV 1.000000
2 United States 1.0197 VTI 0.999118
3 Canada 0.0009 VTI 0.000882
Check that the recalculated weights sum to 1 per ETF
print(df.groupby('ETF').weight_recalc.sum())
ETF
IVV 1.0
SPY 1.0
VTI 1.0
Name: weight_recalc, dtype: float64
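If the same correction is needed in more than one place, the one-liner above could be wrapped in a small helper. This is only a sketch: the name `normalize_weights` and its parameters are hypothetical (not part of pandas), and it assumes the raw weights are non-negative and that no ETF has a total of zero. Because every weight is divided by a positive total, no value can become negative.

```python
import pandas as pd


def normalize_weights(df, group_col="ETF", weight_col="Weight"):
    """Return a copy of df with weight_col rescaled to sum to 1 per group.

    Hypothetical helper: assumes non-negative weights and non-zero group totals.
    """
    out = df.copy()
    # Per-group total, repeated on every row of that group
    totals = out.groupby(group_col)[weight_col].transform("sum")
    out[weight_col] = out[weight_col] / totals
    return out


df = pd.DataFrame({
    "Name": ["US", "US", "US", "CA"],
    "Weight": [1, 1, 1.0197, 0.0009],
    "ETF": ["SPY", "IVV", "VTI", "VTI"],
})
fixed = normalize_weights(df)
sums = fixed.groupby("ETF")["Weight"].sum()
print(fixed)
print(sums)
```

The helper returns a copy rather than modifying the input, so the original weights stay available for comparison.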
You shouldn't add or subtract anything, because that method changes the proportions between the values.
Let's imagine that you have 3 data points:
US 40%
Canada 50%
Japan 30%
As you can see, total percent is 40+50+30 = 120%.
And proportions between different values are:
US / Canada = 40/50 = 0.8
US / Japan = 40/30 = 1.33333
Canada / Japan = 50/30 = 1.66666
Now, if we take the excess, 120 - 100 = 20, and subtract 1/3 of it from each data point, we get:
US 33.33333
Canada 43.33333
Japan 23.33333
And proportions now are:
US / Canada = 33.3333/43.33333 = 0.769
US / Japan = 33.3333/23.3333 = 1.428
Canada / Japan = 43.33333/23.33333 = 1.857
See? The proportions have changed in an unpredictable way.
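The arithmetic above can be reproduced with a few lines of plain Python. This is just an illustrative sketch using the made-up US/Canada/Japan numbers from this answer; the variable names are my own.

```python
weights = {"US": 40.0, "Canada": 50.0, "Japan": 30.0}

excess = sum(weights.values()) - 100      # 120 - 100 = 20
shift = excess / len(weights)             # same amount removed from every value
shifted = {name: w - shift for name, w in weights.items()}

# The total is now 100, but the ratio between two values has moved.
ratio_before = weights["US"] / weights["Canada"]
ratio_after = shifted["US"] / shifted["Canada"]
print(round(ratio_before, 3), round(ratio_after, 3))  # 0.8 0.769
```

The same shift is also what produces negative values in the question: any weight smaller than the amount being subtracted ends up below zero.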
So, to keep them right, you just have to rescale your data.
1) Sum all the values:
30+40+50 = 120
2) Divide 100 by the resulting sum: 100/120 = 0.83333333
3) Multiply every value by that factor (0.8333333 in this case).
In this example, we get:
US 33.33333
Canada 41.66666
Japan 25
You can check for yourself, but the proportions haven't changed in this case, and the sum is now equal to 100 (with some rounding).
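In pandas this would look roughly like the following (I don't have much experience with pandas):
# sum of the weights within each ETF, repeated on every row of that ETF
s = df.groupby('ETF')['Weight'].transform('sum')
# rescale so that each ETF's weights total 100 (use 1 instead of 100 to keep fractions)
df['Weight'] = df['Weight'] * 100 / s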
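As a quick sanity check, the three steps can be run on the US/Canada/Japan example in plain Python. Again, this is a sketch on the illustrative numbers from this answer, not on real ETF data, and the variable names are my own.

```python
weights = {"US": 40.0, "Canada": 50.0, "Japan": 30.0}

total = sum(weights.values())     # step 1: 120
factor = 100 / total              # step 2: 0.8333...
scaled = {name: w * factor for name, w in weights.items()}  # step 3

# Multiplying every value by the same factor leaves all ratios unchanged.
ratio_before = weights["US"] / weights["Canada"]
ratio_after = scaled["US"] / scaled["Canada"]
print(scaled)
print(ratio_before, ratio_after)
```

Since the factor is positive, non-negative inputs stay non-negative, which avoids the negative weights from the question.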