Get a list of percentages that sum to 100

Question

I have data showing what weight of an ETF is held in each country. The issue is that the data source has minor discrepancies in the weighting. For example, for the ETF VTI the sum of all weights (USA + Canada) is 1.0206, which means the total is approximately 102%.

The small discrepancies are a problem when I plot or display the data. It is a cosmetic/visual issue: the graphs show totals that are greater or less than 100%.

This is what the data looks like:

import pandas as pd

d = {'Name': ['US', 'US', 'US', 'CA'], 'Weight': [1, 1, 1.0197, 0.0009], 'ETF': ['SPY', 'IVV', 'VTI', 'VTI']}
df = pd.DataFrame(data=d)
df
    Name   Weight     ETF
0     US     1        SPY
1     US     1        IVV
2     US     1.0197   VTI
3     CA     0.0009   VTI

I have written some code, shown below, that tries to fix this, but I ran into another problem. The code takes the difference between the actual total and 100%, then adds or subtracts an equal share of that difference to every value in the list. The problem is that when the difference has to be subtracted, I end up with small but nonetheless negative values, which is not desirable.

def re_weight(df):
    # Spread each ETF's surplus or shortfall evenly across its rows.
    for etf in df['ETF'].unique():
        mask = df['ETF'] == etf
        n_rows = mask.sum()
        total = float(df.loc[mask, 'Weight'].sum())
        diff = 1 - total
        filler = diff / n_rows
        # Additive correction: this is the step that can push small weights below zero.
        df.loc[mask, 'Weight'] = df.loc[mask, 'Weight'] + filler
    return df

countries = pd.read_csv('output\\countries.csv')
countries[['Weight', 'ETF']] = re_weight(countries[['Weight', 'ETF']])

This is the output of the above code. Every ETF now sums to 1, but in some places I am stuck with negative percentage values.

re_weight(df)
    Name   Weight     ETF
0     US     1        SPY
1     US     1        IVV
2     US     1.0094   VTI
3     CA    -0.0094   VTI

How can I adjust the percentages so that they always total 100% and there are no negative values?

Answer 1

You can use groupby.transform here to get the "incorrect" sum next to each row, and then divide by that amount to make the correction, as suggested by @ThierrLathuille in the comments:

print(df)
            Name  Weight  ETF
0  United States  1.0000  SPY
1  United States  1.0000  IVV
2  United States  1.0197  VTI
3         Canada  0.0009  VTI

Apply the logic explained above:

df['weight_recalc'] = df['Weight'] / df.groupby(['ETF']).Weight.transform('sum')
print(df)
            Name  Weight  ETF  weight_recalc
0  United States  1.0000  SPY       1.000000
1  United States  1.0000  IVV       1.000000
2  United States  1.0197  VTI       0.999118
3         Canada  0.0009  VTI       0.000882

Show that the recalculation is correct:

print(df.groupby('ETF').weight_recalc.sum())
ETF
IVV    1.0
SPY    1.0
VTI    1.0
Name: weight_recalc, dtype: float64
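The same idea can be wrapped in a small helper. This is only a sketch: the function name normalize_weights and its parameters are hypothetical, and it assumes every ETF has a non-zero total weight (a zero total would produce NaN or inf after the division).

```python
import pandas as pd

def normalize_weights(df, group_col="ETF", weight_col="Weight"):
    """Return a copy of df with weight_col rescaled to sum to 1 within each group."""
    out = df.copy()
    # Total weight of the group each row belongs to, aligned row by row.
    group_totals = out.groupby(group_col)[weight_col].transform("sum")
    # Dividing by a positive total keeps non-negative weights non-negative.
    out[weight_col] = out[weight_col] / group_totals
    return out

df = pd.DataFrame({
    "Name": ["US", "US", "US", "CA"],
    "Weight": [1, 1, 1.0197, 0.0009],
    "ETF": ["SPY", "IVV", "VTI", "VTI"],
})
normalized = normalize_weights(df)
print(normalized)
```

Because each weight is divided by a positive total rather than shifted by a constant, no value can change sign, which avoids the negative weights from the question.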

Answer 2

You don't need to add or subtract anything, because that method changes the proportions between the values.

Let's imagine that you have 3 data points:

US     40%
Canada 50%
Japan  30%

As you can see, the total is 40 + 50 + 30 = 120%.

And the proportions between the values are:

US / Canada = 40/50 = 0.8
US / Japan = 40/30 = 1.33333
Canada / Japan = 50/30 = 1.66666

Now, if we take 120 - 100 = 20 and subtract 1/3 of it from each data point, we get:

US     33.33333
Canada 43.33333
Japan  23.33333

And the proportions are now:

US / Canada = 33.3333/43.33333 = 0.769
US / Japan = 33.3333/23.3333 = 1.428
Canada / Japan = 43.33333/23.33333 = 1.857

See? The proportions have changed in an unpredictable way.
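As a rough check of the arithmetic above, the following plain-Python sketch applies the same even subtraction to the three data points, and then to the VTI weights from the question to show where the negative value comes from. The variable names are illustrative only.

```python
# Even (additive) correction on the 3-point example.
weights = {"US": 40.0, "Canada": 50.0, "Japan": 30.0}
excess = sum(weights.values()) - 100       # 20.0
shift = excess / len(weights)              # subtracted from every value
shifted = {k: v - shift for k, v in weights.items()}

ratio_before = weights["US"] / weights["Canada"]
ratio_after = shifted["US"] / shifted["Canada"]
print(round(ratio_before, 3), round(ratio_after, 3))  # 0.8 0.769

# Same correction on the VTI weights from the question (target total is 1).
vti = [1.0197, 0.0009]
vti_shift = (sum(vti) - 1) / len(vti)
vti_shifted = [w - vti_shift for w in vti]
print([round(w, 4) for w in vti_shifted])  # [1.0094, -0.0094]
```

Any weight smaller than the per-row shift ends up below zero, which is exactly what happens to the Canada row.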

So, to keep them intact, you just have to rescale your data.

1) Sum all the values:

30 + 40 + 50 = 120

2) Divide 100 by the sum: 100 / 120 = 0.83333333

3) Multiply every value by the previous result (0.8333333 in this case).

In this example, we get:

US     33.33333
Canada 41.66666
Japan  25

You can check, but I assure you the proportions did not change in this case, and the sum is now equal to 100 (with some rounding).
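The three steps can be checked with a few lines of plain Python. This is a minimal sketch using the example numbers above; the variable names are illustrative.

```python
weights = {"US": 40.0, "Canada": 50.0, "Japan": 30.0}

total = sum(weights.values())      # step 1: 120.0
factor = 100 / total               # step 2: about 0.8333
scaled = {k: v * factor for k, v in weights.items()}  # step 3

# Ratios between any two entries are unchanged by a common factor.
ratio_before = weights["US"] / weights["Canada"]
ratio_after = scaled["US"] / scaled["Canada"]
print(scaled)
print(ratio_before, ratio_after)
```

Multiplying every value by the same positive factor cancels out in any ratio, so the proportions survive, and non-negative inputs stay non-negative.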

In code (I don't have much experience with the pandas library), applied per ETF:

s = df.groupby('ETF')['Weight'].transform('sum')  # total weight of each row's ETF
df['Weight'] = df['Weight'] * 100 / s  # rescale so each ETF sums to 100

Answer 1 (score 2, accepted): 2019-03-30 18:44:10
Answer 2 (score 1): 2019-03-30 18:43:33
