Ensure rounded percentages sum up to 100 per group (largest remainder method)

How can I update a column of weights, grouped by a unique name, in Pandas using the 'largest remainder method'? I want the weights in each group to add up to 100% after they are rounded to 2 decimal places.

Input dataframe:

print(df)
     Name    Weight
0    John    33.3333
1    John    33.3333
2    John    33.3333
3    James   50
4    James   25
5    James   25
6    Kim     6.6666
5    Kim     93.3333
6    Jane    46.6666
7    Jane    6.6666
8    Jane    46.6666

Expected results:

print(df)
     Name    Weight   New Weight
0    John    33.3333  33.33
1    John    33.3333  33.33
2    John    33.3333  33.34
3    James   50       50
4    James   25       25
5    James   25       25
6    Kim     6.6666   6.66
5    Kim     93.3333  93.34
6    Jane    46.6666  46.66
7    Jane    6.6666   6.67
8    Jane    46.6666  46.67

I've tried to apply the following functions:

Python Percentage Rounding

def round_to_100_percent(number_set, digit_after_decimal=2):
    """
        Take a list of numbers and return a list of percentages, each representing that number's share of the sum of all numbers.
        The percentages are guaranteed to add up to 100%.
        Note: the algorithm used here is 'Largest Remainder'.
        The downside is that the results are not exact, but rounded values never are anyway :)
    """
    unround_numbers = [x / float(sum(number_set)) * 100 * 10 ** digit_after_decimal for x in number_set]
    decimal_part_with_index = sorted([(index, unround_numbers[index] % 1) for index in range(len(unround_numbers))], key=lambda y: y[1], reverse=True)
    remainder = 100 * 10 ** digit_after_decimal - sum([int(x) for x in unround_numbers])
    index = 0
    while remainder > 0:
        unround_numbers[decimal_part_with_index[index][0]] += 1
        remainder -= 1
        index = (index + 1) % len(number_set)
    return [int(x) / float(10 ** digit_after_decimal) for x in unround_numbers]

Split (explode) pandas dataframe string entry to separate rows

def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)
    return res

This is what I tried so far:

new_column = df.groupby('Name')['Weight'].apply(round_to_100_percent)

#Merge new_column into main data frame
df = pd.merge(df, new_column, on='Name', how='outer')

#For some reason _y is added to col
df = df.explode('Weight_y')

df['New Weight'] = df['Weight_y']*0.01

This fails in a couple of ways. Sometimes the result has more rows than the original dataframe, and I'm not sure why a Weight_y column is being created.

Is there a better way to apply largest remainder rounding to a Pandas column?

Here is a simple approach: round every weight to 2 decimals, then replace the last item of each group with whatever is needed to reach 100, which absorbs any missing or extra difference (you can adjust a different item if you like):

df['rounded'] = (df['Weight']
 .round(2)
 .groupby(df['Name'])
 .transform(lambda s: pd.Series({s.index[-1]: (100-s.iloc[:-1].sum()).round(2)})
                        .combine_first(s))
)

Output:

    Name   Weight  rounded
0   John  33.3333    33.33
1   John  33.3333    33.33
2   John  33.3333    33.34
3  James  50.0000    50.00
4  James  25.0000    25.00
5  James  25.0000    25.00
6    Kim   6.6666     6.67
5    Kim  93.3333    93.33
6   Jane  46.6666    46.67
7   Jane   6.6666     6.67
8   Jane  46.6666    46.66

Checking the sum:

df.groupby('Name')['rounded'].sum()

James    100.0
Jane     100.0
John     100.0
Kim      100.0
Name: rounded, dtype: float64
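
The approach above always adjusts the last row of each group, which is not quite the largest remainder method the question asks about. A minimal sketch of that method follows, assuming each group should first be rescaled to sum to exactly 100 and that ties in the fractional part are broken by row order. The helper name `largest_remainder` and its `total` and `decimals` parameters are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd


def largest_remainder(weights, total=100.0, decimals=2):
    """Round `weights` to `decimals` places so that they sum to `total`.

    Hypothetical helper: a sketch of largest remainder rounding.
    """
    scale = 10 ** decimals
    values = np.asarray(weights, dtype=float)
    # Rescale so the group sums to `total`, in units of 10**-decimals.
    # The np.round call only trims floating-point noise before flooring.
    scaled = np.round(values / values.sum() * total * scale, 8)
    floors = np.floor(scaled).astype(int)
    # Number of units still missing after rounding everything down.
    shortfall = int(round(total * scale)) - int(floors.sum())
    # Give one unit to each of the entries with the largest fractional
    # parts; a stable sort keeps ties in their original row order.
    order = np.argsort(-(scaled - floors), kind="stable")
    floors[order[:shortfall]] += 1
    return floors / scale


df = pd.DataFrame({
    "Name": ["John"] * 3 + ["James"] * 3 + ["Kim"] * 2 + ["Jane"] * 3,
    "Weight": [33.3333, 33.3333, 33.3333, 50, 25, 25,
               6.6666, 93.3333, 46.6666, 6.6666, 46.6666],
})

# transform keeps the original row order and index, so no merge or
# explode step is needed.
df["New Weight"] = df.groupby("Name")["Weight"].transform(largest_remainder)
print(df)
print(df.groupby("Name")["New Weight"].sum().round(2))
```

Because `transform` returns one value per input row, the result aligns with the original dataframe directly. Which row receives the extra 0.01 depends on the fractional parts after rescaling and on the tie-breaking rule, so the output can differ from a hand-written table: in this sketch the first John row is the one bumped to 33.34.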
