How to assign slices to a pandas DataFrame based on equal portions of a column?

Question

I have sorted a DataFrame of roughly 1 million rows by a certain column. I would like to assign a group to each observation so that the groups have equal sums of another column, but I'm not sure how to do this.

Example below:

import pandas as pd
value1 = [25,27,20,22,28,20]
value2 = [.34,.43,.54,.43,.5,.7]

df = pd.DataFrame({'value1':value1,'value2':value2})

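# sort_values returns a new DataFrame, so the result must be assigned back
df = df.sort_values('value1', ascending=False)

# group labels in sorted order: 28 + 27 + 25 = 80 and 22 + 20 + 20 = 62
df['wanted_result'] = [1,1,1,2,2,2]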

As in this example, I want to sum a column (here value1) and assign groups whose value1 sums are as close to equal as possible. Is there a built-in function for this?

Answer 1

Greedy Loop

Using Numba's JIT to speed it up.

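from numba import njit

@njit
def partition(c, n):
    # c: cumulative sums of the sorted column; n: number of groups
    delta = c[-1] / n      # target sum per group
    group = 1
    indices = [group]      # the first row always belongs to group 1
    total = delta          # cumulative-sum boundary of the current group

    for left, right in zip(c, c[1:]):
        # Start a new group once the running sum has passed the boundary
        # and the previous row was closer to the boundary than this one.
        if right > total and abs(total - right) > abs(total - left):
            group += 1
            total += delta
        indices.append(group)

    return indices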

df.assign(result=partition(df.value1.to_numpy().cumsum(), n=2))

   value1  value2  result
4      28    0.50       1
1      27    0.43       1
0      25    0.34       1
3      22    0.43       2
2      20    0.54       2
5      20    0.70       2

This is NOT optimal. It is a greedy heuristic. It goes through the list and finds where the cumulative sum steps over into the next group. At that point it decides whether it is better to include the current point in the current group or in the next one.

This should behave pretty well except when the values differ hugely and the larger values come towards the end. That is because the algorithm is greedy: it only looks at what it knows at the moment, not at everything at once.

But like I said, it should be good enough.
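To see what the loop does without installing Numba, here is a plain-Python port of the same logic (a sketch only; `partition_py` and `group_sums` are names made up for this illustration). It reproduces the labels from the example and also shows the caveat above on a hypothetical input where one large value comes last.

```python
from itertools import accumulate

def partition_py(c, n):
    """Plain-Python port of the greedy loop; c is a list of cumulative sums."""
    delta = c[-1] / n
    group = 1
    indices = [group]
    total = delta
    for left, right in zip(c, c[1:]):
        if right > total and abs(total - right) > abs(total - left):
            group += 1
            total += delta
        indices.append(group)
    return indices

def group_sums(values, labels):
    """Sum the values that share a label (hypothetical helper)."""
    sums = {}
    for v, g in zip(values, labels):
        sums[g] = sums.get(g, 0) + v
    return sums

values = [28, 27, 25, 22, 20, 20]            # value1, sorted descending
labels = partition_py(list(accumulate(values)), n=2)
print(labels)                                # [1, 1, 1, 2, 2, 2]
print(group_sums(values, labels))            # {1: 80, 2: 62}

# Made-up input with one large value at the end: three groups are
# requested, but the loop only ever opens two.
skewed = [1, 1, 1, 1, 100]
print(partition_py(list(accumulate(skewed)), n=3))   # [1, 1, 1, 1, 2]
```

On the skewed input the boundary only advances once, so the third group is never used. That is the kind of case where the greedy approach falls short.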

Answer 2

I think this is a kind of (non-linear) optimization problem, and pandas is definitely not a good candidate for solving it.

The basic idea for solving the problem is as follows:

Definitions:
- n - the number of elements,
- groupNo - the number of groups to divide into.
Start by generating an initial solution, e.g. put consecutive groups of n / groupNo elements into each bin.
Define the goal function, e.g. the sum of squared differences between each group's sum and (sum of all elements) / groupNo.
Perform an iteration:
- for each pair of elements a and b from different bins, calculate the new value of the goal function if these two elements were swapped between their bins,
- select the pair that gives the greatest improvement of the goal function and perform the move (move a from its present bin to the bin where b is, and vice versa).
If no such pair can be found, we have the final result.
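The steps above could be sketched roughly like this in plain Python. The function name `balance_by_swaps`, the 0-based bin labels and the small tolerance are my own choices for illustration, not part of the description. Each pass checks every pair of elements, so this is quadratic per iteration and only meant to show the idea on small inputs, not to run on a million rows.

```python
def balance_by_swaps(values, group_no):
    """Pairwise-swap local search; returns (bin label per element, bin sums)."""
    n = len(values)
    target = sum(values) / group_no          # ideal sum of every bin

    # Initial solution: consecutive runs of about n / group_no elements.
    bins = [i * group_no // n for i in range(n)]
    sums = [0.0] * group_no
    for v, b in zip(values, bins):
        sums[b] += v

    while True:
        best_gain, best_pair = 1e-12, None   # tolerance guards against float noise
        for i in range(n):
            for j in range(i + 1, n):
                a, b = bins[i], bins[j]
                if a == b:
                    continue
                d = values[j] - values[i]    # change of bin a's sum after the swap
                old = (sums[a] - target) ** 2 + (sums[b] - target) ** 2
                new = (sums[a] + d - target) ** 2 + (sums[b] - d - target) ** 2
                if old - new > best_gain:
                    best_gain, best_pair = old - new, (i, j)
        if best_pair is None:                # no swap improves the goal function
            return bins, sums
        i, j = best_pair
        d = values[j] - values[i]
        sums[bins[i]] += d
        sums[bins[j]] -= d
        bins[i], bins[j] = bins[j], bins[i]

values = [28, 27, 25, 22, 20, 20]
bins, sums = balance_by_swaps(values, group_no=2)
print(bins)   # [1, 0, 0, 1, 0, 1]
print(sums)   # [72.0, 70.0]
```

On the question's six values the initial consecutive split has sums 80 and 62. One swap (28 for a 20) brings them to 72 and 70, after which no further swap helps. Note that the resulting groups are no longer contiguous in the sorted order, and that swapping keeps the bin sizes fixed.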

Maybe someone will propose a better solution, but at least this gives a concept to start from.

Answer 1: score 2, accepted, 2019-06-18 16:31:15
Answer 2: score 1, 2019-06-18 16:52:06
