How do I assign tiles to a pandas data frame based on equal parts of a column?
I have sorted a roughly 1 million row dataframe by a certain column. I would like to assign groups to each observation based on equal sums of another column but I'm not sure how to do this.
Example below:
import pandas as pd
value1 = [25,27,20,22,28,20]
value2 = [.34,.43,.54,.43,.5,.7]
df = pd.DataFrame({'value1':value1,'value2':value2})
df = df.sort_values('value1', ascending=False)
df['wanted_result'] = [1,1,1,2,2,2]
As in this example, I want to sum a column (value1 here) and assign groups so that their value1 sums are as close to equal as possible. Is there a built-in function for this?
Using Numba's JIT to speed it up.
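from numba import njit

@njit
def partition(c, n):
    # c is the cumulative sum of the column, n the number of groups
    delta = c[-1] / n          # ideal sum per group
    group = 1
    indices = [group]
    total = delta              # cumulative target for the current group
    for left, right in zip(c, c[1:]):
        left_diff = total - left
        right_diff = total - right
        # start a new group when the next point overshoots the target
        # and lies farther from it than the current point does
        if right > total and abs(right_diff) > abs(left_diff):
            group += 1
            total += delta
        indices.append(group)
    return indices
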
df.assign(result=partition(df.value1.to_numpy().cumsum(), n=2))
   value1  value2  result
4      28    0.50       1
1      27    0.43       1
0      25    0.34       1
3      22    0.43       2
2      20    0.54       2
5      20    0.70       2
This is NOT optimal. It is a greedy heuristic. It walks through the list and finds where we step over into the next group. At that point it decides whether the current point is better placed in the current group or in the next one.
This should behave pretty well except when the values differ hugely and the larger values come towards the end. That is because the algorithm is greedy: it only looks at what it knows at the moment, not at everything at once.
But like I said, it should be good enough.
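For readers who want to try the rule without Numba, here is a minimal pure-Python sketch of the same greedy decision. The names greedy_partition and group_sums are made up for this illustration, and the second input list is an invented example of small values followed by one large value at the end.

```python
from itertools import accumulate

def greedy_partition(values, n):
    """Label each value with a group number, using the greedy rule above."""
    c = list(accumulate(values))   # running totals (the cumsum)
    delta = c[-1] / n              # ideal sum per group
    group, total = 1, delta
    labels = [group]
    for left, right in zip(c, c[1:]):
        # Open a new group when the next running total overshoots the target
        # and is farther from the target than the current running total.
        if right > total and abs(total - right) > abs(total - left):
            group += 1
            total += delta
        labels.append(group)
    return labels

def group_sums(values, labels):
    """Sum of the values in each group, keyed by group number."""
    sums = {}
    for v, g in zip(values, labels):
        sums[g] = sums.get(g, 0) + v
    return sums

sorted_values = [28, 27, 25, 22, 20, 20]
labels = greedy_partition(sorted_values, 2)
print(labels, group_sums(sorted_values, labels))

# Invented case: one large value at the end, three groups requested.
skewed = [1, 1, 1, 1, 10]
print(greedy_partition(skewed, 3))
```

On the sorted example this gives the labels 1, 1, 1, 2, 2, 2 with group sums 80 and 62. On the skewed list with n=3 it returns 1, 1, 1, 1, 2, so only two groups are ever opened: the rule can advance at most one group per element, which is one way the weakness described above shows up.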
I think this is a kind of (non-linear) optimization problem, and Pandas is definitely not a good candidate for solving it.
The basic idea to solve the problem can be as follows:
Definitions: n is the number of elements and groupNo is the number of groups.
Start by generating an initial solution, e.g. put consecutive runs of n / groupNo elements into each bin.
Define the goal function, e.g. the sum of squared differences between the sum of each group and (sum of all elements) / groupNo.
Perform an iteration:
If no such pair can be found, then we have the final result.
Maybe someone will propose a better solution, but at least this is a concept to start with.
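A minimal pure-Python sketch of this idea follows. The individual iteration steps are not spelled out above, so the rule used here is an assumption: try swapping every pair of elements taken from two different bins, apply the swap that lowers the goal function the most, and stop when no swap lowers it. The names initial_solution, goal and improve are hypothetical.

```python
def initial_solution(values, group_no):
    """Put consecutive runs of about n / group_no elements into each bin."""
    size, extra = divmod(len(values), group_no)
    bins, start = [], 0
    for i in range(group_no):
        end = start + size + (1 if i < extra else 0)
        bins.append(list(values[start:end]))
        start = end
    return bins

def goal(bins):
    """Sum of squared differences between each bin sum and total / group_no."""
    target = sum(sum(b) for b in bins) / len(bins)
    return sum((sum(b) - target) ** 2 for b in bins)

def improve(bins):
    """Assumed iteration: repeat the single best pairwise swap until none helps."""
    bins = [list(b) for b in bins]          # work on a copy
    while True:
        best, best_swap = goal(bins), None
        for i in range(len(bins)):
            for j in range(i + 1, len(bins)):
                for a in range(len(bins[i])):
                    for b in range(len(bins[j])):
                        # try the swap, score it, then undo it
                        bins[i][a], bins[j][b] = bins[j][b], bins[i][a]
                        score = goal(bins)
                        if score < best:
                            best, best_swap = score, (i, a, j, b)
                        bins[i][a], bins[j][b] = bins[j][b], bins[i][a]
        if best_swap is None:               # no such pair: final result
            return bins
        i, a, j, b = best_swap
        bins[i][a], bins[j][b] = bins[j][b], bins[i][a]

start = initial_solution([28, 27, 25, 22, 20, 20], 2)
result = improve(start)
print(goal(start), goal(result), [sum(b) for b in result])
```

On the six example values the goal function drops from 162 (bin sums 80 and 62) to 2 (bin sums 72 and 70) after one swap. Each pass scores every cross-bin pair, so this version is only meant to show the concept; it would be far too slow as written for a frame with about a million rows.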