
How do I assign tiles to a pandas data frame based on equal parts of a column?

I have sorted a dataframe of roughly 1 million rows by a certain column. I would like to assign a group to each observation so that the groups have equal sums of another column, but I'm not sure how to do this.

Example below:

import pandas as pd
value1 = [25,27,20,22,28,20]
value2 = [.34,.43,.54,.43,.5,.7]

df = pd.DataFrame({'value1':value1,'value2':value2})

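# sort_values returns a new frame, so assign the result back
df = df.sort_values('value1', ascending=False)

# desired group labels, in sorted row order
df['wanted_result'] = [1, 1, 1, 2, 2, 2]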

As in this example, I want to sum a column (value1 here) and assign groups whose value1 sums are as close to equal as possible. Is there a built-in function for this?

Greedy Loop

Using Numba's JIT to speed it up.

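from numba import njit

@njit
def partition(c, n):
    # c: cumulative sums of the sorted values; n: number of groups
    delta = c[-1] / n      # target sum per group
    group = 1
    indices = [group]
    total = delta          # cumulative target at which the current group should end

    for left, right in zip(c, c[1:]):
        left_diff = abs(total - left)
        right_diff = abs(total - right)
        # Start a new group once the running sum passes the target and the
        # previous point sat closer to the target than the current one.
        if right > total and right_diff > left_diff:
            group += 1
            total += delta
        indices.append(group)

    return indices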

df.assign(result=partition(df.value1.to_numpy().cumsum(), n=2))

   value1  value2  result
4      28    0.50       1
1      27    0.43       1
0      25    0.34       1
3      22    0.43       2
2      20    0.54       2
5      20    0.70       2
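To judge how balanced the grouping is, the value1 sum of each group can be compared against the ideal share (total / number of groups). A minimal sketch, with the group labels copied by hand from the output above rather than recomputed:

```python
import pandas as pd

df = pd.DataFrame({
    'value1': [25, 27, 20, 22, 28, 20],
    'value2': [.34, .43, .54, .43, .5, .7],
}).sort_values('value1', ascending=False)

# Labels copied from the output above, in sorted row order.
df['result'] = [1, 1, 1, 2, 2, 2]

group_sums = df.groupby('result')['value1'].sum()
ideal = df['value1'].sum() / group_sums.size  # 142 / 2 = 71

print(group_sums)          # group 1 -> 80, group 2 -> 62
print(group_sums - ideal)  # deviation of each group from the ideal share
```

Each group is 9 away from the ideal share of 71, which is the kind of gap a greedy split can leave.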

This is NOT optimal. It is a greedy heuristic: it walks through the list and finds where the running sum steps over into the next group. At that point it decides whether the current point is better placed in the current group or the next one.

This should behave pretty well except when the values differ hugely and the larger values come towards the end. That is because the algorithm is greedy: it only looks at what it knows at the moment, not at everything at once.
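One way this can show up, sketched with a plain-Python port of the partition function above (the name partition_py and the input values are made up for illustration; the logic is meant to mirror the Numba version):

```python
from itertools import accumulate

def partition_py(c, n):
    """Plain-Python port of partition (no Numba), for illustration only."""
    delta = c[-1] / n
    group = 1
    indices = [group]
    total = delta
    for left, right in zip(c, c[1:]):
        if right > total and abs(total - right) > abs(total - left):
            group += 1
            total += delta
        indices.append(group)
    return indices

values = [1, 1, 1, 97]              # small values first, one huge value last
cumsum = list(accumulate(values))   # [1, 2, 3, 100]
labels = partition_py(cumsum, n=4)
print(labels)  # [1, 1, 1, 2]
```

Four groups were requested but only two are used: the last value jumps past several group targets at once, and the loop advances by at most one group per element.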

But like I said, it should be good enough.

I think this is a kind of (non-linear) optimization problem, and pandas is definitely not a good candidate for solving it.

The basic idea for solving the problem can be as follows:

  1. Definitions:

    • n - the number of elements,
    • groupNo - the number of groups to divide into.
  2. Start by generating an initial solution, e.g. put consecutive runs of n / groupNo elements into each bin.

  3. Define the goal function, e.g. the sum of squares of the differences between the sum of each group and (sum of all elements) / groupNo.

  4. Perform an iteration:

    • for each pair of elements a and b from different bins, calculate the new goal function value that would result if the two elements swapped bins,
    • select the pair that gives the greatest improvement of the goal function and perform the move (move a from its present bin to the bin where b is, and vice versa).
  5. If no such pair can be found, then we have the final result.
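The steps above could be sketched roughly as follows. This is only one possible reading of the outline: the function name balance_by_swaps is hypothetical, and scanning every pair costs O(n²) per iteration, so it is an illustration for small inputs rather than something to run on a million rows.

```python
def balance_by_swaps(values, group_no):
    """Local search: repeatedly swap the pair of elements that most reduces
    the sum of squared deviations of the group sums from their common target."""
    n = len(values)
    target = sum(values) / group_no

    # Step 2: initial solution, consecutive runs of about n / group_no elements.
    bins = [i * group_no // n for i in range(n)]
    sums = [0.0] * group_no
    for v, b in zip(values, bins):
        sums[b] += v

    while True:
        best_gain, best_pair = 1e-9, None  # tolerance guards against float noise
        for i in range(n):
            for j in range(i + 1, n):
                a, b = bins[i], bins[j]
                if a == b:
                    continue
                d = values[j] - values[i]  # change in sums[a] if i and j trade bins
                old = (sums[a] - target) ** 2 + (sums[b] - target) ** 2
                new = (sums[a] + d - target) ** 2 + (sums[b] - d - target) ** 2
                if old - new > best_gain:
                    best_gain, best_pair = old - new, (i, j)
        if best_pair is None:  # Step 5: no improving swap is left
            return [b + 1 for b in bins], sums
        i, j = best_pair
        d = values[j] - values[i]
        sums[bins[i]] += d
        sums[bins[j]] -= d
        bins[i], bins[j] = bins[j], bins[i]

labels, sums = balance_by_swaps([28, 27, 25, 22, 20, 20], group_no=2)
print(labels, sums)  # group sums end at 72 and 70
```

On the question's data, a single swap moves the group sums from 80 and 62 to 72 and 70. Like any local search, this stops at the first arrangement that no single swap improves, which is not guaranteed to be the global optimum.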

Maybe someone will propose a better solution, but at least this gives a concept to start from.


