Sorting in pandas groupby with two columns

Question

I am trying to group a dataframe by two columns and avoid the default sorting by using 'sort = False'. However, I am unable to achieve this.

Here is a simplified example:

import pandas as pd

df = pd.DataFrame([
        ['zebra', 1, 10],
        ['zebra', 2, 10],
        ['apple', 3, 20],
        ['apple', 4, 20],
    ],
    columns=['ColA','ColB','ColC'])

df is therefore

    ColA  ColB  ColC
0  zebra     1    10
1  zebra     2    10
2  apple     3    20
3  apple     4    20

I am using pandas (1.0.3) groupby and disabling sorting of the keys:

df_agg = df.groupby(by=['ColA','ColB'], sort = False)

df_agg.groups

results in

{('apple', 3): Int64Index([2], dtype='int64'),
 ('apple', 4): Int64Index([3], dtype='int64'),
 ('zebra', 1): Int64Index([0], dtype='int64'),
 ('zebra', 2): Int64Index([1], dtype='int64')}

which is the same as with "sort = True" (the default).

However, what I would like is the following:

{
 ('zebra', 1): Int64Index([0], dtype='int64'),
 ('zebra', 2): Int64Index([1], dtype='int64'),
 ('apple', 3): Int64Index([2], dtype='int64'),
 ('apple', 4): Int64Index([3], dtype='int64')
}

'sort = False' seems to work fine when grouping by one column.

df_agg = df.groupby(by=['ColA'], sort = False)
df_agg.groups

results in

{'zebra': Int64Index([0, 1], dtype='int64'),
 'apple': Int64Index([2, 3], dtype='int64')}

It looks as if 'sort = False' only takes effect for a single column and not for tuples of keys. I could sort the groups dict based on the tuple, but I am using an application that expects a groupby object. I would appreciate any pointers on how this can be addressed.

Answer 1

The groups attribute is a dictionary and NOT where the order of groups is determined. You must "resolve" the groupby object with some operation to determine what the order is/was.

df.groupby(['ColA', 'ColB'], sort=False, as_index=False).first()

    ColA  ColB  ColC
0  zebra     1    10
1  zebra     2    10
2  apple     3    20
3  apple     4    20

Versus

df.groupby(['ColA', 'ColB'], as_index=False).first()

    ColA  ColB  ColC
0  apple     3    20
1  apple     4    20
2  zebra     1    10
3  zebra     2    10

The ACTUAL place to look is the groupby object's ngroup method:

g1 = df.groupby(['ColA', 'ColB'], sort=False, as_index=False)
g1.ngroup()

0    0
1    1
2    2
3    3
dtype: int64

Versus

g2 = df.groupby(['ColA', 'ColB'], as_index=False)
g2.ngroup()

0    2
1    3
2    0
3    1
dtype: int64
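
As a rough illustration of the point above, iterating over the groupby object is another operation that "resolves" it, and the keys should come out in the same order that ngroup reports. The sketch below rebuilds the question's dataframe; the names iteration_keys and ordered_groups are made up for this example and are not part of pandas. Since Python dicts keep insertion order, a dict built this way reflects the group order, although it is a plain dict and not a groupby object.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    [["zebra", 1, 10], ["zebra", 2, 10], ["apple", 3, 20], ["apple", 4, 20]],
    columns=["ColA", "ColB", "ColC"],
)

g = df.groupby(["ColA", "ColB"], sort=False)

# Iterating a groupby yields (key, sub-frame) pairs, one per group.
iteration_keys = [key for key, _ in g]

# Hypothetical helper dict: group key -> row labels, built in iteration order.
ordered_groups = {key: sub.index for key, sub in g}

print(iteration_keys)
```

If the consuming application only iterates over the groupby object or aggregates it, passing g itself (built with sort=False) may already be enough; that depends on how the application reads the groups, which the question does not say.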

Answer 2

Let's use a pseudo sort key; here I create one using pd.factorize:

df.assign(sortkey=pd.factorize(df['ColA'])[0]).groupby(['sortkey', 'ColA', 'ColB']).groups

Output:

{(0, 'zebra', 1): Int64Index([0], dtype='int64'),
 (0, 'zebra', 2): Int64Index([1], dtype='int64'),
 (1, 'apple', 3): Int64Index([2], dtype='int64'),
 (1, 'apple', 4): Int64Index([3], dtype='int64')}
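
To make the trick a little more explicit, here is a small sketch of what pd.factorize returns and why it works as a sort key: it assigns integer codes in order of first appearance, so a normal sorted groupby that leads with those codes puts 'zebra' before 'apple'. The variable names codes, uniques and result_keys are only illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    [["zebra", 1, 10], ["zebra", 2, 10], ["apple", 3, 20], ["apple", 4, 20]],
    columns=["ColA", "ColB", "ColC"],
)

# factorize returns (integer codes, unique values); codes are numbered
# by first appearance of each value, not alphabetically.
codes, uniques = pd.factorize(df["ColA"])

# Group with the default sort=True; the leading sortkey column drives the order.
grouped = df.assign(sortkey=codes).groupby(["sortkey", "ColA", "ColB"])

# Resolve the groupby with an aggregation and read the key order from its index.
result_keys = grouped.size().index.tolist()
print(result_keys)
```

One caveat under this approach: only ColA is protected by the sort key. Within each ColA value, ColB is still sorted in ascending order, which happens to match the original row order in this example.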

2 answers

Answer 1: 4, accepted, 2020-04-20 21:45:58

Answer 2: 3, 2020-04-20 21:39:53
