Sorting in pandas groupby with two columns
I am trying to group a dataframe by two columns and avoid the default sorting by using 'sort = False'. However, I am unable to achieve this.
Here is a simplified example:
import pandas as pd

df = pd.DataFrame([
    ['zebra', 1, 10],
    ['zebra', 2, 10],
    ['apple', 3, 20],
    ['apple', 4, 20],
],
columns=['ColA','ColB','ColC'])
df is therefore
ColA ColB ColC
0 zebra 1 10
1 zebra 2 10
2 apple 3 20
3 apple 4 20
I am using pandas (1.0.3) groupby with sorting of the keys disabled:
df_agg = df.groupby(by=['ColA','ColB'], sort = False)
df_agg.groups
results in
{('apple', 3): Int64Index([2], dtype='int64'),
('apple', 4): Int64Index([3], dtype='int64'),
('zebra', 1): Int64Index([0], dtype='int64'),
('zebra', 2): Int64Index([1], dtype='int64')}
which is the same as with "sort = True" (the default).
However, what I would like is the following:
{
('zebra', 1): Int64Index([0], dtype='int64'),
('zebra', 2): Int64Index([1], dtype='int64'),
('apple', 3): Int64Index([2], dtype='int64'),
('apple', 4): Int64Index([3], dtype='int64')
}
'sort = False' seems to work fine when grouping by one column.
df_agg = df.groupby(by=['ColA'], sort = False)
df_agg.groups
results in
{'zebra': Int64Index([0, 1], dtype='int64'),
'apple': Int64Index([2, 3], dtype='int64')}
Perhaps 'sort = False' only works with a single column and not with tuples of keys? I could sort the groups dict based on the tuple, but I am using an application that expects a groupby object. I would appreciate any pointers on how this can be addressed.
The groups attribute is a dictionary and NOT where the order of groups is determined. You must "resolve" the groupby object with some operation to determine what the order is/was.
df.groupby(['ColA', 'ColB'], sort=False, as_index=False).first()
ColA ColB ColC
0 zebra 1 10
1 zebra 2 10
2 apple 3 20
3 apple 4 20
Versus
df.groupby(['ColA', 'ColB'], as_index=False).first()
ColA ColB ColC
0 apple 3 20
1 apple 4 20
2 zebra 1 10
3 zebra 2 10
The ACTUAL place to look is the groupby object's ngroup method:
g1 = df.groupby(['ColA', 'ColB'], sort=False, as_index=False)
g1.ngroup()
0 0
1 1
2 2
3 3
dtype: int64
Versus
g2 = df.groupby(['ColA', 'ColB'], as_index=False)
g2.ngroup()
0 2
1 3
2 0
3 1
dtype: int64
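Since the application in the question consumes a groupby object rather than the groups dict, another way to check the order it will see is to iterate over the object. This is a minimal sketch, assuming the same df as in the question; the names keys_unsorted and keys_sorted are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['zebra', 1, 10],
        ['zebra', 2, 10],
        ['apple', 3, 20],
        ['apple', 4, 20],
    ],
    columns=['ColA', 'ColB', 'ColC'],
)

# Iterating a groupby object yields (key, sub-frame) pairs.
# Collect only the keys to see the order in which groups come out.
keys_unsorted = [key for key, _ in df.groupby(['ColA', 'ColB'], sort=False)]
keys_sorted = [key for key, _ in df.groupby(['ColA', 'ColB'])]

print(keys_unsorted)  # expected to follow first appearance (zebra first)
print(keys_sorted)    # expected to follow sorted key order (apple first)
```

If the first list starts with the zebra groups, the groupby object itself honours sort=False, even though the groups dict is displayed in sorted order.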
Let's use a pseudo sort key; here I create one using pd.factorize:
df.assign(sortkey=pd.factorize(df['ColA'])[0]).groupby(['sortkey', 'ColA', 'ColB']).groups
Output:
{(0, 'zebra', 1): Int64Index([0], dtype='int64'),
(0, 'zebra', 2): Int64Index([1], dtype='int64'),
(1, 'apple', 3): Int64Index([2], dtype='int64'),
(1, 'apple', 4): Int64Index([3], dtype='int64')}
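To see why the pseudo key works, it may help to look at what pd.factorize returns. This is a rough sketch, assuming the same df as in the question; the variable names codes, uniques, keyed and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['zebra', 1, 10],
        ['zebra', 2, 10],
        ['apple', 3, 20],
        ['apple', 4, 20],
    ],
    columns=['ColA', 'ColB', 'ColC'],
)

# pd.factorize assigns integer codes in order of first appearance,
# so 'zebra' gets 0 and 'apple' gets 1 regardless of alphabetical order.
codes, uniques = pd.factorize(df['ColA'])
print(list(codes))
print(list(uniques))

# Grouping with the default sort=True now sorts on the code first,
# which reproduces the original ColA order.
keyed = df.assign(sortkey=codes)
result = keyed.groupby(['sortkey', 'ColA', 'ColB'], as_index=False).first()
print(result)
```

Note that ColB is still sorted in ascending order within each ColA value; in this example that happens to match the order of appearance. If it did not, ColB would need its own pseudo key as well.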