In pandas, the grouping column disappears after groupby

Question

I have the following DataFrame named ttm:

    usersidid   clienthostid    eventSumTotal   LoginDaysSum    score
0       12          1               60              3           1728
1       11          1               240             3           1331
3       5           1               5               3           125
4       6           1               16              2           216
2       10          3               270             3           1000
5       8           3               18              2           512

When I do

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

I get what I expected (though I would have liked the result to be under a new label named 'ratio'):

       clienthostid  LoginDaysSum
0             1          4
1             3          2

But when I do

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

I get:

0    1.0
1    1.5

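Why did the label disappear? I still need the grouping column 'clienthostid', and I also need the result of the apply to be under a label.
Sometimes when I do a groupby, some of the other columns still appear. Why do columns sometimes disappear and sometimes stay? Is there a flag I am missing that controls this?
In the example I gave, the count result is shown under the label 'LoginDaysSum'. Is there a way to put the result under a new label instead?

Thanks,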

Answer 1

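There are two possible ways to get a DataFrame back after groupby:

1. The as_index=False parameter, which works well with the count, sum, and mean functions
2. reset_index, which creates new columns from the index levels; this is the more general solution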

df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2

df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2
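The question also asks for the count to appear under a new label. A minimal sketch of two ways to do that, assuming the sample data from the question; the column name login_count is only an illustrative choice, not something from the original post:

```python
import pandas as pd

# Sample data from the question (index order as shown there).
ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Option 1: keep the group key in the index, then move it back into a
# column and name the value column in the same step.
by_reset = (
    ttm.groupby("clienthostid", sort=False)["LoginDaysSum"]
    .count()
    .reset_index(name="login_count")  # "login_count" is a hypothetical label
)

# Option 2: use as_index=False and rename the resulting column afterwards.
by_rename = (
    ttm.groupby("clienthostid", as_index=False, sort=False)["LoginDaysSum"]
    .count()
    .rename(columns={"LoginDaysSum": "login_count"})
)

print(by_reset)
print(by_rename)
```

Both variants should give a DataFrame with the columns clienthostid and login_count.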

For the second case (apply), remove as_index=False and add reset_index instead:

#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64

print (type(a))
<class 'pandas.core.series.Series'>

print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')


df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
   clienthostid  ratio
0             1    1.0
1             3    1.5
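As an alternative that is not part of this answer, named aggregation can keep the grouping column and label the result in one call. This is only a sketch under the assumption that every group has at least two rows; the helper name first_over_second is made up for illustration:

```python
import pandas as pd

# Only the two columns needed for the ratio, values taken from the question.
ttm = pd.DataFrame(
    {
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
    }
)


def first_over_second(s):
    """Ratio of the first to the second value of a group.

    Hypothetical helper; raises IndexError for groups with fewer than 2 rows.
    """
    return s.iloc[0] / s.iloc[1]


# Named aggregation: new_column=(source_column, function).
# as_index=False keeps clienthostid as a regular column.
ratio = ttm.groupby("clienthostid", as_index=False, sort=False).agg(
    ratio=("LoginDaysSum", first_over_second)
)
print(ratio)
```

Because agg expects each function to reduce a group to a single value, pandas knows where the grouping column belongs, which is not the case with a free-form apply.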

Why are some columns missing?

I think the cause may be the automatic exclusion of nuisance columns:

#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
  usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
0      12aa             1             60             3   1728
1      11aa             1            240             3   1331
3       5aa             1              5             3    125
4       6aa             1             16             2    216
2      10aa             3            270             3   1000
5       8aa             3             18             2    512

#the str column usersidid is removed
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
              eventSumTotal  LoginDaysSum  score
clienthostid                                    
1                       321            11   3400
3                       288             5   1512
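Whether a non-numeric column is dropped silently depends on the pandas version; as far as I know, recent releases no longer drop such columns automatically. A sketch that does not rely on that behavior, by stating explicitly which columns to aggregate (the list name numeric_cols is just an illustrative choice):

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Turn usersidid into a string column, as in the answer above.
ttm["usersidid"] = ttm["usersidid"].astype(str) + "aa"

# Variant 1: select the columns to sum explicitly.
numeric_cols = ["eventSumTotal", "LoginDaysSum", "score"]
totals = ttm.groupby("clienthostid", sort=False)[numeric_cols].sum()

# Variant 2: ask pandas to skip non-numeric columns.
totals_numeric_only = ttm.groupby("clienthostid", sort=False).sum(numeric_only=True)

print(totals)
print(totals_numeric_only)
```

Either way the string column is left out on purpose rather than as a side effect.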

Related question: What is the difference between size and count in pandas?

Answer 2

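count is a built-in method of the groupby object, and pandas knows what to do with it. Two other things you specify also determine what the output looks like.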

#                         For a built in method, when
#                         you don't want the group column
#                         as the index, pandas keeps it in
#                         as a column.
#                             |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

   clienthostid  LoginDaysSum
0             1             4
1             3             2

#                         For a built in method, when
#                         you do want the group column
#                         as the index, then...
#                             |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
#                                                       |-----||||-----|
#                                                 the single brackets tell
#                                                 pandas to operate on a series
#                                                 in this case, count the series

clienthostid
1    4
3    2
Name: LoginDaysSum, dtype: int64

ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
#                                                       |------||||------|
#                                             the double brackets tell pandas
#                                                to operate on the dataframe
#                                              specified by these columns and will
#                                                return a dataframe

              LoginDaysSum
clienthostid              
1                        4
3                        2
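The single versus double bracket difference can be checked directly. A small self-contained sketch, using only the two relevant columns from the question:

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
    }
)

grouped = ttm.groupby("clienthostid", sort=False)

# Single brackets select one column as a Series per group.
as_series = grouped["LoginDaysSum"].count()

# Double brackets select a one-column DataFrame per group.
as_frame = grouped[["LoginDaysSum"]].count()

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

In both cases the group key ends up in the index, named clienthostid, because as_index defaults to True.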

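When you use apply together with as_index=False, pandas no longer knows what to do with the group column. It has to trust that, if you use apply, you want back exactly what your function returns, so it simply throws the group column away. Also, you have single brackets around your column, which means you operate on a Series. Instead, use as_index=True to keep the grouping column information in the index. Then follow it up with reset_index to move it from the index back into the DataFrame. At that point it no longer matters that you used single brackets, because after reset_index you have a DataFrame again.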

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()

   clienthostid  LoginDaysSum
0             1           1.0
1             3           1.5

Answer 3

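Reading the groupby documentation, I found that automatic exclusion of columns after groupby is usually caused by null values in the excluded columns.

Try filling the nulls with some value.

Like this: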

df.fillna('')
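The answer does not say which situation it has in mind. One case where nulls certainly affect a groupby result is a missing value in the grouping key: those rows are left out by default. A sketch with made-up data (the column names key and value and the placeholder "missing" are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: one row has a missing grouping key.
df = pd.DataFrame(
    {
        "key": ["a", np.nan, "a", "b"],
        "value": [1, 2, 3, 4],
    }
)

# By default the row with the NaN key is excluded from the result.
dropped = df.groupby("key")["value"].sum()

# Filling only the key column keeps that row as its own group and
# leaves the numeric column untouched.
filled = df.fillna({"key": "missing"}).groupby("key")["value"].sum()

print(dropped)
print(filled)
```

Filling a single column this way avoids turning numeric columns into object columns, which a blanket df.fillna('') can do.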

Answer 4

You only need this:

ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()

The double brackets [[ ]] make the output a pd.DataFrame instead of a pd.Series.

4 answers

Answer 1
23 accepted 2017-01-15 06:34:21

Answer 2
8 2017-01-15 08:10:25

Answer 3
1 2020-02-02 10:39:05

Answer 4
0 2020-11-19 19:02:03
