In Pandas, after groupby the grouped column is gone

I have the following dataframe named ttm:

    usersidid   clienthostid    eventSumTotal   LoginDaysSum    score
0       12          1               60              3           1728
1       11          1               240             3           1331
3       5           1               5               3           125
4       6           1               16              2           216
2       10          3               270             3           1000
5       8           3               18              2           512

When I do

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

I get what I expected (though I would have wanted the results to be under a new label named 'ratio'):

       clienthostid  LoginDaysSum
0             1          4
1             3          2

But when I do

ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

I get:

0    1.0
1    1.5
  1. Why did the labels go? I still need the grouping column 'clienthostid', and I also need the results of the apply to be under a label.
  2. Sometimes when I do a groupby, some of the other columns still appear. Why do columns sometimes disappear and sometimes stay? Is there a flag I'm missing that controls this?
  3. In the example I gave, when I did count, the results showed up under the label 'LoginDaysSum'. Is there a way to add a new label for the results instead?

Thank you,

To get a DataFrame back after groupby, there are two possible solutions:

  1. The parameter as_index=False, which works nicely with the count, sum, and mean functions.

  2. reset_index, which creates new columns from the levels of the index; this is the more general solution.

df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2
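
The question also asks for the count to appear under a new label rather than 'LoginDaysSum'. One way to do that, as a minimal sketch assuming a pandas version with named aggregation (0.25 or later), is to pass a keyword to agg. The frame is rebuilt from the values in the question, and the variable name counts is only illustrative:

```python
import pandas as pd

# Rebuild the question's ttm frame (values and index copied from the question).
ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Named aggregation: new_column_name=(source_column, aggregation).
# as_index=False keeps clienthostid as a regular column.
counts = ttm.groupby("clienthostid", as_index=False, sort=False).agg(
    ratio=("LoginDaysSum", "count")
)
print(counts)
```

The keyword on the left of the tuple becomes the output column name, so no separate rename step is needed.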

For the second case, remove as_index=False and add reset_index instead:

#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64

print (type(a))
<class 'pandas.core.series.Series'>

print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')


df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
   clienthostid  ratio
0             1    1.0
1             3    1.5
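
For anyone who wants to run this end to end, here is a self-contained sketch of the same apply plus reset_index(name='ratio') approach. The frame is rebuilt from the question's data; the helper name first_over_second is hypothetical and only stands in for the lambda:

```python
import pandas as pd

# Rebuild the question's ttm frame (values and index copied from the question).
ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)


def first_over_second(s):
    """Divide the first value of a group by its second value."""
    return s.iloc[0] / s.iloc[1]


# The group key ends up in the index of the resulting Series;
# reset_index(name=...) moves it back to a column and labels the values.
df1 = (
    ttm.groupby("clienthostid", sort=False)["LoginDaysSum"]
    .apply(first_over_second)
    .reset_index(name="ratio")
)
print(df1)
```

Note that x.iloc[1] raises an IndexError for any group with fewer than two rows, so real data may need a guard for that case.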

Why are some columns gone?

I think the cause may be automatic exclusion of nuisance columns:

#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
  usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
0      12aa             1             60             3   1728
1      11aa             1            240             3   1331
3       5aa             1              5             3    125
4       6aa             1             16             2    216
2      10aa             3            270             3   1000
5       8aa             3             18             2    512

#the str column usersidid is removed from the result
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
              eventSumTotal  LoginDaysSum  score
clienthostid                                    
1                       321            11   3400
3                       288             5   1512
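
As far as I know, newer pandas releases (2.0 and later) no longer drop such columns silently, so the output above may differ depending on your version. A sketch that does not depend on the silent behaviour is to ask for numeric columns explicitly with numeric_only=True. The frame is rebuilt from the question's data:

```python
import pandas as pd

# Rebuild the question's ttm frame (values and index copied from the question).
ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Turn usersidid into a string column, as in the answer above.
ttm["usersidid"] = ttm["usersidid"].astype(str) + "aa"

# numeric_only=True states explicitly that non-numeric columns are skipped.
a = ttm.groupby("clienthostid", sort=False).sum(numeric_only=True)
print(a)
```

Being explicit here makes it clear to a reader why usersidid is missing from the result.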

What is the difference between size and count in pandas?

count is a built-in method of the groupby object, and pandas knows what to do with it. Two other things you specified also determine what the output looks like.

#                         For a built in method, when
#                         you don't want the group column
#                         as the index, pandas keeps it in
#                         as a column.
#                             |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

   clienthostid  LoginDaysSum
0             1             4
1             3             2

#                         For a built in method, when
#                         you do want the group column
#                         as the index, then...
#                             |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
#                                                       |-----||||-----|
#                                                 the single brackets tell
#                                                 pandas to operate on a series
#                                                 in this case, count the series

clienthostid
1    4
3    2
Name: LoginDaysSum, dtype: int64

ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
#                                                       |------||||------|
#                                             the double brackets tell pandas
#                                                to operate on the dataframe
#                                              specified by these columns and will
#                                                return a dataframe

              LoginDaysSum
clienthostid              
1                        4
3                        2

When you use apply, pandas no longer knows what to do with the group column when you say as_index=False. It has to trust that if you use apply, you want returned exactly what you say to return, so it just throws the group column away. Also, you have single brackets around your column, which says to operate on a series. Instead, use as_index=True to keep the grouping column information in the index. Then follow it up with a reset_index to transfer it from the index back into the dataframe. At this point, it will not have mattered that you used single brackets, because after the reset_index you'll have a dataframe again.

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

clienthostid
1    1.0
3    1.5
Name: LoginDaysSum, dtype: float64

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()

   clienthostid  LoginDaysSum
0             1           1.0
1             3           1.5

Reading the groupby documentation, I found out that automatic exclusion of columns after groupby is usually caused by the presence of null values in the excluded columns.

Try filling the nulls with some value.

Like this:

df.fillna('')

You simply need this instead:

ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()

The double brackets [[]] turn the output into a pd.DataFrame instead of a pd.Series.
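
To illustrate the double-bracket idea, here is a hedged sketch that leaves as_index at its default and relies only on reset_index to restore the group column. I am not certain as_index=False behaves the same across pandas versions when combined with apply, so this variant avoids it. The frame is rebuilt from the question's data and the name out is only illustrative:

```python
import pandas as pd

# Rebuild the question's ttm frame (values and index copied from the question).
ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Double brackets select a one-column DataFrame per group, so x.iloc[0] and
# x.iloc[1] are rows (Series) and the division is done column by column.
out = (
    ttm.groupby("clienthostid", sort=False)[["LoginDaysSum"]]
    .apply(lambda x: x.iloc[0] / x.iloc[1])
    .reset_index()
)
print(out)
```

Because each group returns a row labelled by column name, the result keeps 'LoginDaysSum' as the column label, and reset_index brings 'clienthostid' back as a column.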
