In Pandas, after groupby the grouped column is gone
I have the following dataframe named ttm:
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12 1 60 3 1728
1 11 1 240 3 1331
3 5 1 5 3 125
4 6 1 16 2 216
2 10 3 270 3 1000
5 8 3 18 2 512
When I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
I get what I expected (though I would have wanted the results to be under a new label named 'ratio'):
clienthostid LoginDaysSum
0 1 4
1 3 2
But when I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
I get:
0 1.0
1 1.5
Thank you,
To return a DataFrame after groupby, there are 2 possible solutions:
1. The parameter as_index=False, which works nicely with the count, sum and mean functions.
2. reset_index, which creates new columns from the levels of the index; this is the more general solution.
df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
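The two variants above can be reproduced end to end with the sketch below. It assumes the frame from the question is rebuilt by hand (values and index order copied from the printed table); the names by_param and by_reset are only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question (index order kept as printed).
ttm = pd.DataFrame(
    {
        'usersidid': [12, 11, 5, 6, 10, 8],
        'clienthostid': [1, 1, 1, 1, 3, 3],
        'eventSumTotal': [60, 240, 5, 16, 270, 18],
        'LoginDaysSum': [3, 3, 3, 2, 3, 2],
        'score': [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Solution 1: keep the group key as a column via as_index=False.
by_param = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

# Solution 2: let the key become the index, then move it back with reset_index.
by_reset = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()

print(by_param)
print(by_reset)
```

Both frames should carry the same two columns, clienthostid and LoginDaysSum.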
For the second case, remove as_index=False and add reset_index instead:
#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
print (type(a))
<class 'pandas.core.series.Series'>
print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')
df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
clienthostid ratio
0 1 1.0
1 3 1.5
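The question also asked for the count to appear under a label named 'ratio'. A minimal sketch, assuming the same reset_index(name=...) call is applied to the count Series; the frame here is a trimmed copy of the question's data with only the two columns involved.

```python
import pandas as pd

# Trimmed copy of the question's data: only the columns used here.
ttm = pd.DataFrame(
    {
        'clienthostid': [1, 1, 1, 1, 3, 3],
        'LoginDaysSum': [3, 3, 3, 2, 3, 2],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# count() returns a Series indexed by clienthostid;
# reset_index(name=...) turns it into a DataFrame and names the value column.
counts = (
    ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum']
    .count()
    .reset_index(name='ratio')
)
print(counts)
```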
Why are some columns gone?
I think the problem may be the automatic exclusion of nuisance columns:
#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12aa 1 60 3 1728
1 11aa 1 240 3 1331
3 5aa 1 5 3 125
4 6aa 1 16 2 216
2 10aa 3 270 3 1000
5 8aa 3 18 2 512
#the str column usersidid is removed from the result
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
eventSumTotal LoginDaysSum score
clienthostid
1 321 11 3400
3 288 5 1512
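Whether a string column is dropped silently depends on the pandas version, so it may be safer not to rely on it. The sketch below shows two ways to state the intent explicitly, assuming a pandas version where groupby sum accepts numeric_only: either pass numeric_only=True or select the columns before aggregating. The names explicit and selected are only illustrative.

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        'usersidid': [12, 11, 5, 6, 10, 8],
        'clienthostid': [1, 1, 1, 1, 3, 3],
        'eventSumTotal': [60, 240, 5, 16, 270, 18],
        'LoginDaysSum': [3, 3, 3, 2, 3, 2],
        'score': [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Turn usersidid into a string column, as in the answer above.
ttm['usersidid'] = ttm['usersidid'].astype(str) + 'aa'

# Option 1: ask for numeric columns only.
explicit = ttm.groupby(['clienthostid'], sort=False).sum(numeric_only=True)

# Option 2: pick the columns to aggregate by name.
cols = ['eventSumTotal', 'LoginDaysSum', 'score']
selected = ttm.groupby(['clienthostid'], sort=False)[cols].sum()

print(explicit)
print(selected)
```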
What is the difference between size and count in pandas?
count is a built-in method of the groupby object, and pandas knows what to do with it. Two other things you specify determine what the output looks like: the as_index argument and whether the column is selected with single or double brackets.
# For a built in method, when
# you don't want the group column
# as the index, pandas keeps it in
# as a column.
# |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
clienthostid LoginDaysSum
0 1 4
1 3 2
# For a built in method, when
# you do want the group column
# as the index, then...
# |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
# |-----||||-----|
# the single brackets tells
# pandas to operate on a series
# in this case, count the series
clienthostid
1 4
3 2
Name: LoginDaysSum, dtype: int64
ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
# |------||||------|
# the double brackets tells pandas
# to operate on the dataframe
# specified by these columns and will
# return a dataframe
LoginDaysSum
clienthostid
1 4
3 2
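The bracket distinction above can be checked directly by looking at the type of each result. A small sketch on a trimmed copy of the question's data; as_series and as_frame are illustrative names.

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        'clienthostid': [1, 1, 1, 1, 3, 3],
        'LoginDaysSum': [3, 3, 3, 2, 3, 2],
    },
    index=[0, 1, 3, 4, 2, 5],
)

grouped = ttm.groupby(['clienthostid'], sort=False)

# Single brackets select one column, so count() returns a Series.
as_series = grouped['LoginDaysSum'].count()

# Double brackets select a list of columns, so count() returns a DataFrame.
as_frame = grouped[['LoginDaysSum']].count()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```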
When you used apply, pandas no longer knows what to do with the group column when you say as_index=False. It has to trust that if you use apply you want returned exactly what you say to return, so it will just throw it away. Also, you have single brackets around your column, which says to operate on a series. Instead, use as_index=True to keep the grouping column information in the index. Then follow it up with a reset_index to transfer it from the index back into the dataframe. At this point, it will not have mattered that you used single brackets, because after the reset_index you'll have a dataframe again.
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
clienthostid LoginDaysSum
0 1 1.0
1 3 1.5
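One caveat with the lambda above: x.iloc[1] raises an IndexError for any group that has only one row. The sketch below is one possible guard, not part of the original answer. The helper first_over_second and the extra single-row group 7 are hypothetical and added only to show the edge case.

```python
import numpy as np
import pandas as pd

# Question's data plus a hypothetical single-row group (clienthostid 7).
ttm = pd.DataFrame(
    {
        'clienthostid': [1, 1, 1, 1, 3, 3, 7],
        'LoginDaysSum': [3, 3, 3, 2, 3, 2, 4],
    }
)

def first_over_second(s):
    """Return first value / second value, or NaN if the group is too short."""
    if len(s) < 2:
        return np.nan
    return s.iloc[0] / s.iloc[1]

ratio = (
    ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum']
    .apply(first_over_second)
    .reset_index(name='ratio')
)
print(ratio)
```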
Reading the groupby documentation, I found out that the automatic exclusion of columns after groupby is usually caused by the presence of null values in the excluded columns.
Try filling the nulls with some value.
Like this:
df.fillna('')
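Before filling anything, it may help to check which columns actually contain nulls and what dtype they have. The sketch below uses a small hypothetical frame (key, value and label are made-up column names). Note that fillna returns a new object by default, so the result has to be assigned back.

```python
import pandas as pd

# Hypothetical frame with a null in a numeric and in a text column.
df = pd.DataFrame(
    {
        'key': [1, 1, 2],
        'value': [1.0, None, 3.0],
        'label': ['a', None, 'c'],
    }
)

# Inspect dtypes and null counts per column.
print(df.dtypes)
null_counts = df.isna().sum()
print(null_counts)

# Fill only the text column; fillna does not modify df in place.
filled = df.copy()
filled['label'] = filled['label'].fillna('')
print(filled)
```

Filling a numeric column with '' would turn it into a non-numeric column, so a numeric placeholder is probably a better choice there.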
You simply need this instead:
ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
The double [[]] will turn the output into a pd.DataFrame instead of a pd.Series.
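A self-contained sketch of the double-bracket variant follows. It assumes the default as_index=True followed by reset_index, since how as_index=False combines with apply may differ between pandas versions; the name result is illustrative.

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        'clienthostid': [1, 1, 1, 1, 3, 3],
        'LoginDaysSum': [3, 3, 3, 2, 3, 2],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# With double brackets each group is a one-column DataFrame, so
# x.iloc[0] / x.iloc[1] divides two rows and yields a Series per group.
# apply stacks those Series into a DataFrame indexed by clienthostid.
result = (
    ttm.groupby(['clienthostid'], sort=False)[['LoginDaysSum']]
    .apply(lambda x: x.iloc[0] / x.iloc[1])
    .reset_index()
)
print(result)
```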