Groupby sum, index vs. column results

For the following DataFrame:

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]}, columns=['group', 'data'])
print(df)

  group  data
0     a     5
1     a    10
2     b   100
3     b    30

When grouping by the column, summing, and assigning the result to a new column, the result is:

df['new'] = df.groupby('group')['data'].sum() 
print(df)

  group  data  new
0     a     5  NaN
1     a    10  NaN
2     b   100  NaN
3     b    30  NaN

However, if we reset df to the original data and move the group column to the index,

df.set_index('group', inplace=True)
print(df)

       data
group      
a         5
a        10
b       100
b        30

and then group and sum, we get:

df['new'] = df.groupby('group')['data'].sum() 
print(df)

       data  new
group           
a         5   15
a        10   15
b       100  130
b        30  130

Why does grouping by the column not set the values in the new column, while grouping by the index does?

It is better here to use GroupBy.transform, which returns a Series the same size as the original DataFrame, so the assignment works correctly:

df['new'] = df.groupby('group')['data'].transform('sum')
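
As a minimal sketch of the difference, the following compares the length and index of the two results on the question's data. The variable names (grouped, per_group, per_row) are illustrative only and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})

grouped = df.groupby('group')['data']

# sum() aggregates: one value per group, indexed by the group labels
per_group = grouped.sum()

# transform('sum') broadcasts each group's total back onto its rows,
# so the result carries the same index as df
per_row = grouped.transform('sum')

print(len(per_group), len(per_row))
print(per_row.index.equals(df.index))

df['new'] = per_row
print(df)
```

Because per_row shares its index with df, the assignment has a matching label for every row.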

When a Series is assigned as a new column, its values are aligned on index labels. Where the labels do not match, the result is NaN:

print(df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64

The index values are different, so the result is NaNs:

print(df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')

print(df.index)
RangeIndex(start=0, stop=4, step=1)

df.set_index('group', inplace=True)

print(df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64

The indexes can align because the values match:

print(df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')

print(df.index)
Index(['a', 'a', 'b', 'b'], dtype='object', name='group')
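
The two cases can be put side by side in a small sketch. It uses Series.reindex as a rough stand-in for the label alignment that happens on column assignment, which is an assumption made for illustration rather than something stated in the original post. The names sums, indexed, on_range and on_group are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})
sums = df.groupby('group')['data'].sum()  # indexed by 'a', 'b'

# Default RangeIndex: none of the labels 0..3 appear in sums.index,
# so every looked-up value is missing
on_range = sums.reindex(df.index)

# 'group' as the index: each row label is looked up in sums.index
indexed = df.set_index('group')
on_group = sums.reindex(indexed.index)

print(on_range)
print(on_group)
```

The second lookup succeeds even though the target index has repeated labels, because the labels in sums are unique.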

You're not getting what you want because df.groupby('group')['data'].sum() returns an aggregated result with group as the index:

group
a     15
b    130
Name: data, dtype: int64

Clearly, this index does not align with the default index of the original DataFrame.

For this to work you have to use transform, which returns a Series of transformed values with the same axis length as the original:

df['new'] = df.groupby('group')['data'].transform('sum')
print(df)

  group  data  new
0     a     5   15
1     a    10   15
2     b   100  130
3     b    30  130
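
If the aggregated Series is already at hand, another option is to look up each row's group label in it with Series.map. This is a sketch of an alternative that neither answer proposes, and the name sums is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})

# Aggregated totals, indexed by the group labels 'a' and 'b'
sums = df.groupby('group')['data'].sum()

# Map each row's group label to its total; the result keeps df's own index
df['new'] = df['group'].map(sums)
print(df)
```

This keeps group as an ordinary column and avoids moving it into the index just to make the labels line up.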
