Groupby sum, index vs. column results

Question

For the following DataFrame:

import pandas as pd
df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]}, columns=['group', 'data'])
print(df)

  group  data
0     a     5
1     a    10
2     b   100
3     b    30

When grouping by the column, summing, and assigning the result to a new column, the result is:

df['new'] = df.groupby('group')['data'].sum() 
print(df)

  group  data  new
0     a     5  NaN
1     a    10  NaN
2     b   100  NaN
3     b    30  NaN

However, if we reset df to the original data and move the group column to the index,

df.set_index('group', inplace=True)
print(df)

       data
group      
a         5
a        10
b       100
b        30

and then group and sum, we get:

df['new'] = df.groupby('group')['data'].sum() 
print(df)

       data  new
group           
a         5   15
a        10   15
b       100  130
b        30  130

Why does grouping by the column not set the values in the new column, while grouping by the index does?

Answer 1

A better approach here is to use GroupBy.transform, which returns a Series the same size as the original DataFrame, so the assignment works correctly:

df['new'] = df.groupby('group')['data'].transform('sum')

This is because when you assign a Series to a column, its values are aligned by index. Where the index labels do not match, you get NaN:

print (df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64

The index values differ, so you get NaN:

print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')

print (df.index)
RangeIndex(start=0, stop=4, step=1)

df.set_index('group', inplace=True)

print (df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64

Here the indexes can align, because the values match:

print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')

print (df.index)
Index(['a', 'a', 'b', 'b'], dtype='object', name='group')
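
The alignment described above can be reproduced without assigning anything. This is a minimal sketch, assuming that column assignment behaves roughly like reindexing the Series to the DataFrame's index first; the variable names (sums, indexed, aligned_to_range, aligned_to_group) are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})

# One value per group, indexed by the group labels 'a' and 'b'.
sums = df.groupby('group')['data'].sum()

# Default RangeIndex: the labels 0..3 do not occur in sums.index,
# so every position is filled with NaN.
aligned_to_range = sums.reindex(df.index)

# 'group' as the index: the labels 'a' and 'b' are found in sums.index,
# and each sum is repeated for every matching row.
indexed = df.set_index('group')
aligned_to_group = sums.reindex(indexed.index)

print(aligned_to_range)
print(aligned_to_group)
```

The first print should show only NaN and the second should repeat each group's sum, which matches the two results in the question.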

Answer 2

You're not getting what you want because df.groupby('group')['data'].sum() returns an aggregated result with group as the index:

group
a     15
b    130
Name: data, dtype: int64

Clearly, this index is not aligned with the original DataFrame's index.

If you want this to work, you'll have to use transform, which returns a Series of transformed values with the same axis length as self:

df['new'] = df.groupby('group')['data'].transform('sum')

  group  data  new
0     a     5   15
1     a    10   15
2     b   100  130
3     b    30  130
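
If the aggregated Series is needed anyway, another option (not part of either answer, shown only as a sketch) is to look up each row's group label in it with Series.map. The column names new_map and new_transform are made up for the comparison:

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})

# Aggregate once: one sum per group, indexed by group label.
sums = df.groupby('group')['data'].sum()

# Look up each row's group label in the aggregated Series.
# The result keeps df's own index, so the assignment aligns.
df['new_map'] = df['group'].map(sums)

# transform broadcasts each group's sum back onto the original index.
df['new_transform'] = df.groupby('group')['data'].transform('sum')

print(df)
```

Both columns should hold the same values. transform is the more direct choice when only the broadcast result is wanted.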
