Groupby sum, index vs. column results
For the following DataFrame:
import pandas as pd

df = pd.DataFrame({'group':['a','a','b','b'], 'data':[5,10,100,30]},columns=['group', 'data'])
print(df)
  group  data
0     a     5
1     a    10
2     b   100
3     b    30
When grouping by the column, summing, and assigning the result to a new column, the result is:
df['new'] = df.groupby('group')['data'].sum()
print(df)
  group  data  new
0     a     5  NaN
1     a    10  NaN
2     b   100  NaN
3     b    30  NaN
However, if we reset df to the original data and move the group column to the index,
df.set_index('group', inplace=True)
print(df)
       data
group
a         5
a        10
b       100
b        30
and then group and sum, we get:
df['new'] = df.groupby('group')['data'].sum()
print(df)
       data  new
group
a         5   15
a        10   15
b       100  130
b        30  130
Why does grouping by the column not fill in the values of the new column, while grouping by the index does?
A better approach here is GroupBy.transform, which returns a Series the same size as the original DataFrame, so the assignment works correctly:
df['new'] = df.groupby('group')['data'].transform('sum')
This is because when you assign a Series as a new column, its values are aligned by index. Where the index values differ, you get NaN:
print (df.groupby('group')['data'].sum())
group
a 15
b 130
Name: data, dtype: int64
The index values are different, so you get NaNs:
print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')
print (df.index)
RangeIndex(start=0, stop=4, step=1)
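To make the alignment step visible, here is a minimal sketch that rebuilds the original frame and compares the assignment with an explicit reindex. The names `sums` and `aligned` are illustrative only, and the reindex is used here as a stand-in for what the assignment does, not as a claim about pandas internals.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})

# Aggregated result: one row per group, indexed by the labels 'a' and 'b'.
sums = df.groupby('group')['data'].sum()

# Looking up the frame's labels (0, 1, 2, 3) in sums finds nothing,
# so every position comes back as NaN.
aligned = sums.reindex(df.index)

# Assigning the Series as a column aligns on index labels in the same way.
df['new'] = sums

print(aligned.isna().all())
print(df['new'].isna().all())
```

Both checks should report that every value is missing, since none of the labels 0 to 3 exist in the index of the aggregated Series.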
df.set_index('group', inplace=True)
print (df.groupby('group')['data'].sum())
group
a 15
b 130
Name: data, dtype: int64
Now the indexes can align, because the values match:
print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')
print (df.index)
Index(['a', 'a', 'b', 'b'], dtype='object', name='group')
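The matching case can be sketched the same way. This example assumes a fresh copy of the original data; `indexed` and `sums` are illustrative names that do not appear in the thread.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})

# Move the group column into the index: labels become 'a', 'a', 'b', 'b'.
indexed = df.set_index('group')

# The aggregated Series is indexed by 'a' and 'b'.
sums = indexed.groupby('group')['data'].sum()

# Every row label of the frame exists in sums, so each row
# picks up the total for its own group.
indexed['new'] = sums

print(indexed['new'].tolist())
```

Each row label is found in the aggregated Series, so the total is repeated for every row of the group instead of coming back as NaN.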
You're not getting what you want because df.groupby('group')['data'].sum() returns an aggregated result with group as the index:
group
a 15
b 130
Name: data, dtype: int64
Its index ('a', 'b') clearly does not align with the DataFrame's index (0 to 3).
If you want this to work, you'll have to use transform, which returns a Series of the transformed values with the same axis length as self (the original object):
df['new'] = df.groupby('group')['data'].transform('sum')
  group  data  new
0     a     5   15
1     a    10   15
2     b   100  130
3     b    30  130
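As a self-contained check, the sketch below runs the transform approach on the original data and confirms that the result shares the frame's index. It also shows one alternative that is not mentioned in the answers above: looking up each row's group label in the aggregated Series with Series.map. The column name `new_map` is made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})

# transform returns one value per original row, indexed like df,
# so the assignment aligns without any NaN.
df['new'] = df.groupby('group')['data'].transform('sum')

# Alternative (an assumption of this sketch, not from the thread):
# map each row's group label onto the aggregated totals.
sums = df.groupby('group')['data'].sum()
df['new_map'] = df['group'].map(sums)

print(df)
```

Both columns should hold the per-group totals. transform is the more direct choice when the grouping key and the target frame are the same object, while map can be handy when the totals were computed elsewhere.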