Python pandas groupby aggregate on multiple columns, then pivot
In Python, I have a pandas DataFrame similar to the following:
Item  | shop1 | shop2 | shop3 | Category
------------------------------------------
Shoes | 45    | 50    | 53    | Clothes
TV    | 200   | 300   | 250   | Technology
Book  | 20    | 17    | 21    | Books
phone | 300   | 350   | 400   | Technology
Here shop1, shop2 and shop3 are the cost of each item in different shops. Now, after some data cleaning, I need to return a DataFrame like this one:
Category (index)| size| sum| mean | std
----------------------------------------
where size is the number of items in each Category, and sum, mean and std are the corresponding functions applied across the 3 shops. How can I do these operations with the split-apply-combine pattern (groupby, aggregate, apply, ...)?
Can someone help me out? I'm going crazy with this one... thank you!
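For anyone who wants to reproduce the answers, here is a minimal sketch that rebuilds the sample table as a DataFrame. The variable name df matches what the answers use; the integer dtypes are an assumption, since the question only shows printed values.

```python
import pandas as pd

# Sample data copied from the table in the question.
# Assumption: the shop columns hold plain integers.
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

print(df)
```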
Edited for pandas 0.22+, considering the deprecation of dictionaries for renaming in a groupby aggregation.
We set up a very similar dictionary, where the keys specify our functions and the dictionary itself is used to rename the resulting columns.
rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
# Pass a plain list of function names; a dict_keys view may not be accepted by agg in newer pandas
df.set_index(['Category', 'Item']).stack().groupby('Category') \
    .agg(list(rnm_cols)).rename(columns=rnm_cols)
            Size   Sum        Mean        Std
Category
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
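It may help to see why stacking comes first. The sketch below, which rebuilds the question's sample data so it runs on its own, shows that set_index followed by stack turns the three shop columns into one long Series, so that a single groupby pools every price in a category. The names stacked and result are illustrative only.

```python
import pandas as pd

# Sample data from the question (integer dtypes are an assumption).
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

# After stack(), each price sits on its own row, indexed by
# (Category, Item, shop column name).
stacked = df.set_index(["Category", "Item"]).stack()
print(stacked)

# Grouping on the Category index level pools the prices of all shops.
result = stacked.groupby(level="Category").agg(["size", "sum", "mean", "std"])
print(result)
```

Note that size here counts prices rather than items, which is why Technology shows 6 (two items times three shops).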
Option 1
Use agg. This is the original dictionary-renaming form that the edit above replaces:
agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)
                  Std   Sum        Mean  Size
Category
Books        2.081666    58   19.333333     3
Clothes      4.041452   148   49.333333     3
Technology  70.710678  1800  300.000000     6
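On recent pandas versions, the same "new name to function" mapping can be kept by unpacking it into keyword arguments, which pandas calls named aggregation. This is a sketch that assumes pandas 0.25 or later; it is not part of the original answer.

```python
import pandas as pd

# Sample data from the question (integer dtypes are an assumption).
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

# Same mapping as in Option 1: output column name -> aggregation function.
agg_funcs = dict(Size="size", Sum="sum", Mean="mean", Std="std")

stacked = df.set_index(["Category", "Item"]).stack()

# Unpacking the dict passes Size="size", Sum="sum", ... as keyword arguments,
# so the columns come out already renamed and in the order given.
result = stacked.groupby(level=0).agg(**agg_funcs)
print(result)
```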
Option 2
More for less: use describe.
df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()
            count        mean        std    min    25%    50%    75%    max
Category
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0
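A caveat for current pandas: to my knowledge, since pandas 0.20 a grouped describe() already returns one column per statistic, so the trailing unstack() would reshape the table into a long Series instead. A sketch under that assumption, rebuilding the sample data so it runs on its own:

```python
import pandas as pd

# Sample data from the question (integer dtypes are an assumption).
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

stacked = df.set_index(["Category", "Item"]).stack()

# No unstack() here: describe() on the grouped Series is assumed to return
# a DataFrame with one row per Category and one column per statistic.
desc = stacked.groupby(level=0).describe()
print(desc)
```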
df.groupby('Category').agg({
    'Item': 'size',
    'shop1': ['sum', 'mean', 'std'],
    'shop2': ['sum', 'mean', 'std'],
    'shop3': ['sum', 'mean', 'std'],
})
Or, if you want it across all shops:
df1 = df.set_index(['Item','Category']).stack().reset_index().rename(columns={'level_2':'Shops',0:'costs'})
df1.groupby('Category').agg({'Item':'size','costs':['sum','mean','std']})
If I understand correctly, you want to calculate aggregate metrics for all shops together, not for each shop individually. To do that, you can first stack your dataframe and then group by Category:
stacked = df.set_index(['Item', 'Category']).stack().reset_index()
stacked.columns = ['Item', 'Category', 'Shop', 'Price']
stacked.groupby('Category').agg({'Price':['count','sum','mean','std']})
Which results in:
           Price
           count   sum        mean        std
Category
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
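The result above carries a two-level column header (Price on top). If flat columns named as in the question are preferred, one possible variation is named aggregation with (column, function) tuples. This is a sketch assuming pandas 0.25 or later, not the answerer's code; the output names size, sum, mean and std are taken from the question's target layout.

```python
import pandas as pd

# Sample data from the question (integer dtypes are an assumption).
df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

# Same reshaping as in the answer: one row per (Item, Category, Shop).
stacked = df.set_index(["Item", "Category"]).stack().reset_index()
stacked.columns = ["Item", "Category", "Shop", "Price"]

# Each keyword is an output column; the tuple is (input column, function).
result = stacked.groupby("Category").agg(
    size=("Price", "size"),
    sum=("Price", "sum"),
    mean=("Price", "mean"),
    std=("Price", "std"),
)
print(result)
```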