简体   繁体   English

从 groupby 和多重聚合中展平分层索引的 pandas.DataFrame

[英]Flatten hierarchically indexed pandas.DataFrame from groupby and multiple aggregation

I'm grouping a dataframe by multiple columns and aggregating to obtain multiple statistics.我正在按多列对数据框进行分组并聚合以获得多个统计信息。 How to obtain a totally flat structure with each possible combination of group-keys enumerated as rows and each statistic present as columns?如何获得一个完全扁平的结构,其中每个可能的组键组合被枚举为行,每个统计数据都作为列存在?

import numpy as np
import pandas as pd

cities = ['Berlin', 'Oslo']
days = ['Monday', 'Friday']

data = pd.DataFrame({
        'city': np.random.choice(cities, 12),
        'day': np.random.choice(days, 12),
        'people': np.random.normal(loc=10, size=12),
        'cats': np.random.normal(loc=6, size=12)})
grouped = data.groupby(['city', 'day']).agg([np.mean, np.std])

This way I'm getting:这样我得到:

                   cats               people          
                   mean       std       mean       std
city   day                                            
Berlin Friday  6.146924  0.721263  10.445606  0.730992
       Monday  5.239267       NaN   9.022811       NaN
Oslo   Friday  6.322276  0.866899  11.579813  0.114341
       Monday  5.028919  0.815674  10.458439  1.182689

I need to get it flat:我需要把它弄平:

city   day     cats_mean cats_std  people_mean people_std                                       
Berlin Friday  6.146924  0.721263  10.445606   0.730992
Berlin Monday  5.239267       NaN   9.022811        NaN
Oslo   Friday  6.322276  0.866899  11.579813   0.114341
Oslo   Monday  5.028919  0.815674  10.458439   1.182689
In [36]: grouped.columns = grouped.columns.map('_'.join)

In [37]: grouped = grouped.reset_index()

In [38]: grouped
Out[38]:
     city     day  cats_mean  cats_std  people_mean  people_std
0  Berlin  Friday   5.852991  1.085163    11.078541    0.839688
1  Berlin  Monday   6.978343  0.630983     9.876106    1.846204
2    Oslo  Friday   6.096773  1.278176     9.710216    0.691672

Update using pd.NamedAgg使用 pd.NamedAgg 更新

import numpy as np
import pandas as pd

cities = ['Berlin', 'Oslo']
days = ['Monday', 'Friday']

data = pd.DataFrame({
        'city': np.random.choice(cities, 12),
        'day': np.random.choice(days, 12),
        'people': np.random.normal(loc=10, size=12),
        'cats': np.random.normal(loc=6, size=12)})
grouped = data.groupby(['city', 'day']).agg(cats_mean=('cats', np.mean),
                                            cats_std=('cats', np.std),
                                            people_mean=('people', np.mean),
                                            people_std=('people', np.std))

grouped.reset_index()

Nest renaming is deprecated不推荐使用 Nest 重命名

You can use a dictionary in .agg to rename your columns then drop column level and reset_index(): See this SO Post您可以使用 .agg 中的字典重命名您的列,然后删除列级别和 reset_index():请参阅此SO Post

import numpy as np import pandas as pd cities = ['Berlin', 'Oslo'] days = ['Monday', 'Friday'] data = pd.DataFrame({ 'city': np.random.choice(cities, 12), 'day': np.random.choice(days, 12), 'people': np.random.normal(loc=10, size=12), 'cats': np.random.normal(loc=6, size=12)}) grouped = data.groupby(['city', 'day']).agg({'cats':{'cats_mean':np.mean,'cats_std':np.std},'people':{'people_mean':np.mean,'people_std':np.std}}) grouped.columns = grouped.columns.droplevel() grouped.reset_index() --------------- import numpy as np import pandas as pdities = ['Berlin', 'Oslo'] days = ['Monday', 'Friday'] data = pd.DataFrame({ 'city': np.random.choice(cities, 12) ), 'day': np.random.choice(days, 12), 'people': np.random.normal(loc=10, size=12), 'cats': np.random.normal(loc=6, size=12)}) grouped = data.groupby(['city', 'day']).agg({'cats':{'cats_mean':np.mean,'cats_std':np.std},'people ':{'people_mean':np.mean,'people_std':np.std}}) grouped.columns = grouped.columns.droplevel() grouped.reset_index() ------------- ——

Output:输出:

 city day people_mean people_std cats_std cats_mean 0 Berlin Friday 9.645190 0.699684 0.973866 6.478510 1 Berlin Monday 9.556898 0.126810 0.336654 6.624288 2 Oslo Friday 11.593491 NaN NaN 6.206595 3 Oslo Monday 10.202183 1.058651 0.657939 6.019748

You can perform a list comprehension on the column levels and join with an underscore and then call reset_index :您可以在列级别执行列表理解并使用下划线连接,然后调用reset_index

In [39]:    
grouped.columns= ['_'.join(x) for x in list(zip(grouped.columns.get_level_values(0), grouped.columns.get_level_values(1)))]
grouped = grouped.reset_index()
grouped

Out[39]:
     city     day  cats_mean  cats_std  people_mean  people_std
0  Berlin  Friday   6.140710  0.555981    10.187634    0.359724
1  Berlin  Monday   6.420175  0.986568    10.134376    0.963938
2    Oslo  Friday   6.978572  0.573297    11.345484    1.454762
3    Oslo  Monday   4.594814       NaN    10.842988         NaN

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM