
Pandas stack multiple columns to a single column

I have the following DataFrame:

                    ETHNIC                       RACE        AGE       TRT01A
0   NOT HISPANIC OR LATINO                      WHITE  31.824778  Treatment B
1   NOT HISPANIC OR LATINO                      WHITE  31.381246      Placebo
2       HISPANIC OR LATINO                      WHITE  45.522245  Treatment A
3       HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN  42.910335  Treatment B
4   NOT HISPANIC OR LATINO                      WHITE  31.381246      Placebo
5   NOT HISPANIC OR LATINO                      WHITE  38.045175  Treatment B
6       HISPANIC OR LATINO                      WHITE  39.337440      Placebo
7   NOT HISPANIC OR LATINO                      WHITE  47.121150      Placebo
8   NOT HISPANIC OR LATINO                      WHITE  38.203970  Treatment A
9   NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN  22.926762      Placebo
10      HISPANIC OR LATINO                      WHITE  45.226557  Treatment B
11      HISPANIC OR LATINO                      WHITE  32.112252      Placebo

Copy the DataFrame above to the clipboard and run df=pd.read_clipboard('\\s\\s+') to load it into a variable.
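If the clipboard route is inconvenient (for example on a headless machine), the same data can be built directly. This is a minimal sketch that retypes the twelve rows shown above; the short names H, N, W and B are only local abbreviations introduced here for readability.

```python
import pandas as pd

# Local abbreviations (introduced for this sketch) for the repeated category labels
H, N = "HISPANIC OR LATINO", "NOT HISPANIC OR LATINO"
W, B = "WHITE", "BLACK OR AFRICAN AMERICAN"

# Same twelve rows as the printed DataFrame, in the same order
df = pd.DataFrame({
    "ETHNIC": [N, N, H, H, N, N, H, N, N, N, H, H],
    "RACE":   [W, W, W, B, W, W, W, W, W, B, W, W],
    "AGE":    [31.824778, 31.381246, 45.522245, 42.910335,
               31.381246, 38.045175, 39.337440, 47.121150,
               38.203970, 22.926762, 45.226557, 32.112252],
    "TRT01A": ["Treatment B", "Placebo", "Treatment A", "Treatment B",
               "Placebo", "Treatment B", "Placebo", "Placebo",
               "Treatment A", "Placebo", "Treatment B", "Placebo"],
})
```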

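import numpy as np
import pandas as pd

out = (df.groupby(['TRT01A','ETHNIC', 'RACE'])['AGE']
       .agg(mean=np.mean,
            n='count',
            deviation=np.std,
            # np.percentile expects q in [0, 100], so 0.25 is the 0.25th
            # percentile (as the output below shows); use 25 for the first quartile
            Q1=lambda x: np.percentile(x, 0.25)
            )
       .T.unstack().unstack(0)
       )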

I performed some aggregations on the DataFrame above, transposed the result, and unstacked it twice to get the following:

TRT01A                                                        Placebo  Treatment A  Treatment B
ETHNIC                 RACE                                                                    
HISPANIC OR LATINO     BLACK OR AFRICAN AMERICAN mean             NaN          NaN    42.910335
                                                 n                NaN          NaN     1.000000
                                                 deviation        NaN          NaN          NaN
                                                 Q1               NaN          NaN    42.910335
                       WHITE                     mean       35.724846    45.522245    45.226557
                                                 n           2.000000     1.000000     1.000000
                                                 deviation   5.108979          NaN          NaN
                                                 Q1         32.130315    45.522245    45.226557
NOT HISPANIC OR LATINO BLACK OR AFRICAN AMERICAN mean       22.926762          NaN          NaN
                                                 n           1.000000          NaN          NaN
                                                 deviation        NaN          NaN          NaN
                                                 Q1         22.926762          NaN          NaN
                       WHITE                     mean       36.627881    38.203970    34.934976
                                                 n           3.000000     1.000000     2.000000
                                                 deviation   9.087438          NaN     4.398485
                                                 Q1         31.381246    38.203970    31.840329
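A side note on the Q1 rows: np.percentile takes q on a 0 to 100 scale, so np.percentile(x, 0.25) returns the 0.25th percentile rather than the first quartile. The sketch below uses the two Placebo / HISPANIC OR LATINO / WHITE ages from the input to show the difference; the variable names are illustrative only.

```python
import numpy as np

# The two ages in the Placebo / HISPANIC OR LATINO / WHITE group
ages = np.array([39.337440, 32.112252])

tiny = np.percentile(ages, 0.25)    # 0.25th percentile, ~32.130315 (the value in the table)
q1 = np.percentile(ages, 25)        # first quartile, ~33.918549
also_q1 = np.quantile(ages, 0.25)   # np.quantile takes q on a 0 to 1 scale

print(tiny, q1, also_q1)
```

If the first quartile is what is wanted, passing 25 to np.percentile (or 0.25 to np.quantile) would change the Q1 values shown in the tables here.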

Now I want to flatten all the index levels into the following structure (i.e. insert a NaN row for each index level from the first to the second-to-last, along with a Level column denoting the index level each row belongs to):

                             Placebo  Treatment A  Treatment B  Level
HISPANIC OR LATINO               NaN          NaN          NaN      0 <---
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1 <---
mean                             NaN          NaN    42.910335      2
n                                NaN          NaN     1.000000      2
deviation                        NaN          NaN          NaN      2
Q1                               NaN          NaN    42.910335      2
WHITE                            NaN          NaN          NaN      1 <---
mean                       35.724846    45.522245    45.226557      2
n                           2.000000     1.000000     1.000000      2
deviation                   5.108979          NaN          NaN      2
Q1                         32.130315    45.522245    45.226557      2
NOT HISPANIC OR LATINO           NaN          NaN          NaN      0 <---
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1 <---
mean                       22.926762          NaN          NaN      2
n                           1.000000          NaN          NaN      2
deviation                        NaN          NaN          NaN      2
Q1                         22.926762          NaN          NaN      2
WHITE                            NaN          NaN          NaN      1 <---
mean                       36.627881    38.203970    34.934976      2
n                           3.000000     1.000000     2.000000      2
deviation                   9.087438          NaN     4.398485      2
Q1                         31.381246    38.203970    31.840329      2   

This question is identical to the previous question I asked, but the problem is that there can be anywhere from 1 to 4 index columns after aggregating (i.e. the aggregation may be applied over 1 to 5 columns), which makes it difficult to reuse the previous solution in this scenario.

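Use a custom function that builds a helper DataFrame filled with default NaN values and places it ahead of each group's rows:

def f(x):
    # x.name is the (ETHNIC, RACE) group key: one all-NaN header row per key part
    names = pd.DataFrame(index=x.name, columns=x.columns).assign(Level=[0,1])
    # drop the two outer index levels and mark the remaining rows as Level 2
    body = x.reset_index(level=[0,1], drop=True).assign(Level=2)
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    return pd.concat([names, body])

out = out.groupby(level=[0,1], group_keys=False).apply(f)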

Then remove the duplicated Level 0 rows:

out = out[~out.index.duplicated() | out['Level'].isin([1,2])]

print (out)
TRT01A                       Placebo  Treatment A  Treatment B  Level
HISPANIC OR LATINO               NaN          NaN          NaN      0
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1
mean                             NaN          NaN    42.910335      2
n                                NaN          NaN     1.000000      2
deviation                        NaN          NaN          NaN      2
Q1                               NaN          NaN    42.910335      2
WHITE                            NaN          NaN          NaN      1
mean                       35.724846    45.522245    45.226557      2
n                           2.000000     1.000000     1.000000      2
deviation                   5.108979          NaN          NaN      2
Q1                         32.130315    45.522245    45.226557      2
NOT HISPANIC OR LATINO           NaN          NaN          NaN      0
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1
mean                       22.926762          NaN          NaN      2
n                           1.000000          NaN          NaN      2
deviation                        NaN          NaN          NaN      2
Q1                         22.926762          NaN          NaN      2
WHITE                            NaN          NaN          NaN      1
mean                       36.627881    38.203970    34.934976      2
n                           3.000000     1.000000     2.000000      2
deviation                   9.087438          NaN     4.398485      2
Q1                         31.381246    38.203970    31.840329      2
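
The function above hard-codes two outer index levels (level=[0,1]), while the question mentions anywhere from 1 to 4 index columns. The following is a minimal sketch of one possible generalization, not part of the original answer: it walks the rows in order and emits a NaN header row whenever an outer index level changes. The name flatten_index and the demo frame are hypothetical, and the sketch assumes the index is already sorted or grouped and contains no NaN labels.

```python
import numpy as np
import pandas as pd

def flatten_index(df):
    """Flatten an n-level row index into header rows plus a Level column."""
    n = df.index.nlevels
    if n == 1:
        # Nothing to flatten: every row is already at the only level
        return df.assign(Level=0)

    labels, rows, levels = [], [], []
    nan_row = [np.nan] * df.shape[1]
    prev = None
    for key, values in zip(df.index, df.to_numpy()):
        if prev is None:
            start = 0
        else:
            # First outer level whose label differs from the previous row
            start = next((i for i in range(n - 1) if key[i] != prev[i]), n - 1)
        # Emit one all-NaN header row for each outer level that changed
        for i in range(start, n - 1):
            labels.append(key[i])
            rows.append(nan_row)
            levels.append(i)
        # Emit the data row, labelled by the innermost level
        labels.append(key[-1])
        rows.append(list(values))
        levels.append(n - 1)
        prev = key

    result = pd.DataFrame(rows, index=labels, columns=df.columns)
    result["Level"] = levels
    return result

# Made-up illustration data with three index levels
idx = pd.MultiIndex.from_tuples(
    [("A", "x", "mean"), ("A", "x", "n"), ("A", "y", "mean"), ("B", "x", "mean")]
)
demo = pd.DataFrame({"P": [1.0, 2.0, 3.0, 4.0]}, index=idx)
flat = flatten_index(demo)
print(flat)
```

Because header rows are emitted only when a level actually changes, this approach would not need the separate step that drops duplicated Level 0 rows.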
    
