Pandas stack multiple columns to a single column
I have the following DataFrame:
    ETHNIC                  RACE                       AGE        TRT01A
0   NOT HISPANIC OR LATINO  WHITE                      31.824778  Treatment B
1   NOT HISPANIC OR LATINO  WHITE                      31.381246  Placebo
2   HISPANIC OR LATINO      WHITE                      45.522245  Treatment A
3   HISPANIC OR LATINO      BLACK OR AFRICAN AMERICAN  42.910335  Treatment B
4   NOT HISPANIC OR LATINO  WHITE                      31.381246  Placebo
5   NOT HISPANIC OR LATINO  WHITE                      38.045175  Treatment B
6   HISPANIC OR LATINO      WHITE                      39.337440  Placebo
7   NOT HISPANIC OR LATINO  WHITE                      47.121150  Placebo
8   NOT HISPANIC OR LATINO  WHITE                      38.203970  Treatment A
9   NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN  22.926762  Placebo
10  HISPANIC OR LATINO      WHITE                      45.226557  Treatment B
11  HISPANIC OR LATINO      WHITE                      32.112252  Placebo
Just copy the DataFrame above to the clipboard and run df = pd.read_clipboard(sep='\\s\\s+') to load it into a variable.
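If the clipboard is not convenient, the same separator idea can be sketched with pd.read_csv on an in-memory string. This is only an illustration under a few assumptions: the variable names raw and sample are made up here, only the first four rows are included, and the index column is left out to keep the parsing simple.

```python
import io

import pandas as pd

# A few rows of the table above, without the index column.
# Columns are separated by two or more spaces.
raw = """\
ETHNIC                  RACE                       AGE        TRT01A
NOT HISPANIC OR LATINO  WHITE                      31.824778  Treatment B
NOT HISPANIC OR LATINO  WHITE                      31.381246  Placebo
HISPANIC OR LATINO      WHITE                      45.522245  Treatment A
HISPANIC OR LATINO      BLACK OR AFRICAN AMERICAN  42.910335  Treatment B
"""

# sep is a regular expression: split on runs of two or more whitespace
# characters, so the single spaces inside values such as
# "NOT HISPANIC OR LATINO" stay intact. Regex separators need engine="python".
sample = pd.read_csv(io.StringIO(raw), sep=r"\s\s+", engine="python")
print(sample)
```

A single-space separator would not work here, because most values contain spaces themselves.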
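import numpy as np
import pandas as pd

out = (df.groupby(['TRT01A', 'ETHNIC', 'RACE'])['AGE']
         .agg(mean=np.mean,
              n='count',
              deviation=np.std,
              # np.percentile takes q on a 0-100 scale, so 0.25 is the 0.25th
              # percentile; pass 25 to get the first quartile
              Q1=lambda x: np.percentile(x, 0.25))
         .T.unstack().unstack(0)
      )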
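A note on the Q1 column: np.percentile expects q between 0 and 100, so the call above returns the 0.25th percentile, not the first quartile. The sketch below uses the two ages from the Placebo / HISPANIC OR LATINO / WHITE group to show the difference; the variable names are chosen here for illustration.

```python
import numpy as np

# The two AGE values of the Placebo / HISPANIC OR LATINO / WHITE group.
ages = np.array([32.112252, 39.337440])

# np.percentile takes q on a 0-100 scale ...
tiny = np.percentile(ages, 0.25)   # 0.25th percentile, close to the minimum
q1 = np.percentile(ages, 25)       # first quartile

# ... while np.quantile takes q on a 0-1 scale.
q1_alt = np.quantile(ages, 0.25)

print(tiny, q1, q1_alt)
```

The first value, about 32.130315, is the one that appears as Q1 in the result below, so the tables in this post reflect the 0.25th percentile.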
I performed some aggregations on the DataFrame above, transposed the result, and unstacked it twice in succession to get the following:
TRT01A                                                          Placebo  Treatment A  Treatment B
ETHNIC                  RACE
HISPANIC OR LATINO      BLACK OR AFRICAN AMERICAN  mean             NaN          NaN    42.910335
                                                   n                NaN          NaN     1.000000
                                                   deviation        NaN          NaN          NaN
                                                   Q1               NaN          NaN    42.910335
                        WHITE                      mean       35.724846    45.522245    45.226557
                                                   n           2.000000     1.000000     1.000000
                                                   deviation   5.108979          NaN          NaN
                                                   Q1         32.130315    45.522245    45.226557
NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN  mean       22.926762          NaN          NaN
                                                   n           1.000000          NaN          NaN
                                                   deviation        NaN          NaN          NaN
                                                   Q1         22.926762          NaN          NaN
                        WHITE                      mean       36.627881    38.203970    34.934976
                                                   n           3.000000     1.000000     2.000000
                                                   deviation   9.087438          NaN     4.398485
                                                   Q1         31.381246    38.203970    31.840329
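To make the .T.unstack().unstack(0) chain easier to follow, here is a step-by-step sketch on a toy frame. The frame toy and its columns g, h and v are invented for illustration; only the chain of calls mirrors the code above.

```python
import pandas as pd

toy = pd.DataFrame({
    "g": ["a", "a", "b"],
    "h": ["x", "y", "x"],
    "v": [1.0, 2.0, 3.0],
})

# One row per (g, h) group, one column per statistic.
agg = toy.groupby(["g", "h"])["v"].agg(mean="mean", n="count")

step1 = agg.T            # statistics become rows, (g, h) pairs become columns
step2 = step1.unstack()  # a Series indexed by (g, h, statistic)
wide = step2.unstack(0)  # level 0 (g) moves into the columns

print(wide)
```

Combinations that do not occur in the data, such as g = "b" with h = "y", show up as NaN, which is why the result above contains NaN cells.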
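Now I want to flatten all the index levels to get the following structure (i.e., insert a NaN row for each label of every index level from the first to the second-to-last, plus a Level column that indicates the index level):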
                             Placebo  Treatment A  Treatment B  Level
HISPANIC OR LATINO               NaN          NaN          NaN      0  <---
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1  <---
mean                             NaN          NaN    42.910335      2
n                                NaN          NaN     1.000000      2
deviation                        NaN          NaN          NaN      2
Q1                               NaN          NaN    42.910335      2
WHITE                            NaN          NaN          NaN      1  <---
mean                       35.724846    45.522245    45.226557      2
n                           2.000000     1.000000     1.000000      2
deviation                   5.108979          NaN          NaN      2
Q1                         32.130315    45.522245    45.226557      2
NOT HISPANIC OR LATINO           NaN          NaN          NaN      0  <---
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1  <---
mean                       22.926762          NaN          NaN      2
n                           1.000000          NaN          NaN      2
deviation                        NaN          NaN          NaN      2
Q1                         22.926762          NaN          NaN      2
WHITE                            NaN          NaN          NaN      1  <---
mean                       36.627881    38.203970    34.934976      2
n                           3.000000     1.000000     2.000000      2
deviation                   9.087438          NaN     4.398485      2
Q1                         31.381246    38.203970    31.840329      2
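This question is the same as a previous one I asked, but the catch is that there can be 1 to 4 index columns after aggregation (i.e., the aggregation may be applied on 1 to 5 columns), so it is hard to reuse the earlier solution in this case.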
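First, use a custom function that puts a header DataFrame, filled with NaN by default, on top of each group. The original answer did this with DataFrame.append, which was removed in pandas 2.0, so pd.concat takes its place here:
def f(x):
    # x.name is the (ETHNIC, RACE) key of the group: one all-NaN header row per key level
    names = pd.DataFrame(index=x.name, columns=x.columns).assign(Level=[0, 1])
    # drop the two outer index levels from the group and mark its rows as level 2
    body = x.reset_index(level=[0, 1], drop=True).assign(Level=2)
    return pd.concat([names, body])

out = out.groupby(level=[0, 1], group_keys=False).apply(f)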
Then remove the duplicated level-0 rows:
out = out[~out.index.duplicated() | out['Level'].isin([1, 2])]
print(out)
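The filter keeps a row if its label is seen for the first time, or if the row belongs to level 1 or 2. The effect is that only repeated level-0 labels are dropped. A minimal sketch on a made-up frame (the labels A, x, y and the name demo are illustrative only):

```python
import pandas as pd

# Two groups that share the level-0 label "A", as produced by the apply step.
demo = pd.DataFrame(
    {"Level": [0, 1, 2, 0, 1, 2]},
    index=["A", "x", "mean", "A", "y", "mean"],
)

first_seen = ~demo.index.duplicated()            # True for the first occurrence of a label
keep = first_seen | demo["Level"].isin([1, 2])   # level-1 and level-2 rows always stay
print(demo[keep])
```

The second "A" row is removed, while the repeated "mean" label survives because its Level is 2.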
TRT01A                       Placebo  Treatment A  Treatment B  Level
HISPANIC OR LATINO               NaN          NaN          NaN      0
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1
mean                             NaN          NaN    42.910335      2
n                                NaN          NaN     1.000000      2
deviation                        NaN          NaN          NaN      2
Q1                               NaN          NaN    42.910335      2
WHITE                            NaN          NaN          NaN      1
mean                       35.724846    45.522245    45.226557      2
n                           2.000000     1.000000     1.000000      2
deviation                   5.108979          NaN          NaN      2
Q1                         32.130315    45.522245    45.226557      2
NOT HISPANIC OR LATINO           NaN          NaN          NaN      0
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1
mean                       22.926762          NaN          NaN      2
n                           1.000000          NaN          NaN      2
deviation                        NaN          NaN          NaN      2
Q1                         22.926762          NaN          NaN      2
WHITE                            NaN          NaN          NaN      1
mean                       36.627881    38.203970    34.934976      2
n                           3.000000     1.000000     2.000000      2
deviation                   9.087438          NaN     4.398485      2
Q1                         31.381246    38.203970    31.840329      2
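The function f above is tied to exactly two outer index levels (level=[0, 1] and Level=[0, 1]). Since the question mentions 1 to 4 index columns, a more general variant might look like the sketch below. It is not the answer's method: it swaps groupby/apply for a plain loop over the rows, the helper name flatten_with_levels is made up for illustration, and it assumes the rows are already ordered so that equal outer labels sit next to each other.

```python
import numpy as np
import pandas as pd


def flatten_with_levels(frame):
    """Hypothetical helper: turn a (Multi)Index into header rows plus a Level column."""
    depth = frame.index.nlevels - 1           # number of outer (header) levels
    blank = [np.nan] * frame.shape[1]          # values for a header row
    labels, rows, levels = [], [], []
    previous = (None,) * depth                 # outer labels of the previous row
    for key, values in zip(frame.index, frame.to_numpy()):
        key = key if isinstance(key, tuple) else (key,)
        outer = key[:-1]
        # First outer level whose label differs from the previous row.
        changed = next((i for i in range(depth) if outer[i] != previous[i]), depth)
        # Emit a header row for that level and for every deeper outer level.
        for i in range(changed, depth):
            labels.append(outer[i])
            rows.append(blank)
            levels.append(i)
        # Emit the data row itself, labelled with the innermost index value.
        labels.append(key[-1])
        rows.append(list(values))
        levels.append(depth)
        previous = outer
    result = pd.DataFrame(rows, index=labels, columns=frame.columns)
    result["Level"] = levels
    return result


# Usage on a toy three-level frame (labels invented for illustration).
index = pd.MultiIndex.from_tuples([
    ("A", "x", "mean"), ("A", "x", "n"),
    ("A", "y", "mean"), ("A", "y", "n"),
    ("B", "x", "mean"), ("B", "x", "n"),
])
toy = pd.DataFrame({"Placebo": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)
flat = flatten_with_levels(toy)
print(flat)
```

Because header rows are emitted only when an outer label changes, no separate de-duplication step is needed. With a single-level index the loop emits no header rows at all and every row gets Level 0.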