I have a Pandas DataFrame containing several categorical variables. For example:
import pandas as pd
d = {'grade': ['A', 'B', 'C', 'A', 'B'],
     'year': ['2013', '2013', '2013', '2012', '2012']}
df = pd.DataFrame(d)
I would like to transform this to a MultiIndex DataFrame with the following properties:
For example:
Could anyone suggest a method for creating this MultiIndex DataFrame?
Another way to do this is to use melt and groupby:
df_out = df.melt().groupby(['variable','value']).size().to_frame(name='n')
# Series.sum(level=...) was removed in pandas 2.0; group on the index level instead
df_out['proportion'] = df_out['n'].div(df_out['n'].groupby(level=0).sum(), level=0)
print(df_out)
Output:
                n  proportion
variable value
grade    A      2         0.4
         B      2         0.4
         C      1         0.2
year     2012   2         0.4
         2013   3         0.6
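For reference, here is a self-contained sketch of the same melt and groupby idea. The helper name count_and_proportion is made up for this example, and it uses groupby(level=...).transform('sum') for the per-variable totals, which should work on both older and current pandas versions.

```python
import pandas as pd


def count_and_proportion(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: counts and within-variable proportions per column."""
    # melt() turns the frame into long form with 'variable' and 'value' columns
    counts = (
        df.melt()
          .groupby(['variable', 'value'])
          .size()
          .to_frame(name='n')
    )
    # Total count of each variable, repeated on every row of that variable
    totals = counts.groupby(level='variable')['n'].transform('sum')
    counts['proportion'] = counts['n'] / totals
    return counts


df = pd.DataFrame({'grade': ['A', 'B', 'C', 'A', 'B'],
                   'year': ['2013', '2013', '2013', '2012', '2012']})
result = count_and_proportion(df)
print(result)
```

Because the totals come from transform, they are already aligned row by row with the counts, so a plain division is enough and no level argument is needed.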
And, if you really want to get crazy and do it in a one-liner:
(df.melt().groupby(['variable','value']).size().to_frame(name='n')
.pipe(lambda x: x.assign(proportion = x[['n']]/x.groupby(level=0).transform('sum'))))
Upgraded solution using @Wen's pct calculation:
(df.melt().groupby(['variable','value']).size().to_frame(name='n')
   .pipe(lambda x: x.assign(proportion=x['n'].div(x['n'].groupby(level=0).sum(), level=0))))
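You can try this:
df1 = df.apply(pd.Series.value_counts).stack().swaplevel(0, 1).to_frame('n')
# Series.sum(level=...) was removed in pandas 2.0; group on the index level instead
df1['pct'] = df1['n'].div(df1['n'].groupby(level=0).sum(), level=0)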
df1
Output:
              n  pct
year  2012  2.0  0.4
      2013  3.0  0.6
grade A     2.0  0.4
      B     2.0  0.4
      C     1.0  0.2
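The n column shows 2.0 rather than 2 because df.apply with value_counts aligns both columns on one shared index and fills the combinations that do not occur with NaN, which forces a float dtype. A minimal sketch, assuming integer counts are wanted, that drops those placeholders explicitly and casts back (the names wide and df1_int are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'C', 'A', 'B'],
                   'year': ['2013', '2013', '2013', '2012', '2012']})

# value_counts per column; values absent from a column show up as NaN
wide = df.apply(pd.Series.value_counts)

# Long form, NaN placeholders removed, column name moved to the outer level
n = wide.stack().dropna().swaplevel(0, 1).astype(int)
df1_int = n.to_frame('n')

# Share of each value within its own variable (index level 0)
df1_int['pct'] = df1_int['n'] / df1_int.groupby(level=0)['n'].transform('sum')
print(df1_int)
```

The explicit dropna() is harmless where stack() already drops missing values, and it keeps the result the same if stack() leaves them in.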
Step-by-step method:
df1 = df.groupby("grade").count()
df2 = df.groupby("year").count()
df1.columns = ['n']
df2.columns = ['n']
df1['proportion'] = df1.divide(df1.sum())
df2['proportion'] = df2.divide(df2.sum())
df_new = pd.concat([df1, df2], keys=['grade', 'year'], names=['variable'])
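To look at the concat step in isolation, here is a small sketch on the same data. It builds each piece with value_counts instead of groupby().count(); the helper summarise and the second level name 'value' are choices made for this example only.

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'C', 'A', 'B'],
                   'year': ['2013', '2013', '2013', '2012', '2012']})


def summarise(col: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: counts and proportions for a single column."""
    counts = col.value_counts().sort_index()
    return pd.DataFrame({'n': counts, 'proportion': counts / counts.sum()})


# keys= labels each piece and becomes the outermost index level;
# names= labels the levels of the resulting MultiIndex
df_new = pd.concat([summarise(df['grade']), summarise(df['year'])],
                   keys=['grade', 'year'], names=['variable', 'value'])
print(df_new)
```

Rows can then be selected per variable, for example df_new.loc['grade'].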
With concat, one can assign keys that become the outermost index level, and name that new level with names=.

The DataFrame can be created by stacking each variable in a loop, but this seems inefficient, e.g.:
d_end = []
for c in df.columns:
    # Count each value of column c, then add its share of the column total
    temp_df = pd.DataFrame(df[c].value_counts().rename('n'))
    temp_df['proportion'] = temp_df['n'] / temp_df['n'].sum()
    temp_df['variable'] = c
    temp_df.set_index(['variable', temp_df.index], inplace=True)
    d_end.append(temp_df)
df_end = pd.concat(d_end, axis=0)
I'm hoping someone can suggest a better way, avoiding the loop.