
Pandas percentage of total row within multiindex

I have a dataframe that looks as follows:

import pandas as pd

df = pd.DataFrame([['Foo','A','Green',10,20],['Foo','A','Red',20,30],['Foo','A','Total',50,60],['Foo','B','Blue',5,10],['Foo','B','Red',15,25],['Foo','B','Total',40,100],['Foo','C','Orange',25,8],['Foo','C','Total',50,10]],columns = ['Default','Letter','Color','Value','Value2'])
print(df)

  Default Letter   Color  Value  Value2
0     Foo      A   Green     10      20
1     Foo      A     Red     20      30
2     Foo      A   Total     50      60
3     Foo      B    Blue      5      10
4     Foo      B     Red     15      25
5     Foo      B   Total     40     100
6     Foo      C  Orange     25       8
7     Foo      C   Total     50      10

I need to find the percentage of the Total row that each color makes up within each group.

My first thought was to split the color rows and the Total rows into separate DataFrames, index each one, and use .div. In this case, though, I have a MultiIndex (I know the first level in my example always says Foo, but that's not how the real data looks, so roll with it) and I get a NotImplementedError.

df_color = df[df['Color']!='Total'].set_index(['Default','Letter','Color'])
df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1).set_index(['Default','Letter'])

df_out = df_color.div(df_tot)

NotImplementedError                       Traceback (most recent call last)
<ipython-input-119-0caf0e2959a6> in <module>()
      4 df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1).set_index(['Default','Letter'])
      5 
----> 6 df_out = df_color.div(df_tot)
      7 #df.set_index(['Default','Letter','Color'],inplace = True)...
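For reference, here is a minimal sketch of one way the split-and-divide idea might be made to work: repeat each group's Total row so that it lines up with the three-level index before dividing. The name `aligned_tot` is introduced only for this illustration, and the sketch assumes exactly one Total row per (Default, Letter) pair.

```python
import pandas as pd

df = pd.DataFrame(
    [['Foo', 'A', 'Green', 10, 20], ['Foo', 'A', 'Red', 20, 30], ['Foo', 'A', 'Total', 50, 60],
     ['Foo', 'B', 'Blue', 5, 10], ['Foo', 'B', 'Red', 15, 25], ['Foo', 'B', 'Total', 40, 100],
     ['Foo', 'C', 'Orange', 25, 8], ['Foo', 'C', 'Total', 50, 10]],
    columns=['Default', 'Letter', 'Color', 'Value', 'Value2'])

df_color = df[df['Color'] != 'Total'].set_index(['Default', 'Letter', 'Color'])
df_tot = df[df['Color'] == 'Total'].drop(columns='Color').set_index(['Default', 'Letter'])

# Repeat each group's Total row once per color row, looked up by (Default, Letter).
# Assumption: df_tot has a unique index, i.e. one Total row per group.
aligned_tot = df_tot.reindex(df_color.index.droplevel('Color'))

# Give the totals the same three-level index so the division aligns row for row
aligned_tot.index = df_color.index

df_out = df_color / aligned_tot
print(df_out)
```

Because both frames end up with identical indexes and column names, the plain `/` operator is enough and no `level` argument is needed.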

Here is my desired output:

df_out = pd.DataFrame([['Foo','A','Green',.2,.333],['Foo','A','Red',.4,.5],['Foo','B','Blue',.125,.1],['Foo','B','Red',.375,.25],['Foo','C','Orange',.5,.8]],columns = ['Default','Letter','Color','Value','Value2'])
print(df_out)

EDIT: note that there are actually many value columns. For simplicity I show only two here, but the solution needs to handle 50-100 numerical value columns.

You can do this with a groupby. Check out the tutorial on using groupby.

Note: this implementation assumes that the Total row for each letter is the last row of that letter's group (as in the example), but this is easily modifiable.

cols = [x for x in df.columns if x not in ['Default', 'Letter', 'Color']]  # or df.columns[3:]
# Divide every row of each Letter group by that group's last row (the Total row)
df[cols] = df.groupby('Letter', group_keys=False)[cols].apply(lambda g: g / g.iloc[-1])
df[df['Color'] != 'Total']

returns

  Default Letter   Color  Value    Value2
0     Foo      A   Green  0.200  0.333333
1     Foo      A     Red  0.400  0.500000
3     Foo      B    Blue  0.125  0.100000
4     Foo      B     Red  0.375  0.250000
6     Foo      C  Orange  0.500  0.800000
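The note about the Total row being last can be relaxed. As a rough sketch (not necessarily what the answer's author had in mind), one option is to sort a temporary flag so that every Total row lands after the color rows, then broadcast each group's last row with `transform('last')`. The helper column `_is_total` and the names `ordered`, `totals`, and `result` are made up for this illustration. The rows are deliberately reversed to show that the original order does not matter.

```python
import pandas as pd

# Same data as the question, reversed so that Total is no longer the last row of a group
df = pd.DataFrame(
    [['Foo', 'A', 'Green', 10, 20], ['Foo', 'A', 'Red', 20, 30], ['Foo', 'A', 'Total', 50, 60],
     ['Foo', 'B', 'Blue', 5, 10], ['Foo', 'B', 'Red', 15, 25], ['Foo', 'B', 'Total', 40, 100],
     ['Foo', 'C', 'Orange', 25, 8], ['Foo', 'C', 'Total', 50, 10]],
    columns=['Default', 'Letter', 'Color', 'Value', 'Value2']).iloc[::-1]

keys = ['Default', 'Letter']
cols = [c for c in df.columns if c not in keys + ['Color']]

is_total = df['Color'].eq('Total')

# Sorting on the boolean flag (False < True) moves every Total row after the color rows
ordered = df.assign(_is_total=is_total).sort_values('_is_total', kind='stable')

# 'last' now picks each group's Total row and repeats it for every row of the group
totals = ordered.groupby(keys)[cols].transform('last')

# Both frames share the same index, so the division lines up row for row
pct = ordered[cols] / totals

# Reattach the label columns, drop the Total rows, and restore the original row labels' order
result = df[keys + ['Color']].join(pct).loc[~is_total].sort_index()
print(result)
```

Grouping on both `Default` and `Letter` is also an assumption here; it keeps groups separate when the first key varies in the real data.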

I ended up reshaping the dataframes with the melt function, so that the column name became another column in the data. Then I could simply merge and divide, and reshape back at the end.

df = pd.DataFrame([['Foo','A','Green',10,20],['Foo','A','Red',20,30],['Foo','A','Total',50,60],['Foo','B','Blue',5,10],['Foo','B','Red',15,25],['Foo','B','Total',40,100],['Foo','C','Orange',25,8],['Foo','C','Total',50,10]],columns = ['Default','Letter','Color','Value','Value2'])

df_color = df[df['Color']!='Total']
df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1)

# var_name must be a scalar string for single-level columns in recent pandas versions
df_melt = pd.melt(df_color, id_vars=['Default','Letter','Color'], var_name='value_field')
df_tot_melt = pd.melt(df_tot, id_vars=['Default','Letter'], var_name='value_field', value_name='Total')


df_melt_pct = pd.merge(df_melt, df_tot_melt, how = 'outer', on = ['Default','Letter','value_field'])
df_melt_pct['Pct'] = df_melt_pct['value'] /df_melt_pct['Total']
df_melt_pct = df_melt_pct.drop(['value','Total'],axis = 1).set_index(['Default','Letter','Color','value_field']).unstack()
df_melt_pct.columns = df_melt_pct.columns.droplevel(0)

print(df_melt_pct)

value_field            Value    Value2
Default Letter Color                  
Foo     A      Green   0.200  0.333333
               Red     0.400  0.500000
        B      Blue    0.125  0.100000
               Red     0.375  0.250000
        C      Orange  0.500  0.800000
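For comparison, the same merge-and-divide idea can be sketched without melting, by merging the totals in as extra columns. This is only an illustration: the `_total` suffix and the names `merged`, `total_cols`, and `pct` are assumptions, and it relies on there being one Total row per (Default, Letter) pair.

```python
import pandas as pd

df = pd.DataFrame(
    [['Foo', 'A', 'Green', 10, 20], ['Foo', 'A', 'Red', 20, 30], ['Foo', 'A', 'Total', 50, 60],
     ['Foo', 'B', 'Blue', 5, 10], ['Foo', 'B', 'Red', 15, 25], ['Foo', 'B', 'Total', 40, 100],
     ['Foo', 'C', 'Orange', 25, 8], ['Foo', 'C', 'Total', 50, 10]],
    columns=['Default', 'Letter', 'Color', 'Value', 'Value2'])

keys = ['Default', 'Letter']
cols = [c for c in df.columns if c not in keys + ['Color']]

df_color = df[df['Color'] != 'Total']
df_tot = df[df['Color'] == 'Total'].drop(columns='Color')

# Each color row picks up its group's totals as extra "<name>_total" columns
merged = df_color.merge(df_tot, on=keys, how='left', suffixes=('', '_total'))
total_cols = [c + '_total' for c in cols]

# Divide the value columns by the matching total columns, position by position
ratios = pd.DataFrame(
    merged[cols].to_numpy() / merged[total_cols].to_numpy(),
    columns=cols, index=merged.index)

pct = pd.concat([merged[keys + ['Color']], ratios], axis=1).set_index(keys + ['Color'])
print(pct)
```

With 50-100 value columns this keeps the frame wide throughout, at the cost of temporarily doubling the number of columns during the merge.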


 