
How to filter groupby results in Pandas

I am new to Pandas and have the following dataset. Think of it as the departments ('k1') and people ('k2') of a company.

import numpy as np
import pandas as pd

dframe = pd.DataFrame({'k1': ['X','X','Y','Y','Z','Z'],
                       'k2': ['P1','P2','P3','P4','P5','P6'],
                       'dataset1': np.random.randn(6)})

If I take the mean grouped by department ('k1'), I get the following:

   dataset1
k1
X   0.153825
Y  -0.648500
Z   1.133334

If I take the mean grouped by people ('k2'), I get the following:

In [6]: dframe.groupby('k2').mean()
Out[6]:
    dataset1
k2
P1  1.595455
P2 -1.287805
P3  0.211858
P4 -1.508859
P5  1.350336
P6  0.916332

My question is: how can I keep only the means grouped by people ('k2') that are greater than the mean of the department ('k1') they belong to? For example, the mean for P1 is greater than the mean for X, the department it belongs to, so the desired result is:

Out[6]:
    dataset1
k2
P1  1.595455
P3  0.211858
P5  1.350336

Sample data (P6 changed to P5, so that one person has two rows):

import numpy as np
import pandas as pd

np.random.seed(45)
dframe = pd.DataFrame({'k1': ['X','X','Y','Y','Z','Z'],
                       'k2': ['P1','P2','P3','P4','P5','P5'],
                       'dataset1': np.random.randn(6)})

print (dframe)
   dataset1 k1  k2
0  0.026375  X  P1
1  0.260322  X  P2
2 -0.395146  Y  P3
3 -0.204301  Y  P4
4 -1.271633  Z  P5
5 -2.596879  Z  P5

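First create a new column with groupby and transform, which returns the department mean aligned to every original row. Select the dataset1 column explicitly so the mean is not attempted on the string column k2:

dframe['meank1'] = dframe.groupby('k1')['dataset1'].transform('mean')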
print (dframe)
   dataset1 k1  k2    meank1
0  0.026375  X  P1  0.143348
1  0.260322  X  P2  0.143348
2 -0.395146  Y  P3 -0.299723
3 -0.204301  Y  P4 -0.299723
4 -1.271633  Z  P5 -1.934256
5 -2.596879  Z  P5 -1.934256
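
To illustrate why transform is used here rather than a plain aggregation, the following minimal sketch compares the two on a small frame. The values are hypothetical and chosen only for easy arithmetic; they are not the random data above.

```python
import pandas as pd

# Hypothetical toy values (not the data from the answer)
df = pd.DataFrame({'k1': ['X', 'X', 'Y', 'Y'],
                   'dataset1': [1.0, 3.0, 10.0, 20.0]})

# Plain aggregation: one value per group, indexed by k1
per_group = df.groupby('k1')['dataset1'].mean()

# transform: same length and index as df, with the group mean repeated on each row
per_row = df.groupby('k1')['dataset1'].transform('mean')

print(per_group.to_dict())  # {'X': 2.0, 'Y': 15.0}
print(per_row.tolist())     # [2.0, 2.0, 15.0, 15.0]
```

Because the transformed result shares the index of the original frame, it can be assigned directly as a new column.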

Then aggregate with agg, taking the mean of dataset1 and the first value of meank1. It is also necessary to add the k1 column to the groupby, to avoid wrong output if the same k2 value appears under another k1.

dframe = dframe.groupby(['k1','k2']).agg({'dataset1':'mean', 'meank1':'first'})
print (dframe)
         meank1  dataset1
k1 k2                    
X  P1  0.143348  0.026375
   P2  0.143348  0.260322
Y  P3 -0.299723 -0.395146
   P4 -0.299723 -0.204301
Z  P5 -1.934256 -1.934256
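
The point about including k1 in the groupby can be seen in a small sketch. It assumes hypothetical data in which the label 'P1' is reused in two departments, which is not the case in the sample above.

```python
import pandas as pd

# Hypothetical data: the label 'P1' appears in both department X and department Y
df = pd.DataFrame({'k1': ['X', 'X', 'Y', 'Y'],
                   'k2': ['P1', 'P2', 'P1', 'P3'],
                   'dataset1': [1.0, 3.0, 10.0, 20.0]})
df['meank1'] = df.groupby('k1')['dataset1'].transform('mean')

# Grouping by k2 alone merges the two different 'P1' people into one row
by_k2 = df.groupby('k2').agg({'dataset1': 'mean', 'meank1': 'first'})

# Grouping by k1 and k2 keeps them apart
by_both = df.groupby(['k1', 'k2']).agg({'dataset1': 'mean', 'meank1': 'first'})

print(by_k2.loc['P1', 'dataset1'])           # 5.5, a mix of X and Y
print(by_both.loc[('X', 'P1'), 'dataset1'])  # 1.0
print(by_both.loc[('Y', 'P1'), 'dataset1'])  # 10.0
```

With k2 alone, the merged 'P1' row would also be compared against only the first department mean it happens to see, so the later filter could give misleading results.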

Finally, filter by boolean indexing or query, keeping only the rows where the person's mean is greater than the department mean:

dframe = dframe.loc[dframe['dataset1'] > dframe['meank1'], ['dataset1']]
# alternative solution with query
#dframe = dframe.query('dataset1 > meank1')[['dataset1']]
print (dframe)
       dataset1
k1 k2          
X  P2  0.260322
Y  P4 -0.204301

P5 does not appear because it is the only person in Z, so its mean equals the department mean rather than exceeding it.

To remove the first level of the MultiIndex, add reset_index:

dframe = dframe.reset_index(level=0, drop=True)
print (dframe)
    dataset1
k2          
P2  0.260322
P4 -0.204301

To turn k2 from the index into a column as well, chain a second reset_index in place of the previous step:

dframe = dframe.reset_index(level=0, drop=True).reset_index()
print (dframe)
   k2  dataset1
0  P2  0.260322
1  P4 -0.204301
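
The steps above can be combined into one function. This is only a sketch: the helper name above_group_mean, its parameters, the temporary group_mean column and the toy values are all assumptions made for illustration. It keeps the sub-groups whose mean is strictly greater than the mean of the group they belong to, which is what the question asks for.

```python
import pandas as pd

def above_group_mean(df, group_col, sub_col, value_col):
    """Hypothetical helper: sub-group means that exceed their group's mean."""
    # Group mean broadcast back to every original row
    group_mean = df.groupby(group_col)[value_col].transform('mean')
    out = (df.assign(group_mean=group_mean)
             .groupby([group_col, sub_col])
             .agg({value_col: 'mean', 'group_mean': 'first'}))
    # Keep rows where the sub-group mean is strictly greater than the group mean
    out = out.loc[out[value_col] > out['group_mean'], [value_col]]
    # Drop the group level and turn the sub-group label into a column
    return out.reset_index(level=0, drop=True).reset_index()

# Hypothetical toy values chosen for easy arithmetic
df = pd.DataFrame({'k1': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
                   'k2': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
                   'dataset1': [1.0, 3.0, 10.0, 20.0, 7.0, 5.0]})
result = above_group_mean(df, 'k1', 'k2', 'dataset1')
print(result)
```

With these toy values the group means are 2, 15 and 6, so the rows kept are P2, P4 and P5.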
