Groupby on pandas index level

Question

I have the following MultiIndex DataFrame, and I'm wondering whether there is a way to apply different functions depending on the value of the second index level.

import pandas as pd
# Creation
df1 = pd.DataFrame([[1,2,1],[4,5,1],[4,5,2]], columns=["M1","M2","month"])
df1['var']="v1"
df2 = pd.DataFrame([[1.5,2.5,1],[4.5,5.5,1],[1.5,1.5,2]], columns=["M1","M2","month"])
df2['var']="v2"
df_all = pd.concat([df1,df2],join='outer')

# Final DataFrame
df_all_idx = df_all.set_index(["month","var"],inplace=False)
df_all_idx.sort_index(level=[0])

            M1   M2
month var
1     v1   1.0  2.0
      v1   4.0  5.0
      v2   1.5  2.5
      v2   4.5  5.5
2     v1   4.0  5.0
      v2   1.5  1.5

With groupby I can obtain:

df_grp = df_all_idx.groupby(by=["month","var"]).sum()

            M1   M2
month var
1     v1   5.0  7.0
      v2   6.0  8.0
2     v1   4.0  5.0
      v2   1.5  1.5

For example, I need to apply sum() to the v1 rows and a custom function to the v2 rows.

Thanks

Answer 1

I like dictionaries. So I would store your aggregating functions in a dictionary and look them up based on each group's name.

import numpy
import pandas

aggregators = {
    'v2': numpy.min
}


df1 = pandas.DataFrame(
    [[1, 2, 1],[4, 5, 1],[4, 5, 2]],
    columns=["M1", "M2", "month"]
).assign(var='v1')

df2 = pandas.DataFrame(
    [[1.5,2.5,1], [4.5,5.5,1], [1.5,1.5,2]],
    columns=["M1", "M2", "month"]
).assign(var='v2')

df = (
    pandas.concat([df1, df2], join='outer')
        .groupby(by=['month', 'var'])
        .apply(lambda g: aggregators.get(g.name[-1], numpy.sum)(g))
        [['M1', 'M2']]
)

And that gives:

            M1   M2
month var          
1     v1   5.0  7.0
      v2   1.5  2.5
2     v1   4.0  5.0
      v2   1.5  1.5

This line is a little complicated: .apply(lambda g: aggregators.get(g.name[-1], numpy.sum)(g)). Here's what it does:

.apply loops through all of the groups and runs each one through the lambda.
Each group has a name attribute holding the values of the grouping columns, here a (month, var) tuple.
g.name[-1] is the last element of that tuple (v1 or v2).
aggregators.get(g.name[-1], numpy.sum) looks up the function to use; if no function is found, it defaults to numpy.sum.
Then we pass the group to the function that we looked up.
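
The lookup-and-dispatch steps above can also be written as an explicit loop, which may be easier to follow and does not depend on how .apply treats the grouping columns. This is only a sketch under a few assumptions: it uses the sample data from the question, the strings "min" and "sum" stand in for numpy.min and numpy.sum, and the names rows, keys and result are introduced here purely for illustration.

```python
import pandas as pd

# Sample data from the question, flattened into one frame.
df = pd.DataFrame({
    "M1": [1, 4, 4, 1.5, 4.5, 1.5],
    "M2": [2, 5, 5, 2.5, 5.5, 1.5],
    "month": [1, 1, 2, 1, 1, 2],
    "var": ["v1", "v1", "v1", "v2", "v2", "v2"],
})

# Per-var aggregators; any var not listed falls back to "sum".
aggregators = {"v2": "min"}

keys, rows = [], []
for (month, var), group in df.groupby(["month", "var"]):
    func = aggregators.get(var, "sum")          # look up by the last key element
    keys.append((month, var))
    rows.append(group[["M1", "M2"]].agg(func).to_dict())

# Reassemble one row per (month, var) group.
index = pd.MultiIndex.from_tuples(keys, names=["month", "var"])
result = pd.DataFrame(rows, index=index)
print(result)
```

Swapping the value stored under "v2" for any callable that reduces a column to a scalar should work the same way.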

Answer 2

Would something like this work?

df_all_idx.xs('v1', level=1).sum(axis=1)
df_all_idx.xs('v2', level=1).apply(some_function, axis=1)
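
Note that in the snippet above, axis=1 works across the columns M1 and M2 within each row rather than down the rows of each month. If per-month results are what is wanted, a sketch along these lines may be closer. It assumes df_all_idx is built as in the question and uses min as a stand-in for some_function, which the answer leaves undefined.

```python
import pandas as pd

# Rebuild the question's frame so the example is self-contained.
df1 = pd.DataFrame([[1, 2, 1], [4, 5, 1], [4, 5, 2]], columns=["M1", "M2", "month"])
df1["var"] = "v1"
df2 = pd.DataFrame([[1.5, 2.5, 1], [4.5, 5.5, 1], [1.5, 1.5, 2]], columns=["M1", "M2", "month"])
df2["var"] = "v2"
df_all_idx = pd.concat([df1, df2]).set_index(["month", "var"]).sort_index()

# xs selects one value of the "var" level and drops that level,
# leaving "month" as the only index level.
v1 = df_all_idx.xs("v1", level="var")
v2 = df_all_idx.xs("v2", level="var")

# Aggregate down the rows of each month, not across columns.
v1_by_month = v1.groupby(level="month").sum()
v2_by_month = v2.groupby(level="month").min()   # stand-in for some_function

print(v1_by_month)
print(v2_by_month)
```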

Answer 3

Following the suggestion to split, apply, and concat back, I came up with this solution:

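import numpy as np

def myfunc(x):
    # Custom aggregation applied to the v2 rows
    return np.mean(x)

# Select each var with a MultiIndex slice, then aggregate per (month, var)
p1 = df_all_idx.loc[(slice(None), 'v1'), :].groupby(by=["month","var"]).sum()
p2 = df_all_idx.loc[(slice(None), 'v2'), :].groupby(by=["month","var"]).agg(myfunc)

# Recombine the two pieces and restore the index order
pd.concat([p1,p2], join='outer').sort_index(level=[0])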

That returns the result I want:

            M1   M2
month var
1     v1   5.0  7.0
      v2   3.0  4.0
2     v1   4.0  5.0
      v2   1.5  1.5

I assume, then, that this is the best practice in this case.
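
If more than two var values are involved, the same split, apply, and concat pattern can be driven by a mapping instead of one variable per piece. The following is a sketch only: the funcs mapping and the names parts and result are hypothetical, and "mean" plays the role of myfunc.

```python
import pandas as pd

# Rebuild the question's indexed frame so the example is self-contained.
df_all_idx = pd.DataFrame({
    "M1": [1, 4, 4, 1.5, 4.5, 1.5],
    "M2": [2, 5, 5, 2.5, 5.5, 1.5],
    "month": [1, 1, 2, 1, 1, 2],
    "var": ["v1", "v1", "v1", "v2", "v2", "v2"],
}).set_index(["month", "var"]).sort_index()

# One aggregation per value of the "var" level.
funcs = {"v1": "sum", "v2": "mean"}

# Split on "var" (keeping the level), aggregate each piece per (month, var).
parts = [
    df_all_idx.xs(var, level="var", drop_level=False)
              .groupby(level=["month", "var"])
              .agg(func)
    for var, func in funcs.items()
]

# Concatenate the pieces and restore the index order.
result = pd.concat(parts).sort_index()
print(result)
```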

Answer 1: score 3, accepted, 2018-05-17 18:38:05
Answer 2: score 0, 2018-05-17 17:56:01
Answer 3: score 0, 2018-05-17 18:30:27
