
Pandas.DataFrame.GroupBy.agg, independent column needed in aggregation function. How to get it into agg?

I have a Pandas DataFrame object with a two-level MultiIndex. It also contains a number of additional columns (e.g. 'A', 'B', 'C', 'D', 'E'). For each combination of the two index levels, I want to run an aggregation function on each column from a subset of the available columns (say, 'C', 'D', 'E'). To do this, I select only that subset of columns, use GroupBy to group the sliced data frame by level=[0, 1], and call agg with a dictionary that configures the aggregation function for each selected column.

df[['C', 'D', 'E']].groupby(level=[0, 1]).agg({'C': aggfunc, 'D': aggfunc, 'E': aggfunc})

My problem is that, in addition to the column currently being aggregated and handed to the aggregation function, I need a second column, e.g. 'B', inside the aggregation function. So it is basically an aggregation of two columns: one of ['C', 'D', 'E'] plus 'B'.

What I could do is replace aggfunc with a closure that knows 'B'. Is that the only way? Or is there a way to tell Pandas to also hand 'B' to the aggregation function in addition to 'C', 'D', 'E'?
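For illustration, the closure idea mentioned above might look roughly like the following. This is only a sketch under assumptions: the toy data, the level names k1, k2 and row, and the helper make_aggfunc are all hypothetical, and the index is made unique with a third level so that rows of 'B' can be looked up by label.

```python
import pandas as pd

# Hypothetical toy data: two grouping levels plus a unique row counter.
index = pd.MultiIndex.from_tuples(
    [("a", 0, 0), ("a", 0, 1), ("a", 1, 0), ("b", 0, 0), ("b", 0, 1)],
    names=["k1", "k2", "row"],
)
df = pd.DataFrame(
    {
        "B": [1.0, 3.0, 5.0, 2.0, 4.0],
        "C": [10.0, 20.0, 30.0, 40.0, 50.0],
        "D": [1.0, 1.0, 1.0, 1.0, 1.0],
    },
    index=index,
)


def make_aggfunc(other):
    """Return an aggregation function that closes over the Series `other`."""

    def aggfunc(x):
        # x is one column of one group; its index labels select the
        # matching rows of the captured column.
        return x.sum() + other.loc[x.index].mean()

    return aggfunc


agg_with_b = make_aggfunc(df["B"])
result = (
    df[["C", "D"]]
    .groupby(level=[0, 1])
    .agg({"C": agg_with_b, "D": agg_with_b})
)
print(result)
```

The closure keeps the call to agg unchanged, at the cost of a label lookup into 'B' for every group and every aggregated column.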

Example notebook

I've created a Jupyter Notebook to generate example data. In the example, you can see the columns serial and turn, which form the MultiIndex, and the column milage, which is the independent column that I need in the aggregation function in addition to each of the columns m1 to m4. So in the function I need m<n> (whichever is currently being processed) plus milage. Since milage is a float value too, I cannot use it as an index.

The notebook can be found here: https://github.com/HWiese1980/public_notebooks/blob/master/example.ipynb

The problem is that the function passed to agg only 'sees' the column currently being processed, not the other columns.
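A minimal sketch of that difference, using hypothetical toy data and two hypothetical helper functions (spy_agg, spy_apply) that simply record what they are given: with a dictionary, agg passes one column of one group as a Series, whereas GroupBy.apply passes the whole group as a DataFrame.

```python
import pandas as pd

# Hypothetical toy frame with a single grouping level.
df = pd.DataFrame(
    {"key": ["a", "a", "b"], "B": [1.0, 2.0, 3.0], "C": [10.0, 20.0, 30.0]}
).set_index("key")

seen_agg = set()
seen_apply = set()


def spy_agg(x):
    # Record the type of object that agg hands over.
    seen_agg.add(type(x).__name__)
    return x.sum()


def spy_apply(x):
    # Record which columns are visible inside apply.
    seen_apply.add(tuple(x.columns))
    return x["C"].sum()


df.groupby(level=0).agg({"C": spy_agg})
df.groupby(level=0).apply(spy_apply)

print(seen_agg)    # types seen by the agg function
print(seen_apply)  # column sets seen by the apply function
```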

It is still possible, but not performant, because the other column has to be looked up again for each group:

import numpy as np
import pandas as pd

np.random.seed(2020)
cols = ["serial", "turn", "milage", "m1", "m2", "m3", "m4"]

serials = ["11111", "11222", "12345"]

# collect one record (dict) per measurement point
data = []
end = 0.0
for s in serials:
    for t in range(np.random.randint(6)):
        start = end + np.random.rand() * 1000.
        end = start + np.random.rand() * 1000.
        run_point_count = np.random.randint(high=10, low=5)
        milages = np.linspace(start, end, run_point_count)
        for entry in range(run_point_count):
            d = np.hstack((np.array([s, t]), [milages[entry]], np.random.rand(4)))
            data.append(dict(zip(cols, d)))

# DataFrame.append was removed in pandas 2.0, so build the frame directly from the records
df_out = pd.DataFrame(data, columns=cols).set_index(["serial", "turn"])
# hstack turned every value into a string, so convert the data columns back to float
df_out = df_out.astype(float)
#print (df_out)

def aggfunc(x):
    return x.sum() + df_out.loc[x.index, "milage"].mean()

# aggfunc looks rows up by label, so the MultiIndex must be unique:
# append a per-group counter as a third index level
df_out = df_out.set_index(df_out.groupby(level=[0, 1]).cumcount(), append=True)
df = (df_out.groupby(level=[0, 1])
           .agg({'m1': aggfunc, 'm2': aggfunc, 'm3': aggfunc, 'm4': aggfunc}))
print (df)
                      m1           m2           m3           m4
serial turn                                                    
12345  0      735.612167   734.425345   733.988098   736.534878
       1     1763.739719  1762.587273  1763.196721  1763.929828
       2     2582.773092  2583.585509  2582.582403  2582.121202

A second solution is to convert the column to an index level (a float index):

def aggfunc(x):
    # milage is now an index level, so select it by name instead of by position
    return x.sum() + np.mean(x.index.get_level_values("milage"))


df = (df_out.set_index('milage', append=True)
           .groupby(level=[0, 1])
           .agg({'m1': aggfunc, 'm2': aggfunc, 'm3': aggfunc, 'm4': aggfunc}))
print (df)
                      m1           m2           m3           m4
serial turn                                                    
12345  0      735.612167   734.425345   733.988098   736.534878
       1     1763.739719  1762.587273  1763.196721  1763.929828
       2     2582.773092  2583.585509  2582.582403  2582.121202

EDIT:

If the function can work with all columns of the DataFrame at once, use GroupBy.apply:

def f(x):
    return x[['m1','m2','m3','m4']].sum() + x['milage'].mean()

df = df_out.groupby(level=[0, 1]).apply(f)
print (df)
                      m1           m2           m3           m4
serial turn                                                    
12345  0      735.612167   734.425345   733.988098   736.534878
       1     1763.739719  1762.587273  1763.196721  1763.929828
       2     2582.773092  2583.585509  2582.582403  2582.121202
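For this particular aggregation (column sum plus the group mean of milage), the two parts can also be computed separately with built-in group reductions and then added, which avoids calling a Python function per group. The following is only a sketch on a hypothetical toy frame (df_toy) shaped like the example data, not a general replacement for apply:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame shaped like the example data.
index = pd.MultiIndex.from_tuples(
    [("12345", 0), ("12345", 0), ("12345", 1), ("12345", 1), ("12345", 1)],
    names=["serial", "turn"],
)
df_toy = pd.DataFrame(
    {
        "milage": [10.0, 20.0, 30.0, 40.0, 50.0],
        "m1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "m2": [0.5, 0.5, 0.5, 0.5, 0.5],
    },
    index=index,
)

grouped = df_toy.groupby(level=[0, 1])

# Built-in reductions: per-group column sums and per-group mean of milage.
sums = grouped[["m1", "m2"]].sum()
mean_milage = grouped["milage"].mean()

# Add the group mean to every summed column, aligned on the group index.
vectorized = sums.add(mean_milage, axis=0)


# Same computation through GroupBy.apply, for comparison.
def f(x):
    return x[["m1", "m2"]].sum() + x["milage"].mean()


via_apply = grouped.apply(f)

print(vectorized)
print(np.allclose(vectorized.to_numpy(), via_apply.to_numpy()))
```

This only works because the aggregation splits into independent per-column and per-group parts; a function that genuinely mixes the two columns row by row would still need apply or a closure.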
