Pandas.DataFrame.GroupBy.agg, independent column needed in aggregation function. How to get it into agg?
I have a Pandas DataFrame object with a two-level MultiIndex. It also contains a number of additional columns (e.g. 'A', 'B', 'C', 'D', 'E'). I want to run an aggregation function for each MultiIndex group on each individual column from a subset of the available columns (say, 'C', 'D', 'E'). To do this I select only that subset of columns, use GroupBy to group the sliced data frame by level=[0, 1], and call agg with a dictionary that configures the aggregation function for each of the selected columns:
df[['C', 'D', 'E']].groupby(level=[0, 1]).agg({'C': aggfunc, 'D': aggfunc, 'E': aggfunc})
My problem now is that, in addition to the column currently being aggregated and handed into the aggregation function, I need a second column, e.g. 'B', inside the aggregation function. So it is basically an aggregation of two columns: one of ['C', 'D', 'E'] plus 'B'.
What I could do is replace aggfunc with a closure that knows 'B'. Is that the only way? Or is there a way to tell Pandas to also hand 'B' into the aggregation function in addition to 'C', 'D', 'E'?
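For reference, the closure idea mentioned in the question could look roughly like the sketch below. The toy data, the helper name make_aggfunc and the choice of "sum plus mean of B" are illustrative assumptions, not taken from the notebook. The sketch relies on the rows having a unique index (here obtained with reset_index), so that the index of each group can be used to pick the matching rows of 'B'.

```python
import pandas as pd

# Hypothetical toy data: a two-level MultiIndex plus columns B, C, D.
idx = pd.MultiIndex.from_tuples(
    [("a", 0), ("a", 0), ("a", 1), ("b", 0), ("b", 0)],
    names=["serial", "turn"],
)
df = pd.DataFrame(
    {
        "B": [1.0, 3.0, 5.0, 2.0, 4.0],
        "C": [10.0, 20.0, 30.0, 40.0, 50.0],
        "D": [1.0, 1.0, 1.0, 1.0, 1.0],
    },
    index=idx,
)


def make_aggfunc(other):
    """Return an aggregation function that can also see `other`.

    `other` must be a Series with the same unique index as the grouped frame.
    """
    def aggfunc(x):
        # x holds one column of one group; x.index selects the same rows in `other`.
        return x.sum() + other.loc[x.index].mean()
    return aggfunc


# Move the MultiIndex into columns so that every row gets a unique integer label.
flat = df.reset_index()
agg_with_b = make_aggfunc(flat["B"])
result = (flat.groupby(["serial", "turn"])[["C", "D"]]
              .agg({"C": agg_with_b, "D": agg_with_b}))
print(result)
```

The lookup other.loc[x.index] runs once per group and per column, so this is convenient rather than fast.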
I've created a Jupyter Notebook to generate example data. In the example, the columns serial and turn form the MultiIndex, and the column milage is the independent column that I need in the aggregation function in addition to each of the columns m1 to m4. So in the function I need m<n> (whichever is currently being processed) plus milage. Since milage is a float value too, I cannot use it as an index.
The notebook can be found here: https://github.com/HWiese1980/public_notebooks/blob/master/example.ipynb
The problem is that the function passed to agg only 'sees' the column it is currently processing, not any other columns.
So it is possible, but not performant, because the extra column has to be filtered for every group:
import numpy as np
import pandas as pd

np.random.seed(2020)
cols = ["serial", "turn", "milage", "m1", "m2", "m3", "m4"]
serials = ["11111", "11222", "12345"]
data = []
end = 0.0
for s in serials:
    for t in range(np.random.randint(6)):
        start = end + np.random.rand() * 1000.
        end = start + np.random.rand() * 1000.
        run_point_count = np.random.randint(high=10, low=5)
        milages = np.linspace(start, end, run_point_count)
        for entry in range(run_point_count):
            d = np.hstack((np.array([s, t]), [milages[entry]], np.random.rand(4)))
            data.append(dict(zip(cols, d)))

# build the frame in one step (DataFrame.append was removed in pandas 2.0)
df_out = pd.DataFrame(data, columns=cols).set_index(["serial", "turn"])
df_out = df_out.astype(float)

def aggfunc(x):
    # x is one column of one group; look up the matching milage rows by index
    return x.sum() + df_out.loc[x.index, "milage"].mean()

# a unique MultiIndex is needed so that x.index selects exactly the rows of the group
df_out = df_out.set_index(df_out.groupby(level=[0, 1]).cumcount(), append=True)
df = (df_out.groupby(level=[0, 1])
            .agg({'m1': aggfunc, 'm2': aggfunc, 'm3': aggfunc, 'm4': aggfunc}))
print(df)
                      m1           m2           m3           m4
serial turn
12345  0      735.612167   734.425345   733.988098   736.534878
       1     1763.739719  1762.587273  1763.196721  1763.929828
       2     2582.773092  2583.585509  2582.582403  2582.121202
The second solution is to convert the milage column into an additional (float) index level, so that it is carried along with every column:
def aggfunc(x):
    # milage is now an index level, so it is available on every column's index
    return x.sum() + np.mean(x.index.get_level_values('milage'))

df = (df_out.set_index('milage', append=True)
            .groupby(level=[0, 1])
            .agg({'m1': aggfunc, 'm2': aggfunc, 'm3': aggfunc, 'm4': aggfunc}))
print(df)
                      m1           m2           m3           m4
serial turn
12345  0      735.612167   734.425345   733.988098   736.534878
       1     1763.739719  1762.587273  1763.196721  1763.929828
       2     2582.773092  2583.585509  2582.582403  2582.121202
EDIT:
If you can use a function that works with all columns of the DataFrame at once, use GroupBy.apply:
def f(x):
    # x is the whole sub-DataFrame of one group, so milage is directly available
    return x[['m1','m2','m3','m4']].sum() + x['milage'].mean()

df = df_out.groupby(level=[0, 1]).apply(f)
print(df)
                      m1           m2           m3           m4
serial turn
12345  0      735.612167   734.425345   733.988098   736.534878
       1     1763.739719  1762.587273  1763.196721  1763.929828
       2     2582.773092  2583.585509  2582.582403  2582.121202
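The example function used in this answer (sum of the column plus the group mean of milage) happens to split into a per-column part and a milage part. In that special case, and only under that assumption, the same numbers can be computed without any custom Python function by combining two built-in aggregations. The toy frame and the name value_cols below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same layout as df_out (two index levels, milage, m columns).
idx = pd.MultiIndex.from_tuples(
    [("a", 0), ("a", 0), ("b", 1)], names=["serial", "turn"]
)
df_out = pd.DataFrame(
    {"milage": [10.0, 20.0, 30.0], "m1": [1.0, 2.0, 3.0], "m2": [4.0, 5.0, 6.0]},
    index=idx,
)

value_cols = ["m1", "m2"]
g = df_out.groupby(level=[0, 1])

# Per-group sums of the m columns, plus the per-group mean of milage
# added to every column (axis=0 aligns the Series on the group index).
result = g[value_cols].sum().add(g["milage"].mean(), axis=0)
print(result)
```

This does not generalize to aggregations that combine milage and m<n> row by row; for those, GroupBy.apply or one of the approaches above is still needed.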