I am trying to do a custom aggregation (along with several other standard aggregations).
Something like this:
import pandas as pd

df = pd.DataFrame(
    [["red", 1, 10], ["red", 2, 20], ["green", 5, 15]],
    columns=["color", "x", "y"]
)
df2 = (
    df
    .groupby(["color"])
    .agg(amt1=("x", "sum"),
         amt2=("x", "mean"),
         amt3=("y", "sum"),
         # this does not work...
         amt4= (0.9 * (x.sum() - y.mean()) / x.max()) + 1
        )
)
df2
Thanks for any help.
I don't think it is possible to use two columns directly in a custom function with agg, so you have two choices here. Either use apply for this specific custom function and concat the result with the agg output for the others, or use index-based selection.
# option 1
gr = df.groupby(["color"])
df2 = pd.concat(
    [
        gr.agg(amt1=("x", "sum"), amt2=("x", "mean"), amt3=("y", "sum")),
        # dfg is the sub-DataFrame of one color group
        gr.apply(
            lambda dfg: (0.9 * (dfg.x.sum() - dfg.y.mean()) / dfg.x.max()) + 1
        ).rename("amt4"),
    ],
    axis=1,
)
# option 2
df2 = (
    df.groupby(["color"])
    .aggregate(
        amt1=("x", "sum"), amt2=("x", "mean"), amt3=("y", "sum"),
        # x.index selects the rows of y that belong to this group
        amt4=("x", lambda x: (0.9 * (x.sum() - df.loc[x.index, "y"].mean())
                              / x.max()) + 1),
    )
)
Both give the same result as long as the index values are unique in df.
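As a rough illustration of why the uniqueness matters for option 2: with a repeated index label, the label-based lookup df.loc[x.index, 'y'] can pull in rows from another group. The duplicated index below is made up for the example, and the names dup, green_x, looked_up_y, true_y and fixed are only illustrative.

```python
import pandas as pd

# Same data as in the question, but with a hypothetical non-unique index:
# label 0 is shared by a "red" row and the "green" row.
dup = pd.DataFrame(
    [["red", 1, 10], ["red", 2, 20], ["green", 5, 15]],
    columns=["color", "x", "y"],
    index=[0, 1, 0],
)

# The "x" values of the green group, as the aggregation function would see them
green_x = dup.loc[dup["color"] == "green", "x"]

# Label-based lookup returns every row labelled 0, including the red one
looked_up_y = dup.loc[green_x.index, "y"]

# The y values that actually belong to the green group
true_y = dup.loc[dup["color"] == "green", "y"]

print(looked_up_y.mean(), true_y.mean())

# One possible remedy: give the frame a fresh unique index before grouping
fixed = dup.reset_index(drop=True)
```

If the index is not unique, calling reset_index(drop=True) before the groupby should be enough, or option 1 can be used, since it does not rely on index labels.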
To use option 2 in newer pandas versions, a regular named function is needed instead of a lambda (see the bug description):
def named_lambda(x):
    # x is the "x" column of one group; x.index selects the matching rows of y
    return (0.9 * (x.sum() - df.loc[x.index, 'y'].mean()) / x.max()) + 1

df2 = (
    df.groupby(["color"])
    .aggregate(amt1=("x", "sum"), amt2=("x", "mean"), amt3=("y", "sum"),
               amt4=('x', named_lambda))
)
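A quick sanity check, as a sketch rather than a definitive implementation, that the two approaches agree on the sample data. The helper names amt4_from_group and amt4_from_x are mine, and selecting [["x", "y"]] before apply is my own choice to keep the grouping column out of the applied frame.

```python
import pandas as pd

df = pd.DataFrame(
    [["red", 1, 10], ["red", 2, 20], ["green", 5, 15]],
    columns=["color", "x", "y"],
)

def amt4_from_group(dfg):
    # dfg is the sub-DataFrame (columns x and y) of one color group
    return 0.9 * (dfg["x"].sum() - dfg["y"].mean()) / dfg["x"].max() + 1

def amt4_from_x(x):
    # x is the "x" Series of one group; fetch the matching y rows by index label
    return 0.9 * (x.sum() - df.loc[x.index, "y"].mean()) / x.max() + 1

gr = df.groupby("color")

# Approach 1: apply on the whole group, one scalar per group
via_apply = gr[["x", "y"]].apply(amt4_from_group).rename("amt4")

# Approach 2: named aggregation on "x" with an index-based lookup of "y"
via_agg = gr.agg(amt4=("x", amt4_from_x))["amt4"]

print(pd.concat([via_apply, via_agg.rename("amt4_agg")], axis=1))
```

On this data both columns should come out at about -0.8 for green and -4.4 for red.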