R group_by %>% summarise equivalent in pandas
I'm trying to rewrite some code from R to Python.
My df is something like:
import numpy as np
import pandas as pd

size = 20
np.random.seed(456)
df = pd.DataFrame({"names": np.random.choice(["bob", "alb", "jr"], size=size, replace=True),
                   "income": np.random.normal(size=size, loc=1000, scale=100),
                   "costs": np.random.normal(size=size, loc=500, scale=100),
                   "date": np.random.choice(pd.date_range("2018-01-01", "2018-01-06"),
                                            size=size, replace=True)
                   })
Now I need to group the df by name and then perform some summarize operations.
In R, with dplyr, I'm doing:
dfg <- group_by(df, names) %>%
  summarise(
    income.acc = sum(income),
    costs.acc = sum(costs),
    net = sum(income) - sum(costs),
    income.acc.bymax = sum(income[date==max(date)]),
    cost.acc.bymax = sum(costs[date==max(date)]),
    growth = income.acc.bymax + cost.acc.bymax - net
  )
Please note that I'm just trying to illustrate my data; it doesn't mean anything.
How can I achieve the same result using pandas?
I'm having a hard time because df.groupby().agg() is very limited!
Using R I get:
> print(dfg)
# A tibble: 3 x 7
names income.acc costs.acc net income.acc.bymax cost.acc.bymax growth
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 alb 7997 3996 4001 2998 1501 497
2 bob 6003 3004 3000 2002 1002 3.74
3 jr 6002 3000 3002 1000 499 -1503
Using @Jezrael's answer, I get:
income_acc costs_acc net income_acc_bymax \
names
alb 7997.466538 3996.053670 4001.412868 2997.855009
bob 6003.488978 3003.540598 2999.948380 2001.533870
jr 6002.056904 3000.346010 3001.710894 999.833162
cost_acc_bymax growth
names
alb 1500.876851 497.318992
bob 1002.151162 3.736652
jr 499.328510 -1502.549221
I think you need a custom function:
def f(x):
    income_acc = x.income.sum()
    costs_acc = x.costs.sum()
    net = income_acc - costs_acc
    income_acc_bymax = x.loc[x.date == x.date.max(), 'income'].sum()
    cost_acc_bymax = x.loc[x.date == x.date.max(), 'costs'].sum()
    growth = income_acc_bymax + cost_acc_bymax - net
    c = ['income_acc','costs_acc','net','income_acc_bymax','cost_acc_bymax','growth']
    return pd.Series([income_acc, costs_acc, net, income_acc_bymax, cost_acc_bymax, growth],
                     index=c)

df1 = df.groupby('names').apply(f)
print(df1)
income_acc costs_acc net income_acc_bymax \
names
alb 7746.653816 3605.367002 4141.286814 2785.500946
bob 6348.897809 3354.059777 2994.838032 2153.386953
jr 6205.690386 3034.601030 3171.089356 983.316234
cost_acc_bymax growth
names
alb 1587.685103 231.899235
bob 1215.116245 373.665167
jr 432.851030 -1754.922093
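If you would rather not call a Python function once per group, here is a rough sketch of another way to get the same columns. It is not part of the answer above, and it assumes pandas 0.25 or later for named aggregation. The small frame and the names `totals`, `is_latest`, `latest` and `out` are made up for illustration; the idea is to sum everything per name, sum again over only the rows on each name's latest date, then derive `net` and `growth`.

```python
import pandas as pd

# Small made-up frame with the same columns as the question's df
df = pd.DataFrame({
    "names": ["alb", "alb", "alb", "bob", "bob"],
    "income": [100.0, 200.0, 300.0, 400.0, 500.0],
    "costs": [10.0, 20.0, 30.0, 40.0, 50.0],
    "date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-02",
                            "2018-01-03", "2018-01-01"]),
})

# Per-name totals using named aggregation
totals = df.groupby("names").agg(
    income_acc=("income", "sum"),
    costs_acc=("costs", "sum"),
)

# Boolean mask: True for rows that fall on their own group's latest date
is_latest = df["date"] == df.groupby("names")["date"].transform("max")

# Per-name sums restricted to those latest-date rows
latest = df[is_latest].groupby("names").agg(
    income_acc_bymax=("income", "sum"),
    cost_acc_bymax=("costs", "sum"),
)

# Combine and derive the remaining columns; later assign() arguments
# can refer to columns created by earlier ones
out = totals.join(latest).assign(
    net=lambda d: d["income_acc"] - d["costs_acc"],
    growth=lambda d: d["income_acc_bymax"] + d["cost_acc_bymax"] - d["net"],
)
print(out)
```

The columns come out in a different order than in the `apply` version (`net` and `growth` are last), so reorder with `out[[...]]` if that matters.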
Now you can do it with datar in the same way as you did in R:
>>> from datar.all import f, group_by, summarise, sum, max
>>>
>>> dfg = group_by(df, f.names) >> summarise(
... income_acc = sum(f.income),
... costs_acc = sum(f.costs),
... net = sum(f.income) - sum(f.costs),
... income_acc_bymax = sum(f.income[f.date==max(f.date)]),
... cost_acc_bymax = sum(f.costs[f.date==max(f.date)]),
... growth = f.income_acc_bymax + f.cost_acc_bymax - f.net
... )
>>> dfg
names income_acc costs_acc net income_acc_bymax cost_acc_bymax growth
<object> <float64> <float64> <float64> <float64> <float64> <float64>
0 alb 7746.653816 3605.367002 4141.286814 2785.500946 1587.685103 231.899235
1 bob 6348.897809 3354.059777 2994.838032 2153.386953 1215.116245 373.665167
2 jr 6205.690386 3034.601030 3171.089356 983.316234 432.851030 -1754.922093
I am the author of the package. Feel free to submit issues if you have any questions.