R group_by %>% summarise equivalent in pandas

I'm trying to rewrite some code from R to Python.

My df looks something like this:

import numpy as np
import pandas as pd

size = 20
np.random.seed(456)
df = pd.DataFrame({"names": np.random.choice(["bob", "alb", "jr"], size=size, replace=True),
                   "income": np.random.normal(size=size, loc=1000, scale=100),
                   "costs": np.random.normal(size=size, loc=500, scale=100),
                   "date": np.random.choice(pd.date_range("2018-01-01", "2018-01-06"),
                                            size=size, replace=True)
                   })

Now I need to group the df by name and then perform some summary operations.

In R, with dplyr, I'm doing:

dfg <- group_by(df, names) %>%
    summarise(
        income.acc = sum(income),
        costs.acc = sum(costs),
        net = sum(income) - sum(costs),
        income.acc.bymax = sum(income[date == max(date)]),
        cost.acc.bymax = sum(costs[date == max(date)]),
        growth = income.acc.bymax + cost.acc.bymax - net
    )

Please note that I'm just trying to illustrate my data; the calculations themselves don't mean anything.

How can I achieve the same result using pandas?

I'm having a hard time because df.groupby().agg() seems very limited!


Using R, I get:

> print(dfg)
# A tibble: 3 x 7
  names income.acc costs.acc   net income.acc.bymax cost.acc.bymax   growth
  <chr>      <dbl>     <dbl> <dbl>            <dbl>          <dbl>    <dbl>
1 alb         7997      3996  4001             2998           1501   497   
2 bob         6003      3004  3000             2002           1002     3.74
3 jr          6002      3000  3002             1000            499 -1503  

Using @Jezrael's answer, I get:

         income_acc    costs_acc          net  income_acc_bymax  \
names                                                            
alb    7997.466538  3996.053670  4001.412868       2997.855009   
bob    6003.488978  3003.540598  2999.948380       2001.533870   
jr     6002.056904  3000.346010  3001.710894        999.833162   

       cost_acc_bymax       growth  
names                               
alb       1500.876851   497.318992  
bob       1002.151162     3.736652  
jr         499.328510 -1502.549221 

I think you need a custom function:

def f(x):
    # x is the sub-DataFrame for one value of 'names'
    income_acc = x.income.sum()
    costs_acc = x.costs.sum()
    net = income_acc - costs_acc
    # rows that fall on the latest date within this group
    is_max_date = x.date == x.date.max()
    income_acc_bymax = x.loc[is_max_date, 'income'].sum()
    cost_acc_bymax = x.loc[is_max_date, 'costs'].sum()
    growth = income_acc_bymax + cost_acc_bymax - net
    c = ['income_acc', 'costs_acc', 'net', 'income_acc_bymax', 'cost_acc_bymax', 'growth']
    # returning a Series turns each entry into a column of the result
    return pd.Series([income_acc, costs_acc, net, income_acc_bymax, cost_acc_bymax, growth],
                     index=c)

df1 = df.groupby('names').apply(f)
print(df1)
        income_acc    costs_acc          net  income_acc_bymax  \
names                                                            
alb    7746.653816  3605.367002  4141.286814       2785.500946   
bob    6348.897809  3354.059777  2994.838032       2153.386953   
jr     6205.690386  3034.601030  3171.089356        983.316234   

       cost_acc_bymax       growth  
names                               
alb       1587.685103   231.899235  
bob       1215.116245   373.665167  
jr         432.851030 -1754.922093  
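
If you would rather avoid apply, the same summary can probably be built from a per-group transform plus named aggregation. This is only a sketch under a few assumptions: it rebuilds the df from the question, and the names on_max_date, income_on_max and costs_on_max are made up here for illustration. Rows that are not on their group's latest date are replaced by 0 before summing, which mirrors sum(income[date == max(date)]) in the dplyr code.

```python
import numpy as np
import pandas as pd

# Same toy data as in the question.
size = 20
np.random.seed(456)
df = pd.DataFrame({
    "names": np.random.choice(["bob", "alb", "jr"], size=size, replace=True),
    "income": np.random.normal(size=size, loc=1000, scale=100),
    "costs": np.random.normal(size=size, loc=500, scale=100),
    "date": np.random.choice(pd.date_range("2018-01-01", "2018-01-06"),
                             size=size, replace=True),
})

# True for rows that fall on the latest date within their own group
# (on_max_date is an illustrative name, not from the original answer).
on_max_date = df["date"].eq(df.groupby("names")["date"].transform("max"))

cols = ["income_acc", "costs_acc", "net",
        "income_acc_bymax", "cost_acc_bymax", "growth"]

out = (
    df.assign(
        # keep the value on the group's latest date, 0 elsewhere
        income_on_max=df["income"].where(on_max_date, 0.0),
        costs_on_max=df["costs"].where(on_max_date, 0.0),
    )
    .groupby("names")
    .agg(
        income_acc=("income", "sum"),
        costs_acc=("costs", "sum"),
        income_acc_bymax=("income_on_max", "sum"),
        cost_acc_bymax=("costs_on_max", "sum"),
    )
    # derived columns are computed from the aggregated ones
    .assign(
        net=lambda d: d["income_acc"] - d["costs_acc"],
        growth=lambda d: d["income_acc_bymax"] + d["cost_acc_bymax"] - d["net"],
    )
    .loc[:, cols]  # same column order as the dplyr output
)
print(out)
```

Whether this is nicer than the custom function is a matter of taste; it keeps everything vectorised, at the cost of two temporary columns.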

Now you can do it with datar in the same way as you did in R:

>>> from datar.all import f, group_by, summarise, sum, max
>>> 
>>> dfg = group_by(df, f.names) >> summarise(
...     income_acc = sum(f.income),
...     costs_acc = sum(f.costs),
...     net = sum(f.income) - sum(f.costs),
...     income_acc_bymax = sum(f.income[f.date==max(f.date)]),
...     cost_acc_bymax = sum(f.costs[f.date==max(f.date)]),
...     growth =  f.income_acc_bymax + f.cost_acc_bymax - f.net
... )
>>> dfg
     names   income_acc    costs_acc          net  income_acc_bymax  cost_acc_bymax       growth
  <object>    <float64>    <float64>    <float64>         <float64>       <float64>    <float64>
0      alb  7746.653816  3605.367002  4141.286814       2785.500946     1587.685103   231.899235
1      bob  6348.897809  3354.059777  2994.838032       2153.386953     1215.116245   373.665167
2       jr  6205.690386  3034.601030  3171.089356        983.316234      432.851030 -1754.922093

I am the author of the package. Feel free to submit issues if you have any questions.
