Sum by distinct column value in R
I have a very large data frame in R and would like to sum two columns for every distinct combination of values in the other columns. For example, say we have a data frame of transactions in various shops over a day, as follows:
shop <- data.frame('shop_id'   = c(1, 1, 1, 2, 3, 3),
                   'shop_name' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'),
                   'city'      = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'),
                   'sale'      = c(12, 5, 9, 15, 10, 18),
                   'profit'    = c(3, 1, 3, 6, 5, 9))
which is:
shop_id shop_name city sale profit
1 Shop A London 12 3
1 Shop A London 5 1
1 Shop A London 9 3
2 Shop B Cardiff 15 6
3 Shop C Dublin 10 5
3 Shop C Dublin 18 9
And I'd want to sum the sale and profit for each shop, to give:
shop_id shop_name city sale profit
1 Shop A London 26 7
2 Shop B Cardiff 15 6
3 Shop C Dublin 28 14
I am currently using the following code to do this:
shop_day <- ddply(shop, "shop_id", transform, sale = sum(sale), profit = sum(profit))
shop_day <- subset(shop_day, !duplicated(shop_id))
which works absolutely fine, but as I said my data frame is large (140,000 rows, 37 columns, and nearly 100,000 unique rows that I want to sum over), and the code takes ages to run before eventually reporting that it has run out of memory.
Does anyone know of the most efficient way to do this?
Thanks in advance!
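For comparison, the same aggregation can also be written with base R's `aggregate()` and a formula interface. This is a sketch using the example data from the question; it gives the right answer on the small example, but it is not necessarily faster on a large data frame:

```r
# The example data frame from the question.
shop <- data.frame(shop_id   = c(1, 1, 1, 2, 3, 3),
                   shop_name = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'),
                   city      = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'),
                   sale      = c(12, 5, 9, 15, 10, 18),
                   profit    = c(3, 1, 3, 6, 5, 9))

# aggregate() sums each left-hand-side column within every distinct
# combination of the grouping columns on the right-hand side.
shop_totals <- aggregate(cbind(sale, profit) ~ shop_id + shop_name + city,
                         data = shop, FUN = sum)
```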
**Obligatory data.table answer**
> library(data.table)
data.table 1.8.0 For help type: help("data.table")
> shop.dt <- data.table(shop)
> shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id']
shop_id sale profit
[1,] 1 26 7
[2,] 2 15 6
[3,] 3 28 14
>
Which is all fine and good until things get bigger...
shop <- data.frame(shop_id = letters[1:10], profit=rnorm(1e7), sale=rnorm(1e7))
shop.dt <- data.table(shop)
> system.time(ddply(shop, .(shop_id), summarise, sale=sum(sale), profit=sum(profit)))
user system elapsed
4.156 1.324 5.514
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
user system elapsed
0.728 0.108 0.840
>
You get additional speed increases if you create the data.table with a key:
shop.dt <- data.table(shop, key='shop_id')
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
user system elapsed
0.252 0.084 0.336
>
Here's how to use base R to speed up operations like this:
idx <- split(1:nrow(shop), shop$shop_id)
a2 <- data.frame(shop_id = sapply(idx, function(i) shop$shop_id[i[1]]),
                 sale    = sapply(idx, function(i) sum(shop$sale[i])),
                 profit  = sapply(idx, function(i) sum(shop$profit[i])))
On my system, this reduces the time to 0.75 sec, versus 5.70 sec for the ddply summarise version.
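If only the numeric columns need summing, base R's `rowsum()` may be faster still, since it computes the grouped sums in compiled code. This is an additional sketch, not part of the timings above:

```r
shop <- data.frame(shop_id = c(1, 1, 1, 2, 3, 3),
                   sale    = c(12, 5, 9, 15, 10, 18),
                   profit  = c(3, 1, 3, 6, 5, 9))

# rowsum() sums the rows of a numeric data frame within groups;
# the group labels become the row names of the result.
totals <- rowsum(shop[, c("sale", "profit")], group = shop$shop_id)
```

Any other grouping columns (shop_name, city here) would need to be joined back afterwards, as in the split()-based version.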
I think the neatest way to do this is with dplyr:
library(dplyr)
shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_all(sum)
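In dplyr 1.0.0 and later, `summarise_all()` is superseded by `across()`. An equivalent sketch, assuming that version of dplyr is available:

```r
library(dplyr)

# The example data frame from the question.
shop <- data.frame(shop_id   = c(1, 1, 1, 2, 3, 3),
                   shop_name = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'),
                   city      = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'),
                   sale      = c(12, 5, 9, 15, 10, 18),
                   profit    = c(3, 1, 3, 6, 5, 9))

# across(everything(), sum) sums every non-grouping column;
# .groups = "drop" returns an ungrouped result.
shop_day <- shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise(across(everything(), sum), .groups = "drop")
```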
Just in case you have a long list of columns, use summarise_if() with a predicate that selects them:
library(dplyr)

shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_if(is.numeric, sum)  # sale and profit are doubles, so test is.numeric, not is.integer