
R data.table group by multiple columns into 1 column and sum

I have the following data.table:

> dt = data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"), sales_amt = c(500,600,700,800), cost_ccy = c("GBP","USD","GBP","USD"), cost_amt = c(-100,-200,-300,-400))
> dt
   sales_ccy sales_amt cost_ccy cost_amt
1:       USD       500      GBP     -100
2:       EUR       600      USD     -200
3:       GBP       700      GBP     -300
4:       USD       800      USD     -400

My aim is to get the following data.table:

> dt
   ccy total_amt
1: EUR       600
2: GBP       300
3: USD       700

Basically, I want to sum all costs and sales together by currency. In reality, this data.table has >500,000 rows, so I want a fast and efficient way to sum the amounts.

Any ideas for a fast way to do this?

Using data.table v1.9.6+, which has an improved version of melt that can melt into multiple columns simultaneously:

require(data.table) # v1.9.6+
melt(dt, measure = patterns("_ccy$", "_amt$")
    )[, .(tot_amt = sum(value2)), keyby = .(ccy=value1)]
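For clarity, here is a minimal sketch of what the intermediate step produces, assuming the default value1/value2 column names that melt assigns when measure is a list of patterns:

```r
library(data.table)  # v1.9.6+

dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"),
                 sales_amt = c(500, 600, 700, 800),
                 cost_ccy  = c("GBP", "USD", "GBP", "USD"),
                 cost_amt  = c(-100, -200, -300, -400))

# patterns() pairs up the *_ccy and *_amt columns, so each original row
# contributes two molten rows: value1 = currency, value2 = amount
m <- melt(dt, measure = patterns("_ccy$", "_amt$"))
nrow(m)  # 8 rows: 4 sales + 4 costs

# aggregate the amounts by currency; keyby also sorts the result
res <- m[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
res
#    ccy tot_amt
# 1: EUR     600
# 2: GBP     300
# 3: USD     700
```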

You can consider merged.stack from my "splitstackshape" package.

Here, I've also used "dplyr" for piping, but you can skip that if you prefer.

library(dplyr)
library(splitstackshape)

dt %>%
  mutate(id = 1:nrow(dt)) %>%
  merged.stack(var.stubs = c("ccy", "amt"), sep = "var.stubs", atStart = FALSE) %>%
  .[, .(total_amt = sum(amt)), by = ccy]
#    ccy total_amt
# 1: GBP       300
# 2: USD       700
# 3: EUR       600

The development version of "data.table" should be able to handle melting groups of columns. It's also faster than merged.stack.

Even dirtier than @Pgibas's solution:

dt[,
   list(c(sales_ccy, cost_ccy), c(sum(sales_amt), sum(cost_amt))), # this creates two new columns with ccy and amt
   by=list(sales_ccy, cost_ccy)  # number of rows reduced to unique sales_ccy/cost_ccy combinations
  ][,
    sum(V2), # this will aggregate the new columns
    by=V1
    ]
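To see why this is fast, here is a sketch of what the first [...] produces with the question's sample data (the result columns are renamed here for readability; the original keeps the auto-generated names V1/V2): the grouping collapses the table to unique currency pairs before the long table is built.

```r
library(data.table)

dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"),
                 sales_amt = c(500, 600, 700, 800),
                 cost_ccy  = c("GBP", "USD", "GBP", "USD"),
                 cost_amt  = c(-100, -200, -300, -400))

# step 1: one (currency, summed amount) row for each side of every
# unique sales_ccy/cost_ccy combination -- two rows per pair
step1 <- dt[, list(c(sales_ccy, cost_ccy), c(sum(sales_amt), sum(cost_amt))),
            by = list(sales_ccy, cost_ccy)]
nrow(step1)  # 8: 4 unique pairs x 2 rows each

# step 2: the final aggregation only has to crunch those 8 rows
res <- step1[, .(total_amt = sum(V2)), by = .(ccy = V1)]
res
#    ccy total_amt
# 1: USD       700
# 2: GBP       300
# 3: EUR       600
```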

Benchmark

I ran a couple of tests to check my code against the data.table 1.9.5 solution suggested by Arun.

Just an observation: I generated the 500K+ rows by duplicating the original data.table, which limits the number of unique sales_ccy/cost_ccy pairs, and therefore the number of rows crunched by the second data.table [] (just 8 rows are created in this scenario).

I don't think that in a real-world scenario the number of rows returned by the first step would be anywhere near 500K+ (at most N^2, where N is the number of currencies used), but it is still something to keep in mind when looking at these results.

library(data.table)
library(microbenchmark)

rm(dt)
dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"), sales_amt = c(500,600,700,800), cost_ccy = c("GBP","USD","GBP","USD"), cost_amt = c(-100,-200,-300,-400))
dt


for (i in 1:17) dt <- rbind(dt,dt)

mycode <-function() {
  test1 <- dt[,
              list(c(sales_ccy, cost_ccy),c(sum(sales_amt), sum(cost_amt))), # this will create two new columns with ccy and amt
              keyby=list(sales_ccy, cost_ccy) 
             ][,
                sum(V2), # this will aggregate the new columns
                by=V1
              ]
  rm(test1)
}

suggesteEdit <- function() {

  test2 <- dt[ , .(c(sales_ccy, cost_ccy), c(sales_amt, cost_amt)) # combine cols
   ][, .(tot_amt = sum(V2)), keyby= .(ccy = V1)          # aggregate + reorder
     ]
   rm(test2)
}

meltWithDataTable195 <- function() {
  test3 <- melt(dt, measure = list( c(1,3), c(2,4) ))[, .(tot_amt = sum(value2)), keyby = .(ccy=value1)]
  rm(test3)
}

microbenchmark(
  mycode(),
  suggesteEdit(),
  meltWithDataTable195()
)

Result 结果

Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
               mycode() 12.27895 12.47456 15.04098 12.80956 14.73432 45.26173   100
         suggesteEdit() 25.36581 29.56553 42.52952 33.39229 59.72346 69.74819   100
 meltWithDataTable195() 25.71558 30.97693 47.77700 58.68051 61.23996 66.49597   100

Edit: Another way to do this, using aggregate():

df = data.frame(ccy = c(dt$sales_ccy, dt$cost_ccy), total_amt = c(dt$sales_amt, dt$cost_amt))
out <- aggregate(total_amt ~ ccy, data = df, sum)
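As a quick sanity check (base R only, using the question's sample data), the aggregate() approach produces the desired result, with groups sorted alphabetically:

```r
dt <- data.frame(sales_ccy = c("USD", "EUR", "GBP", "USD"),
                 sales_amt = c(500, 600, 700, 800),
                 cost_ccy  = c("GBP", "USD", "GBP", "USD"),
                 cost_amt  = c(-100, -200, -300, -400))

# stack sales and costs into one long two-column frame
df <- data.frame(ccy = c(dt$sales_ccy, dt$cost_ccy),
                 total_amt = c(dt$sales_amt, dt$cost_amt))

# aggregate() sums total_amt within each ccy group
out <- aggregate(total_amt ~ ccy, data = df, sum)
out
#   ccy total_amt
# 1 EUR       600
# 2 GBP       300
# 3 USD       700
```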

Dirty, but it works:

# Bind costs and sales
df <- rbind(dt[, list(ccy = cost_ccy, total_amt = cost_amt)], 
            dt[, list(ccy = sales_ccy, total_amt = sales_amt)])
# Sum for every currency
df[, sum(total_amt), by = ccy]
   ccy  V1
1: GBP 300
2: USD 700
3: EUR 600
