

data.tables and sweep function

Using a data.table, what would be the fastest way to "sweep" out a statistic across a selection of columns?

Starting with (considerably larger versions of) DT:

library(data.table)

p <- 3   # index of the last target column
DT <- data.table(id=c("A","B","C"),x1=c(10,20,30),x2=c(20,30,10))
DT.totals <- DT[, list(id,total = x1+x2) ]

I'd like to get to the following data.table result by indexing the target columns (2:p), so that the key column is skipped:

     id   x1   x2
[1,]  A 0.33 0.67
[2,]  B 0.40 0.60
[3,]  C 0.75 0.25

I believe that something close to the following (which uses the relatively new set() function) will be quickest:

DT <- data.table(id = c("A","B","C"), x1 = c(10,20,30), x2 = c(20,30,10))
total <- DT[ , x1 + x2]

rr <- seq_len(nrow(DT))
for(j in 2:3) set(DT, rr, j, DT[[j]]/total) 
#      id        x1        x2
# [1,]  A 0.3333333 0.6666667
# [2,]  B 0.4000000 0.6000000
# [3,]  C 0.7500000 0.2500000
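
For a table with more than two target columns, the same loop can be driven by the column index range instead of hard-coded names. The following is only a sketch under assumptions not stated in the question: that every column from 2 to p is numeric, that p is the last column, and that rowSums() is an acceptable way to build the totals.

```r
library(data.table)

DT <- data.table(id = c("A","B","C"), x1 = c(10,20,30), x2 = c(20,30,10))

p    <- ncol(DT)   # assumption: the target columns run through the last column
cols <- 2:p        # skip the id column

# Row totals over the target columns only
total <- rowSums(DT[, cols, with = FALSE])

# Divide each target column by the totals, assigning by reference
for (j in cols) set(DT, i = NULL, j = j, value = DT[[j]] / total)
```

Passing i = NULL tells set() to assign to all rows, so no separate vector of row indices is needed.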

FWIW, calls to set() take the following form:

# set(x, i, j, value), where: 
#     x is a data.table 
#     i contains row indices
#     j contains column indices 
#     value is the value to be assigned into the specified cells
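
To make those four arguments concrete, here is a minimal sketch on a made-up toy table (the table and values are illustrative only, not from the question). It shows a single-cell assignment and a whole-column assignment, both of which modify DT in place.

```r
library(data.table)

# Toy table, invented for illustration
DT <- data.table(id = c("A", "B", "C"), x1 = c(10, 20, 30))

# Overwrite row 2 of column 2 (x1) by reference
set(DT, i = 2L, j = 2L, value = 99)

# j may also be a column name; i = NULL means all rows
set(DT, i = NULL, j = "x1", value = DT[["x1"]] / 10)
```

Note that set() is called for its side effect, so there is no need to assign its result back to DT.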

My suspicion about the relative speed of this approach, compared to other solutions, is based on this passage from data.table's NEWS file, in the section on changes in Version 1.8.0:

 o New function set(DT,i,j,value) allows fast assignment to elements
   of DT. Similar to := but avoids the overhead of [.data.table, so is
   much faster inside a loop. Less flexible than :=, but as flexible
   as matrix subassignment. Similar in spirit to setnames(), setcolorder(),
   setkey() and setattr(); ie, assigns by reference with no copy at all.

       M = matrix(1,nrow=100000,ncol=100)
       DF = as.data.frame(M)
       DT = as.data.table(M)
       system.time(for (i in 1:1000) DF[i,1L] <- i)     # 591.000s
       system.time(for (i in 1:1000) DT[i,V1:=i])       # 1.158s
       system.time(for (i in 1:1000) M[i,1L] <- i)      # 0.016s
       system.time(for (i in 1:1000) set(DT,i,1L,i))    # 0.027s
