通過其他因素對因子進行有效計數或制表，並在data.frame中重新整形？

Question

我在使用data.table時，尋找一種有效的方法來計算向量的所有向量級別的累積和（列表）。

問題

dataframe / data.table DT最初由四個變量組成，一個名為experience 。 目標是一個向量，它包含經驗中的因子水平的累積計數條件兩個其他變量， id和cl 。 值得注意的是，因子經驗具有比數據集中存在的更多因子水平（這是必要的屬性）。

數據看起來像

    id trial experience cl
 1:  1     1       000A  A
 2:  1     2       000A  A
 3:  1     3       000B  B
 4:  1     4       111A  A
 5:  1     5       001B  B
 6:  2     1       100B  B
 7:  2     2       111A  A
 8:  2     3       100B  B
 9:  2     4       010A  A
10:  2     5       011B  B

經驗因素水平為16級

levels(DT$experience)
#  [1] "000A" "001A" "010A" "011A" "100A" "101A" "110A" "111A"
#  [9] "000B" "001B" "010B" "011B" "100B" "101B" "110B" "111B"

我們想要計算的是以id和cl為條件的體驗的累積計數。 考慮前三行：對於id = 1，第一個經驗值是000A，因此計數器變量c000A = 1.第二個經驗值也是000A，因此計數器c000A = 2.但現在第三個經驗值是000B，並且所以前一個計數器c000A保持2，但另一個計數器c000B = 1，之前為0。

遵循這個邏輯，我們想要的結果如下：

    id trial experience cl c000A c001A c010A c011A c100A c101A c110A c111A c000B c001B c010B c011B c100B c101B c110B c111B
 1:  1     1       000A  A     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 2:  1     2       000A  A     2     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 3:  1     3       000B  B     2     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
 4:  1     4       111A  A     2     0     0     0     0     0     0     1     1     0     0     0     0     0     0     0
 5:  1     5       001B  B     2     0     0     0     0     0     0     1     1     1     0     0     0     0     0     0
 6:  2     1       100B  B     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
 7:  2     2       111A  A     0     0     0     0     0     0     0     1     0     0     0     0     1     0     0     0
 8:  2     3       100B  B     0     0     0     0     0     0     0     1     0     0     0     0     2     0     0     0
 9:  2     4       010A  A     0     0     1     0     0     0     0     1     0     0     0     0     2     0     0     0
10:  2     5       011B  B     0     0     1     0     0     0     0     1     0     0     0     1     2     0     0     0

注意：將16個條目c000A，...，c111B分配給不同的列對我來說並不重要。 如果結果是具有16個條目的一個向量作為c000A，c001A，...，c110B，c111B來保持累積計數，那將是完全足夠的。

當前代碼和計算速度

我使用的當前代碼是以下兩步方法。 既不漂亮也不優雅。

foo <- function(DT){
   # tabulate experience for each trial
   # store in an auxiliary variables <s000A, s001A, ..., s110B, s111B>
   DT[, paste(sep="","s",levels(DT$experience)) := as.list(table(experience)), by = c("id","cl","trial")]
   # sum each of the s____ variables by id
   DT[, "c000A" := cumsum(s000A), by = id] # this is clumsy
   DT[, "c001A" := cumsum(s001A), by = id]
   DT[, "c010A" := cumsum(s010A), by = id]
   DT[, "c011A" := cumsum(s011A), by = id]
   DT[, "c100A" := cumsum(s100A), by = id]
   DT[, "c101A" := cumsum(s101A), by = id]
   DT[, "c110A" := cumsum(s110A), by = id]
   DT[, "c111A" := cumsum(s111A), by = id]
   DT[, "c000B" := cumsum(s000B), by = id]
   DT[, "c001B" := cumsum(s001B), by = id]
   DT[, "c010B" := cumsum(s010B), by = id]
   DT[, "c011B" := cumsum(s011B), by = id]
   DT[, "c100B" := cumsum(s100B), by = id]
   DT[, "c101B" := cumsum(s101B), by = id]
   DT[, "c110B" := cumsum(s110B), by = id]
   DT[, "c111B" := cumsum(s111B), by = id]
}

對於具有n = 1e + 4次試驗和2次ID的數據集，此代碼采用：

system.time(foo(DT))
# User  System verstrichen 
# 9.78    0.00       10.05

用於創建此示例的代碼

library("data.table")
library("R.utils")
# Sample dataframe DF with n=1e+4
n <- 1e+4 #to test change this to n=5
DT <- data.table(id = rep(1:2,each=n), trial = rep(1:n,2), experience = c("000A","000A","000B","111A","001B","100B","111A","100B","010A","011B"), cl = c("A","A","B","A","B","B","A","B","A","B")) # experience needs to be a factor w more levels
DT$experience <- factor(DT$experience, levels = paste(sep="", intToBin(0:7), rep(c("A","B"),each=8)))
setkey(DT,id,trial,cl) #set the data.table keys

誰有更快更優雅的解決方案？

謝謝！ 賈納

更新：速度比較：

library("microbenchmark")
benchmk <- microbenchmark(
   DT2  <- foo2(DT),
   DT3a <- foo3a(DT),
   DT3b <- foo3b(DT),
   times=100L
   )
print(benchmk)

# with n=1e+4
#
# unit milliseconds
#              expr      min       lq   median        uq      max neval
# DT2   <- foo2(DT) 46.96745 52.17469 74.72479 120.93339 212.7912   100
# DT3a <- foo3a(DT) 25.21907 26.57921 28.84702  34.89401 121.3164   100
# DT3b <- foo3b(DT) 19.82076 20.80570 22.87369  30.83561 148.0520   100 

# with n=1e+5
#
# unit milliseconds
#              expr       min       lq   median       uq       max neval
#   DT2 <- foo2(DT) 386.93890 445.0184 481.4660 534.9619 1160.6151   100
# DT3a <- foo3a(DT) 144.45937 154.5672 170.6048 233.6362  494.8972   100
# DT3b <- foo3b(DT)  95.91988 100.5313 110.4060 125.1678  364.5651   100

foo2對應Eddi的代碼

foo2 <- function(DT){
    DT[, counter := 1:.N]
    DT[, dummy := 1]
    RE <- dcast.data.table(DT, counter+id ~ experience, value.var = 'dummy', fill = 0)[,lapply(.SD, cumsum), by = id, .SDcols = c(-1,-2)]
    RE[, setdiff(levels(DT$experience), unique(DT$experience)) := 0]
    setcolorder(RE, c("id",levels(DT$experience)))
}

foo3a對應Arun使用關卡的第一個代碼

foo3a <- function(DT){
   ex = levels(DT$experience)
   DT[, c(ex) := 0L]
   tmp = DT[, list(list(.I)), by=experience]
   tmp[, experience := as.character(experience)] ## convert to char
   for(i in seq(nrow(tmp))) {
      set(DT, i=tmp$V1[[i]], j=tmp$experience[i], val=1L)
   }
   DT[, c(ex) := lapply(.SD, cumsum), by=id, .SDcols=ex]
}

foo3b對應Arun使用字符的代碼

foo3b <- function(DT){
   ex = levels(DT$experience)
   DT[, c(ex) := 0L]
   tmp = DT[, list(list(.I)), by=experience]
   tmp[, experience := as.character(experience)] ## convert to char
   for(i in seq(nrow(tmp))) {
      set(DT, i=tmp$V1[[i]], j=tmp$experience[i], val=1L)
   }
   ex = as.character(unique(DT$experience)) ## rewrite 'ex'
   DT[, c(ex) := lapply(.SD, cumsum), by=id, .SDcols=ex]
}

Answer 1

這個怎么樣？

首先創建所有列並將它們初始化為0L。

ex = levels(DT$experience)
DT[, c(ex) := 0L]

現在，按experience分組並獲取與列表中每個experience相對應的行號，如下所示：

tmp = DT[, list(list(.I)), by=experience]
tmp[, experience := as.character(experience)] ## convert to char

然后，你可以循環'每一列並使用set與相應的行（來自列V1 ）和列（來自列experience ）來自tmp ，將1分配給DT的相應列，如下所示：

for(i in seq(nrow(tmp))) {
    set(DT, i=tmp$V1[[i]], j=tmp$experience[i], val=1L)
}

最后通過id在每列上的cumsum ：

DT[, c(ex) := lapply(.SD, cumsum), by=id, .SDcols=ex]

總共需要0.013秒（ dcast.data.table解決方案，這也很好，耗時0.027秒）。

如果你使用as.character(unique(DT$experience))而不是ex在最后一行，你可能可以節省更多的時間..因為有些列全部為0而你不需要cumsum它們來cumsum它們。 那是：

ex = as.character(unique(DT$experience)) ## rewrite 'ex'
DT[, c(ex) := lapply(.SD, cumsum), by=id, .SDcols=ex]

Answer 2

也許這樣的東西：

# add some extra variables
DT[, counter := 1:.N]
DT[, dummy := 1]

dcast.data.table(DT, counter+id ~ experience, value.var = 'dummy', fill = 0)[,
  lapply(.SD, cumsum), by = id, .SDcols = c(-1,-2)]
#       id 000A 010A 111A 000B 001B 011B 100B
#    1:  1    1    0    0    0    0    0    0
#    2:  1    2    0    0    0    0    0    0
#    3:  1    2    0    0    1    0    0    0
#    4:  1    2    0    1    1    0    0    0
#    5:  1    2    0    1    1    1    0    0
#   ---                                      
#19996:  2 2000  999 1999 1000 1000  999 1999
#19997:  2 2000  999 2000 1000 1000  999 1999
#19998:  2 2000  999 2000 1000 1000  999 2000
#19999:  2 2000 1000 2000 1000 1000  999 2000
#20000:  2 2000 1000 2000 1000 1000 1000 2000

如果你願意，你可以cbind它。

通過其他因素對因子進行有效計數或制表，並在data.frame中重新整形？

問題描述

問題

當前代碼和計算速度

用於創建此示例的代碼

更新：速度比較：

2 個解決方案

解決方案1
4 已采納 2014-04-24 16:09:08

解決方案2
2 2014-04-24 15:19:31

通過其他因素對因子進行有效計數或制表，並在data.frame中重新整形？

問題描述

問題

當前代碼和計算速度

用於創建此示例的代碼

更新：速度比較：

2 個解決方案

解決方案1 4 已采納 2014-04-24 16:09:08

解決方案2 2 2014-04-24 15:19:31

解決方案1
4 已采納 2014-04-24 16:09:08

解決方案2
2 2014-04-24 15:19:31