[英]Efficient count or tabulation of a factor by other factors and reshaping in a data.frame?
我在使用data.table時,尋找一種有效的方法來計算向量的所有向量級別的累積和(列表)。
dataframe / data.table DT最初由四個變量組成,一個名為experience 。 目標是一個向量,它包含經驗中的因子水平的累積計數條件兩個其他變量, id和cl 。 值得注意的是,因子經驗具有比數據集中存在的更多因子水平(這是必要的屬性)。
數據看起來像
id trial experience cl
1: 1 1 000A A
2: 1 2 000A A
3: 1 3 000B B
4: 1 4 111A A
5: 1 5 001B B
6: 2 1 100B B
7: 2 2 111A A
8: 2 3 100B B
9: 2 4 010A A
10: 2 5 011B B
經驗因素水平為16級
levels(DT$experience)
# [1] "000A" "001A" "010A" "011A" "100A" "101A" "110A" "111A"
# [9] "000B" "001B" "010B" "011B" "100B" "101B" "110B" "111B"
我們想要計算的是以id和cl為條件的體驗的累積計數。 考慮前三行:對於id = 1,第一個經驗值是000A,因此計數器變量c000A = 1.第二個經驗值也是000A,因此計數器c000A = 2.但現在第三個經驗值是000B,並且所以前一個計數器c000A保持2,但另一個計數器c000B = 1,之前為0。
遵循這個邏輯,我們想要的結果如下:
id trial experience cl c000A c001A c010A c011A c100A c101A c110A c111A c000B c001B c010B c011B c100B c101B c110B c111B
1: 1 1 000A A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2: 1 2 000A A 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3: 1 3 000B B 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
4: 1 4 111A A 2 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5: 1 5 001B B 2 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
6: 2 1 100B B 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
7: 2 2 111A A 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
8: 2 3 100B B 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0 0
9: 2 4 010A A 0 0 1 0 0 0 0 1 0 0 0 0 2 0 0 0
10: 2 5 011B B 0 0 1 0 0 0 0 1 0 0 0 1 2 0 0 0
注意 :將16個條目c000A,...,c111B分配給不同的列對我來說並不重要。 如果結果是具有16個條目的一個向量作為c000A,c001A,...,c110B,c111B來保持累積計數,那將是完全足夠的。
我使用的當前代碼是以下兩步方法。 既不漂亮也不優雅。
foo <- function(DT){
# tabulate experience for each trial
# store in an auxiliary variables <s000A, s001A, ..., s110B, s111B>
DT[, paste(sep="","s",levels(DT$experience)) := as.list(table(experience)), by = c("id","cl","trial")]
# sum each of the s____ variables by id
DT[, "c000A" := cumsum(s000A), by = id] # this is clumsy
DT[, "c001A" := cumsum(s001A), by = id]
DT[, "c010A" := cumsum(s010A), by = id]
DT[, "c011A" := cumsum(s011A), by = id]
DT[, "c100A" := cumsum(s100A), by = id]
DT[, "c101A" := cumsum(s101A), by = id]
DT[, "c110A" := cumsum(s110A), by = id]
DT[, "c111A" := cumsum(s111A), by = id]
DT[, "c000B" := cumsum(s000B), by = id]
DT[, "c001B" := cumsum(s001B), by = id]
DT[, "c010B" := cumsum(s010B), by = id]
DT[, "c011B" := cumsum(s011B), by = id]
DT[, "c100B" := cumsum(s100B), by = id]
DT[, "c101B" := cumsum(s101B), by = id]
DT[, "c110B" := cumsum(s110B), by = id]
DT[, "c111B" := cumsum(s111B), by = id]
}
對於具有n = 1e + 4次試驗和2次ID的數據集,此代碼采用:
system.time(foo(DT))
# User System verstrichen
# 9.78 0.00 10.05
library("data.table")
library("R.utils")
# Sample dataframe DF with n=1e+4
n <- 1e+4 #to test change this to n=5
DT <- data.table(id = rep(1:2,each=n), trial = rep(1:n,2), experience = c("000A","000A","000B","111A","001B","100B","111A","100B","010A","011B"), cl = c("A","A","B","A","B","B","A","B","A","B")) # experience needs to be a factor w more levels
DT$experience <- factor(DT$experience, levels = paste(sep="", intToBin(0:7), rep(c("A","B"),each=8)))
setkey(DT,id,trial,cl) #set the data.table keys
誰有更快更優雅的解決方案?
謝謝! 賈納
library("microbenchmark")
benchmk <- microbenchmark(
DT2 <- foo2(DT),
DT3a <- foo3a(DT),
DT3b <- foo3b(DT),
times=100L
)
print(benchmk)
# with n=1e+4
#
# unit milliseconds
# expr min lq median uq max neval
# DT2 <- foo2(DT) 46.96745 52.17469 74.72479 120.93339 212.7912 100
# DT3a <- foo3a(DT) 25.21907 26.57921 28.84702 34.89401 121.3164 100
# DT3b <- foo3b(DT) 19.82076 20.80570 22.87369 30.83561 148.0520 100
# with n=1e+5
#
# unit milliseconds
# expr min lq median uq max neval
# DT2 <- foo2(DT) 386.93890 445.0184 481.4660 534.9619 1160.6151 100
# DT3a <- foo3a(DT) 144.45937 154.5672 170.6048 233.6362 494.8972 100
# DT3b <- foo3b(DT) 95.91988 100.5313 110.4060 125.1678 364.5651 100
foo2對應Eddi的代碼
foo2 <- function(DT){
DT[, counter := 1:.N]
DT[, dummy := 1]
RE <- dcast.data.table(DT, counter+id ~ experience, value.var = 'dummy', fill = 0)[,lapply(.SD, cumsum), by = id, .SDcols = c(-1,-2)]
RE[, setdiff(levels(DT$experience), unique(DT$experience)) := 0]
setcolorder(RE, c("id",levels(DT$experience)))
}
foo3a對應Arun使用關卡的第一個代碼
foo3a <- function(DT){
ex = levels(DT$experience)
DT[, c(ex) := 0L]
tmp = DT[, list(list(.I)), by=experience]
tmp[, experience := as.character(experience)] ## convert to char
for(i in seq(nrow(tmp))) {
set(DT, i=tmp$V1[[i]], j=tmp$experience[i], val=1L)
}
DT[, c(ex) := lapply(.SD, cumsum), by=id, .SDcols=ex]
}
foo3b對應Arun使用字符的代碼
foo3b <- function(DT){
ex = levels(DT$experience)
DT[, c(ex) := 0L]
tmp = DT[, list(list(.I)), by=experience]
tmp[, experience := as.character(experience)] ## convert to char
for(i in seq(nrow(tmp))) {
set(DT, i=tmp$V1[[i]], j=tmp$experience[i], val=1L)
}
ex = as.character(unique(DT$experience)) ## rewrite 'ex'
DT[, c(ex) := lapply(.SD, cumsum), by=id, .SDcols=ex]
}
這個怎么樣?
首先創建所有列並將它們初始化為0L。
ex = levels(DT$experience)
DT[, c(ex) := 0L]
現在,按experience
分組並獲取與列表中每個experience
相對應的行號,如下所示:
tmp = DT[, list(list(.I)), by=experience]
tmp[, experience := as.character(experience)] ## convert to char
然后,你可以循環'每一列並使用set
與相應的行(來自列V1
)和列(來自列experience
)來自tmp
,將1
分配給DT
的相應列,如下所示:
for(i in seq(nrow(tmp))) {
set(DT, i=tmp$V1[[i]], j=tmp$experience[i], val=1L)
}
最后通過id
在每列上的cumsum
:
DT[, c(ex) := lapply(.SD, cumsum), by=id, .SDcols=ex]
總共需要0.013秒( dcast.data.table
解決方案,這也很好,耗時0.027秒)。
如果你使用as.character(unique(DT$experience))
而不是ex
在最后一行,你可能可以節省更多的時間..因為有些列全部為0而你不需要cumsum
它們來cumsum
它們。 那是:
ex = as.character(unique(DT$experience)) ## rewrite 'ex'
DT[, c(ex) := lapply(.SD, cumsum), by=id, .SDcols=ex]
也許這樣的東西:
# add some extra variables
DT[, counter := 1:.N]
DT[, dummy := 1]
dcast.data.table(DT, counter+id ~ experience, value.var = 'dummy', fill = 0)[,
lapply(.SD, cumsum), by = id, .SDcols = c(-1,-2)]
# id 000A 010A 111A 000B 001B 011B 100B
# 1: 1 1 0 0 0 0 0 0
# 2: 1 2 0 0 0 0 0 0
# 3: 1 2 0 0 1 0 0 0
# 4: 1 2 0 1 1 0 0 0
# 5: 1 2 0 1 1 1 0 0
# ---
#19996: 2 2000 999 1999 1000 1000 999 1999
#19997: 2 2000 999 2000 1000 1000 999 1999
#19998: 2 2000 999 2000 1000 1000 999 2000
#19999: 2 2000 1000 2000 1000 1000 999 2000
#20000: 2 2000 1000 2000 1000 1000 1000 2000
如果你願意,你可以cbind
它。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.