Using := in data.table with paste()
I have started using data.table for a large population model. So far, I have been impressed because using the data.table structure decreases my simulation run times by about 30%. I am trying to further optimize my code and have included a simplified example. My two questions are:

1. Is it possible to use the := operator with this code?
2. Would using the := operator be quicker (although, if I am able to answer my first question, I should be able to answer my question 2!)?

I am using R version 3.1.2 on a machine running Windows 7 with data.table version 1.9.4.

Here is my reproducible example:
library(data.table)
## Create example table and set initial conditions
nYears = 10
exampleTable = data.table(Site = paste("Site", 1:3))
exampleTable[ , growthRate := c(1.1, 1.2, 1.3), ]
exampleTable[ , c(paste("popYears", 0:nYears, sep = "")) := 0, ]
exampleTable[ , "popYears0" := c(10, 12, 13)] # set the initial population size
for(yearIndex in 0:(nYears - 1)){
    exampleTable[[paste("popYears", yearIndex + 1, sep = "")]] <-
        exampleTable[[paste("popYears", yearIndex, sep = "")]] *
        exampleTable[, growthRate]
}
I am trying to do something like:
for(yearIndex in 0:(nYears - 1)){
    exampleTable[ , paste("popYears", yearIndex + 1, sep = "") :=
        paste("popYears", yearIndex, sep = "") * growthRate, ]
}
However, this does not work because paste() is not evaluated to a column reference inside the data.table; it just returns the character string, for example:
exampleTable[ , paste("popYears", yearIndex + 1, sep = "")]
# [1] "popYears10"
I have looked through the data.table documentation. Section 2.9 of the FAQ uses cat, but this produces a null output.
exampleTable[ , cat(paste("popYears", yearIndex + 1, sep = ""))]
# [1] popYears10NULL
Also, I tried searching Google and rseek.org, but didn't find anything. If I am missing an obvious search term, I would appreciate a search tip. I have always found searching for R operators to be hard because search engines don't like symbols (e.g., ":=") and "R" can be vague.
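## Start with 1st three columns of example data
## (with = FALSE selects columns by position; data.table versions before 1.9.8,
## such as the 1.9.4 used in the question, would otherwise just return 1:3)
dt <- exampleTable[, 1:3, with = FALSE]
## Run for 1st five years
nYears <- 5
for(ii in seq_len(nYears)-1) {
    ## y0: existing column as a symbol; y1: name of the new column
    y0 <- as.symbol(paste0("popYears", ii))
    y1 <- paste0("popYears", ii+1)
    dt[, (y1) := eval(y0)*growthRate]
}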
## Check that it worked
dt
# Site growthRate popYears0 popYears1 popYears2 popYears3 popYears4 popYears5
#1: Site 1 1.1 10 11.0 12.10 13.310 14.6410 16.10510
#2: Site 2 1.2 12 14.4 17.28 20.736 24.8832 29.85984
#3: Site 3 1.3 13 16.9 21.97 28.561 37.1293 48.26809
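The loop above relies on two details: wrapping the left-hand side of := in parentheses so that the value of the variable is used as the column name, and turning the source column name into something that is evaluated as a column. The following is a minimal sketch of both on a toy table. The table and the variable names src and dst are illustrative assumptions, and get() is shown as an alternative to eval(as.symbol()), not as the approach used in the answer.

```r
library(data.table)

# Toy table; names are illustrative, not taken from the original post
dt <- data.table(growthRate = c(1.1, 1.2, 1.3), popYears0 = c(10, 12, 13))

src <- "popYears0"   # name of an existing column, held in a variable
dst <- "popYears1"   # name of the column to create

# Without parentheses, the symbol itself is the column name:
# this creates a column literally called "dst"
dt[, dst := 0]

# With parentheses, the value of `dst` ("popYears1") is the column name.
# get(src) looks the source column up by its name.
dt[, (dst) := get(src) * growthRate]

names(dt)
```

Inspecting names(dt) afterwards shows both a column called dst and one called popYears1, which is why the parentheses matter when the name is built with paste().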
Edit:
Because the possibility of speeding this up using set() keeps coming up in the comments, I'll throw this additional option out there.
nYears <- 5
## Things that only need to be calculated once can be taken out of the loop
r <- dt[["growthRate"]]
yy <- paste0("popYears", seq_len(nYears+1)-1)
## A loop using set() and data.table's nice compact syntax
for(ii in seq_len(nYears)) {
    set(dt, , yy[ii+1], r*dt[[yy[ii]]])
}
## Check results
dt
# Site growthRate popYears0 popYears1 popYears2 popYears3 popYears4 popYears5
#1: Site 1 1.1 10 11.0 12.10 13.310 14.6410 16.10510
#2: Site 2 1.2 12 14.4 17.28 20.736 24.8832 29.85984
#3: Site 3 1.3 13 16.9 21.97 28.561 37.1293 48.26809
Struggling with column names is a strong indicator that the wide format is probably not the best choice for the given problem. Therefore, I suggest doing the computations in long format and, as a final step, reshaping the result from long to wide format.
nYears = 10
params = data.table(Site = paste("Site", 1:3),
growthRate = c(1.1, 1.2, 1.3),
pop = c(10, 12, 13))
long <- params[CJ(Site = Site, Year = 0:nYears), on = "Site"][
, growth := cumprod(shift(growthRate, fill = 1)), by = Site][
, pop := pop * growth][]
dcast(long, Site + growthRate ~ sprintf("popYears%02i", Year), value.var = "pop")
     Site growthRate popYears00 popYears01 popYears02 popYears03 popYears04 popYears05 popYears06 popYears07 popYears08 popYears09 popYears10
1: Site 1        1.1         10       11.0      12.10     13.310    14.6410   16.10510   17.71561   19.48717   21.43589   23.57948   25.93742
2: Site 2        1.2         12       14.4      17.28     20.736    24.8832   29.85984   35.83181   42.99817   51.59780   61.91736   74.30084
3: Site 3        1.3         13       16.9      21.97     28.561    37.1293   48.26809   62.74852   81.57307  106.04499  137.85849  179.21604
First, the parameters are expanded to cover 11 years (including year 0) using the cross join function CJ() and a subsequent right join on Site:
params[CJ(Site = Site, Year = 0:nYears), on = "Site"]
      Site growthRate pop Year
 1: Site 1        1.1  10    0
 2: Site 1        1.1  10    1
 3: Site 1        1.1  10    2
 4: Site 1        1.1  10    3
 5: Site 1        1.1  10    4
 6: Site 1        1.1  10    5
 7: Site 1        1.1  10    6
 8: Site 1        1.1  10    7
 9: Site 1        1.1  10    8
10: Site 1        1.1  10    9
11: Site 1        1.1  10   10
12: Site 2        1.2  12    0
13: Site 2        1.2  12    1
14: Site 2        1.2  12    2
15: Site 2        1.2  12    3
16: Site 2        1.2  12    4
17: Site 2        1.2  12    5
18: Site 2        1.2  12    6
19: Site 2        1.2  12    7
20: Site 2        1.2  12    8
21: Site 2        1.2  12    9
22: Site 2        1.2  12   10
23: Site 3        1.3  13    0
24: Site 3        1.3  13    1
25: Site 3        1.3  13    2
26: Site 3        1.3  13    3
27: Site 3        1.3  13    4
28: Site 3        1.3  13    5
29: Site 3        1.3  13    6
30: Site 3        1.3  13    7
31: Site 3        1.3  13    8
32: Site 3        1.3  13    9
33: Site 3        1.3  13   10
      Site growthRate pop Year
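To see what the cross join contributes on its own, here is a small sketch with only three years. The names params_demo and grid are illustrative assumptions; the calls themselves (CJ() and a join with on =) are the same ones used in the answer.

```r
library(data.table)

# Small stand-in for the parameter table (values copied from the example)
params_demo <- data.table(Site = paste("Site", 1:3),
                          growthRate = c(1.1, 1.2, 1.3),
                          pop = c(10, 12, 13))

# CJ() builds every combination of its arguments: 3 sites x 3 years = 9 rows
grid <- CJ(Site = params_demo$Site, Year = 0:2)

# Joining the parameters onto the grid repeats each site's parameters
# once per year
expanded <- params_demo[grid, on = "Site"]
expanded
```

Each site now appears once per year with its growthRate and pop repeated, which is the long layout the remaining steps operate on.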
Then the growth is computed from the shifted growth rates using the cumulative product function cumprod(), separately for each Site. The shift is required to skip the initial year for each Site. Then the population is computed by multiplying by the initial population.
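The effect of the shift is easiest to see on a single site. The sketch below uses a growth rate of 1.1 over four years (an illustrative choice) and compares the result with the closed form rate^year; the variable names are assumptions made for this example.

```r
library(data.table)

rate  <- rep(1.1, 4)               # growth rate repeated for years 0..3
years <- 0:3

# shift() lags the vector by one position and fills the gap with 1,
# so year 0 gets a growth factor of 1 (no growth applied yet)
lagged <- shift(rate, fill = 1)    # 1.0, 1.1, 1.1, 1.1

# cumprod() then accumulates the factors year by year
growth <- cumprod(lagged)          # 1, 1.1, 1.1^2, 1.1^3

# Same thing as the closed form rate^year
all.equal(growth, 1.1^years)

# Population for an initial size of 10
10 * growth
```

Without the shift, cumprod(rate) would already apply one year of growth in year 0.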
Finally, the data.table is reshaped from long to wide format using dcast(). The column headers are created on-the-fly using sprintf() to ensure the correct order of columns.
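Why the padding matters for column order can be checked in base R: generated names are ordered as character strings, so an unpadded "popYears10" sorts before "popYears2". A minimal sketch, with plain and padded as illustrative variable names:

```r
years <- 0:10

# Unpadded names: character sorting puts "popYears10" before "popYears2"
plain <- paste0("popYears", years)
sort(plain)

# Zero-padded names keep their chronological order when sorted
padded <- sprintf("popYears%02i", years)
sort(padded)

identical(sort(padded), padded)
```

The last line returns TRUE for the padded names, while the same check on plain does not hold.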