Using := in data.table with paste()

I have started using data.table for a large population model. So far, I have been impressed because using the data.table structure decreases my simulation run times by about 30%. I am trying to further optimize my code and have included a simplified example. My two questions are:

  1. Is it possible to use the := operator with this code?
  2. Would using the := operator be quicker (although, if I am able to answer my first question, I should be able to answer my question 2!)?

I am using R version 3.1.2 on a machine running Windows 7 with data.table version 1.9.4.

Here is my reproducible example:

library(data.table)

## Create  example table and set initial conditions
nYears = 10
exampleTable = data.table(Site = paste("Site", 1:3))
exampleTable[ , growthRate := c(1.1, 1.2, 1.3), ]
exampleTable[ , c(paste("popYears", 0:nYears, sep = "")) := 0, ]

exampleTable[ , "popYears0" := c(10, 12, 13)] # set the initial population size

for(yearIndex in 0:(nYears - 1)){
    exampleTable[[paste("popYears", yearIndex + 1, sep = "")]] <- 
    exampleTable[[paste("popYears", yearIndex, sep = "")]] * 
    exampleTable[, growthRate]
}

I am trying to do something like:

for(yearIndex in 0:(nYears - 1)){
    exampleTable[ , paste("popYears", yearIndex + 1, sep = "") := 
    paste("popYears", yearIndex, sep = "") * growthRate, ] 
}

However, this does not work because, inside the data.table, paste() just returns a character string rather than a reference to the column, for example:

exampleTable[ , paste("popYears", yearIndex + 1, sep = "")]
# [1] "popYears10"

I have looked through the data.table documentation. Section 2.9 of the FAQ uses cat, but this produces a null output.

exampleTable[ , cat(paste("popYears", yearIndex + 1, sep = ""))]
# [1] popYears10NULL

Also, I tried searching Google and rseek.org, but didn't find anything. If I am missing an obvious search term, I would appreciate a search tip. I have always found searching for R operators to be hard because search engines don't like symbols (e.g., ":=") and "R" can be vague.

## Start with 1st three columns of example data
dt <- exampleTable[, 1:3, with = FALSE]  # with = FALSE selects columns by position, also in data.table 1.9.4

## Run for 1st five years
nYears <- 5
for(ii in seq_len(nYears)-1) {
    y0 <- as.symbol(paste0("popYears", ii))
    y1 <- paste0("popYears", ii+1)
    dt[, (y1) := eval(y0)*growthRate]
}

## Check that it worked
dt
#     Site growthRate popYears0 popYears1 popYears2 popYears3 popYears4 popYears5
#1: Site 1        1.1        10      11.0     12.10    13.310   14.6410  16.10510
#2: Site 2        1.2        12      14.4     17.28    20.736   24.8832  29.85984
#3: Site 3        1.3        13      16.9     21.97    28.561   37.1293  48.26809
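
Some context on why the loop above works: putting `(y1)` in parentheses on the left of `:=` makes data.table use the value of the variable as the column name, and `eval(y0)` evaluates the symbol as a column of the table. The following is a minimal sketch of an alternative spelling that uses `get()` to look up a column from its name held in a string. The table `demo` and its values are made up for illustration and are not part of the original example.

```r
library(data.table)

# Hypothetical two-site table, used only for illustration
demo <- data.table(Site = c("A", "B"),
                   growthRate = c(1.1, 1.2),
                   popYears0 = c(10, 12))

for (ii in 0:2) {
    y0 <- paste0("popYears", ii)       # existing column, as a string
    y1 <- paste0("popYears", ii + 1)   # new column, as a string
    # (y1) on the left: use the string's value as the new column name
    # get(y0) on the right: fetch the column named by the string
    demo[, (y1) := get(y0) * growthRate]
}

demo
```

Whether `get()` or `eval(as.symbol(...))` reads better is largely a matter of taste; both avoid hard-coding the column names.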

Edit:

Because the possibility of speeding this up using set() keeps coming up in the comments, I'll throw this additional option out there.

nYears <- 5

## Things that only need to be calculated once can be taken out of the loop
r <- dt[["growthRate"]]
yy <- paste0("popYears", seq_len(nYears+1)-1)

## A loop using set() and data.table's nice compact syntax
for(ii in seq_len(nYears)) {
    set(dt, , yy[ii+1], r*dt[[yy[ii]]])
}

## Check results
dt
#     Site growthRate popYears0 popYears1 popYears2 popYears3 popYears4 popYears5
#1: Site 1        1.1        10      11.0     12.10    13.310   14.6410  16.10510
#2: Site 2        1.2        12      14.4     17.28    20.736   24.8832  29.85984
#3: Site 3        1.3        13      16.9     21.97    28.561   37.1293  48.26809
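
Since the question also asks about speed, here is a rough timing sketch, not a rigorous benchmark. The helper names `grow_base()`, `grow_assign()` and `grow_set()`, the table `template`, and the table size are assumptions made for illustration. Timings depend on the machine and the data.table version, so no numbers are claimed here.

```r
library(data.table)

set.seed(1)
nYears <- 50
nSites <- 1e4                      # illustrative size only
cols <- paste0("popYears", 0:nYears)

template <- data.table(Site = paste("Site", seq_len(nSites)),
                       growthRate = runif(nSites, 1.0, 1.3))
template[, (cols) := 0]
template[, popYears0 := 10]

# Variant 1: the [[<- assignment from the question
grow_base <- function(dt) {
    for (ii in seq_len(nYears)) {
        dt[[cols[ii + 1]]] <- dt[[cols[ii]]] * dt[["growthRate"]]
    }
    dt
}

# Variant 2: := with a symbol built from the column name
grow_assign <- function(dt) {
    for (ii in seq_len(nYears)) {
        y0 <- as.symbol(cols[ii])
        y1 <- cols[ii + 1]
        dt[, (y1) := eval(y0) * growthRate]
    }
    dt
}

# Variant 3: set(), which avoids the overhead of [.data.table
grow_set <- function(dt) {
    r <- dt[["growthRate"]]
    for (ii in seq_len(nYears)) {
        set(dt, j = cols[ii + 1], value = r * dt[[cols[ii]]])
    }
    dt
}

# copy() gives each variant its own table, because := and set() modify in place
t_base   <- system.time(res_base   <- grow_base(copy(template)))
t_assign <- system.time(res_assign <- grow_assign(copy(template)))
t_set    <- system.time(res_set    <- grow_set(copy(template)))

# Elapsed seconds per variant
c(base = t_base[["elapsed"]], assign = t_assign[["elapsed"]], set = t_set[["elapsed"]])
```

For a real comparison, it would be worth repeating each variant several times and using a table size close to the actual model.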

Struggling with column names is a strong indicator that the wide format is probably not the best choice for the given problem. Therefore, I suggest doing the computations in long format and reshaping the result from long to wide format at the end.

nYears = 10
params = data.table(Site = paste("Site", 1:3),
                    growthRate = c(1.1, 1.2, 1.3), 
                    pop = c(10, 12, 13))
long <- params[CJ(Site = Site, Year = 0:nYears), on = "Site"][
  , growth := cumprod(shift(growthRate, fill = 1)), by = Site][
    , pop := pop * growth][]
dcast(long, Site + growthRate ~ sprintf("popYears%02i", Year), value.var = "pop")
     Site growthRate popYears00 popYears01 popYears02 popYears03 popYears04 popYears05 popYears06 popYears07 popYears08 popYears09 popYears10
1: Site 1        1.1         10       11.0      12.10     13.310    14.6410   16.10510   17.71561   19.48717   21.43589   23.57948   25.93742
2: Site 2        1.2         12       14.4      17.28     20.736    24.8832   29.85984   35.83181   42.99817   51.59780   61.91736   74.30084
3: Site 3        1.3         13       16.9      21.97     28.561    37.1293   48.26809   62.74852   81.57307  106.04499  137.85849  179.21604

Explanation

First, the parameters are expanded to cover 11 years (including year 0) using the cross join function CJ() and a subsequent right join on Site:

params[CJ(Site = Site, Year = 0:nYears), on = "Site"]
      Site growthRate pop Year
 1: Site 1        1.1  10    0
 2: Site 1        1.1  10    1
 3: Site 1        1.1  10    2
 4: Site 1        1.1  10    3
 5: Site 1        1.1  10    4
 6: Site 1        1.1  10    5
 7: Site 1        1.1  10    6
 8: Site 1        1.1  10    7
 9: Site 1        1.1  10    8
10: Site 1        1.1  10    9
11: Site 1        1.1  10   10
12: Site 2        1.2  12    0
13: Site 2        1.2  12    1
14: Site 2        1.2  12    2
15: Site 2        1.2  12    3
16: Site 2        1.2  12    4
17: Site 2        1.2  12    5
18: Site 2        1.2  12    6
19: Site 2        1.2  12    7
20: Site 2        1.2  12    8
21: Site 2        1.2  12    9
22: Site 2        1.2  12   10
23: Site 3        1.3  13    0
24: Site 3        1.3  13    1
25: Site 3        1.3  13    2
26: Site 3        1.3  13    3
27: Site 3        1.3  13    4
28: Site 3        1.3  13    5
29: Site 3        1.3  13    6
30: Site 3        1.3  13    7
31: Site 3        1.3  13    8
32: Site 3        1.3  13    9
33: Site 3        1.3  13   10
      Site growthRate pop Year

Then the growth is computed from the shifted growth rates using the cumulative product function cumprod(), separately for each Site. The shift is required to skip the initial year for each Site: with fill = 1, the multiplier for year 0 is 1, so the population in year 0 equals the initial population. Then the population is computed by multiplying the growth by the initial population.
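
The shift-then-cumprod step can be traced on a single site. This is a small sketch using the Site 1 parameters from the example (growth rate 1.1, initial population 10) over years 0 to 3; the variable name `multiplier` is introduced here only for illustration.

```r
library(data.table)

growthRate <- rep(1.1, 4)                  # one site, years 0 to 3
multiplier <- shift(growthRate, fill = 1)  # 1.0 1.1 1.1 1.1  (year 0 gets no growth)
growth     <- cumprod(multiplier)          # 1.0 1.1 1.21 1.331
pop        <- 10 * growth                  # 10 11 12.1 13.31
pop
```

The values agree with the popYears columns shown for Site 1 in the first answer.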

Finally, the data.table is reshaped from long to wide format using dcast(). The column headers are created on the fly using sprintf() to ensure the correct order of columns.
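
To see why the padding matters for the column order, compare how unpadded and zero-padded names sort as strings. This is a minimal sketch; the variable names `plain` and `padded` are made up for illustration, and it assumes (as I understand dcast() to behave) that the new columns are ordered by the sorted values of the right-hand side of the formula.

```r
years <- 0:10

plain  <- paste0("popYears", years)        # "popYears0", ..., "popYears10"
padded <- sprintf("popYears%02i", years)   # "popYears00", ..., "popYears10"

# method = "radix" sorts in the C locale, independent of the session's locale
sort(plain, method = "radix")[1:4]
# "popYears0" "popYears1" "popYears10" "popYears2"

# Zero-padded names keep their chronological order when sorted
identical(sort(padded, method = "radix"), padded)
# TRUE
```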
