繁体   English   中英

R在data.table中粘贴以子集可变的列数并计算rowMeans

[英]R Using paste in data.table to subset variable number of columns and calculate rowMeans

在过去一周中,我已经开始使用data.table ,并且data.table了问题。 我已经在这里这里研究了解决方案,但是我不确定它对我的情况有什么帮助。

这是我的样本数据。

> dput(dt)
structure(list(link = c(1L, 1L, 1L, 1L, 1L, 1L), id = c(8395, 8738, 9788, 9789, 9908, 9920), person = c(2937837, 3092435, 3511555, 3511555, 3568112, 3575082), seqid = c(11, 14, 9, 1, 7, 10), time = c(NA, NA, 25372, 50700, NA, NA), max = c(14, 31, 9, 7, 8, 11), hr = c(NA, NA, 7, 14, NA, NA), minhr = c(11, 19, 7, 14, 7, 16), maxhr = c(11, 19, 7, 14, 7, 16), TRAVELTIME0.1avg = c(59, 59, 59, 59, 59, 59 ), TRAVELTIME1.2avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME2.3avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME3.4avg = c(59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819 ), TRAVELTIME4.5avg = c(59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214), TRAVELTIME5.6avg = c(60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798), TRAVELTIME6.7avg = c(59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742 ), TRAVELTIME7.8avg = c(59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874), TRAVELTIME8.9avg = c(59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023), TRAVELTIME9.10avg = c(59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216 ), TRAVELTIME10.11avg = c(59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537 ), TRAVELTIME11.12avg = c(59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983 ), TRAVELTIME12.13avg = c(59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587 ), TRAVELTIME13.14avg = c(59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185), TRAVELTIME14.15avg = c(59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924), TRAVELTIME15.16avg = c(59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373 ), TRAVELTIME16.17avg = c(59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671 ), TRAVELTIME17.18avg = c(59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857 ), TRAVELTIME18.19avg = c(59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712 ), TRAVELTIME19.20avg = c(59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269 ), TRAVELTIME20.21avg = c(59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925 ), TRAVELTIME21.22avg = c(59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727 ), TRAVELTIME22.23avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME23.24avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME24.25avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME25.26avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME26.27avg = c(59, 59, 59, 59, 59, 59)), .Names = c("link", "id", "person", "seqid", "time", "max", "hr", "minhr", "maxhr", "TRAVELTIME0.1avg", "TRAVELTIME1.2avg", "TRAVELTIME2.3avg", "TRAVELTIME3.4avg", "TRAVELTIME4.5avg", "TRAVELTIME5.6avg", "TRAVELTIME6.7avg", "TRAVELTIME7.8avg", "TRAVELTIME8.9avg", "TRAVELTIME9.10avg", "TRAVELTIME10.11avg", "TRAVELTIME11.12avg", "TRAVELTIME12.13avg", "TRAVELTIME13.14avg", "TRAVELTIME14.15avg", "TRAVELTIME15.16avg", "TRAVELTIME16.17avg", "TRAVELTIME17.18avg", "TRAVELTIME18.19avg", "TRAVELTIME19.20avg", "TRAVELTIME20.21avg", "TRAVELTIME21.22avg", "TRAVELTIME22.23avg", "TRAVELTIME23.24avg", "TRAVELTIME24.25avg", "TRAVELTIME25.26avg", "TRAVELTIME26.27avg"), sorted = "link", class = c("data.table", "data.frame"), row.names = c(NA, -6L))

更新1:为避免出现internal.selfref问题,请在使用上述示例创建dt之后执行dt <- data.table(dt)

我想使用minhrmaxhr变量来对旅行时间进行子集计算, rowMeans为这些子集的旅行时间计算rowMeans并将其添加到当前dt中 如果minhr (或maxhr )为11,则相应的行进时间列为TRAVELTIME11.12avg ; 如果为19,则对应的旅行时间列为TRAVELTIME19.20avg 因此,如果minhr为9, maxhr为10,则需要获取TRAVELTIME9.10avgTRAVELTIME10.11avg的平均值; 同样,如果minhr为15, maxhr为17,那么我需要获取TRAVELTIME15.16avgTRAVELTIME16.17avgTRAVELTIME17.18avg的平均值

我尝试逐步解决该问题,并将以下代码用于所有行上统一行进时间列的简单情况。 工作正常。

> dt[,avg:=rowMeans(.SD[,TRAVELTIME10.11avg:TRAVELTIME12.13avg, with=FALSE]),by=.(id, seqid)]

接下来,我尝试通过引入paste0()来动态地引用列名称来修改上述代码。 但是,这会导致错误。 此外,我尝试使用as.symbol(paste0())noquote(paste0())和其他两种取消引用的技术都没有成功。

   > dt[,avg:=rowMeans(.SD[,paste0("TRAVELTIME", minhr, "." , minhr+1, "avg"):paste0("TRAVELTIME", maxhr, "." , maxhr+1, "avg"), with=FALSE]),by=.(id, seqid)]

Error in paste0("TRAVELTIME", minhr, ".", minhr + 1, "avg"):paste0("TRAVELTIME",  : 
  NA/NaN argument
In addition: Warning messages:
1: In eval(expr, envir, enclos) : NAs introduced by coercion
2: In eval(expr, envir, enclos) : NAs introduced by coercion

鉴于此,我有两个问题:

1)如果对直接子集的列使用了粘贴命令(即使在取消对粘贴的字符串的引用之后),则data.table不能识别列名,而不是直接使用列名? 是否与每行的列数不相等有关?

2)由于我没有成功,因此,请您提出一种找到每行可变列数均值并将其加回到dt的方法。 如果建议能导致有效的方法,我将不胜感激,因为,我已经使用一种更简单的循环方法进行了尝试,并且需要花费很长时间(整个数据集大约需要12到15个小时)来处理我的数据。

我相信这可以解决您使用paste0的问题:

tmp  <- paste0("TRAVELTIME", dt$minhr, "." , dt$minhr+1, "avg")
tmp1 <- paste0("TRAVELTIME", dt$maxhr, "." , dt$maxhr+1, "avg")
dt1  <- dt[,avg:=rowMeans(.SD[,get(tmp):get(tmp1), with=FALSE]),by=.(dt$id, dt$seqid)]

可能有人会指出,您并不一定严格要求最后一行中的$ ,但是由于问题的性质,您感到这对于识别和解决问题很有用。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM