Fast data.table assign of multiple columns by group from lookup
I have searched for the canonical way to do this, but I have had little luck getting something working that is both fast and elegant. In short, I have a large table with multiple value columns and want to multiply each by a corresponding factor from a lookup table. I cannot figure out how to dynamically pass in which columns I want multiplied by the lookup values, or how to refer to the lookup values in general outside of basic expressions.
Here is my example. I have set it up with 3 million rows and 10 value columns; this doesn't take too long to build and is somewhat representative of the data size (this will be implemented as part of a much larger loop, hence the emphasis on performance). There is also a lookup table with 6 levels and some assorted multipliers for our value_1:value_10 columns.
library(data.table)
setsize <- 3000000
value_num <- 10
factors <- c("factor_a", "factor_b", "factor_c", "factor_d", "factor_e", "factor_f")
# Columns 1-10 (V1, ...) hold random factor levels; columns 11-20 hold random values
random <- data.table(replicate(10, sample(factors, size = setsize, replace = TRUE))
, replicate(10, rnorm(setsize, mean = 700, sd = 50)))
# One row per factor level, with a multiplier for each value column
lookup <- data.table("V1" = factors, replicate(10, seq(.90, 1.5, length.out = length(factors))))
wps <- paste("value", c(1:10), sep = "_")
names(random)[11:20] <- wps
names(lookup)[2:11] <- wps
# Key both tables on V1 so that random[lookup] joins on the factor level
setkeyv(random, "V1")
setkeyv(lookup, "V1")
Solution 1: It is fairly quick, but I can't figure out how to generically refer to the i-columns like i.value_1 so that I can pass them into a loop or, better yet, apply them all at once.
f <- function() {
random[lookup, value_1 := value_1 * i.value_1, by = .EACHI]
random[lookup, value_2 := value_2 * i.value_2, by = .EACHI]
random[lookup, value_3 := value_3 * i.value_3, by = .EACHI]
random[lookup, value_4 := value_4 * i.value_4, by = .EACHI]
random[lookup, value_5 := value_5 * i.value_5, by = .EACHI]
random[lookup, value_6 := value_6 * i.value_6, by = .EACHI]
random[lookup, value_7 := value_7 * i.value_7, by = .EACHI]
random[lookup, value_8 := value_8 * i.value_8, by = .EACHI]
random[lookup, value_9 := value_9 * i.value_9, by = .EACHI]
random[lookup, value_10 := value_10 * i.value_10, by = .EACHI]
}
system.time(f())
user system elapsed
0.184 0.000 0.181
Solution 2: After I could not get solution 1 to be generic, I tried a set() based approach. However, despite allowing me to specify the targeted value columns in the character vector wps, it is actually much slower than the above. I know I am using it wrong, but I am unsure how to improve it to remove all the [.data.table overhead.
idx_groups <- random[,.(rowstart = min(.I), rowend = max(.I)), by = key(random)][lookup]
system.time(
for (i in 1:nrow(idx_groups)){
rows <- idx_groups[["rowstart"]][i]:idx_groups[["rowend"]][i]
for (j in wps) {
set(random, i=rows, j=j, value= random[rows][[j]] * idx_groups[[j]][i])
}
})
user system elapsed
3.940 0.024 3.967
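Most of the [.data.table overhead in the loop above likely comes from calling random[rows] once per group and column. One way to avoid it, sketched here on hypothetical toy tables (dt, lkp and cols stand in for random, lookup and wps), is to compute each row's position in the lookup once with match() and then do a single vectorised set() per column. This assumes every V1 value in the data also appears in the lookup; unmatched rows would become NA.

```r
library(data.table)

# Toy stand-ins for `random` and `lookup` (hypothetical small data)
dt   <- data.table(V1 = c("a", "b", "a"), value_1 = c(10, 20, 30), value_2 = c(1, 2, 3))
lkp  <- data.table(V1 = c("a", "b"), value_1 = c(2, 10), value_2 = c(4, 5))
cols <- c("value_1", "value_2")

# For every row of dt, the row number of its group in lkp (computed once)
idx <- match(dt[["V1"]], lkp[["V1"]])

# One vectorised multiplication per column; set() assigns by reference
for (j in cols) {
  set(dt, j = j, value = dt[[j]] * lkp[[j]][idx])
}
```

This variant does not need the tables to be keyed, and the loop runs once per column rather than once per group and column.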
Any advice on how to better structure these operations would be appreciated.
Edit: I'm very frustrated with myself for failing to try this obvious solution before posting this question:
system.time(
for (col in wps){
random[lookup, (col) := list(get(col) * get(paste0("i.", col))), by = .EACHI]
})
user system elapsed
1.600 0.048 1.652
which seems to do what I want with reasonable speed. However, it is still roughly 10x slower than the first solution above (I'm sure due to the repeated get()), so I'm still open to advice.
Edit 2: Replacing get() with eval(parse(text=col)) seems to have done the trick.
system.time(
for (col in wps){
random[lookup, (col) := list(eval(parse(text=col)) * eval(parse(text=paste0("i.", col)))), by = .EACHI]
})
user system elapsed
0.184 0.000 0.185
Edit 3: Several good working answers have been provided. Rafael's solution is probably best in the general case, though I will note that I could squeeze a few more milliseconds out of the call construction recommended by Jangorecki in exchange for a rather intimidating-looking helper function. I've marked it as answered; thanks for the help, everyone.
You can also use lapply:
cols <- noquote(paste0("value_",1:10))
random[lookup, (cols) := lapply (cols, function(x) get(x) * get(paste0("i.", x))), by = .EACHI ]
If your dataset is very large and you want to see a progress bar for the operation, you can use pblapply:
library(pbapply)
random[lookup, (cols) := pblapply(cols, function(x) get(x) * get(paste0("i.", x))), by = .EACHI ]
This is about 2x slower than the text parsing / call construction approaches, but is more readable:
random[lookup, (wps) := Map('*', mget(wps), mget(paste0('i.', wps))), by = .EACHI]
Thanks to jangorecki for pointing out his answer here, which dynamically builds the J expression using a helper function and then evaluates it all at once. It avoids the overhead of parse/get and seems to be the fastest solution I am going to get. I also like the ability to manually specify the function being called (in some instances I might want / instead of *) and to inspect the J expression before it is evaluated.
batch.lookup = function(x) {
as.call(list(as.name(":="),x
,as.call(c(
list(as.name("list")),
sapply(x, function(x) call("*", as.name(x), as.name(paste0("i.",x))), simplify=FALSE)
))
))
}
print(batch.lookup(wps))
`:=`(c("value_1", "value_2", "value_3", "value_4", "value_5",
"value_6", "value_7", "value_8", "value_9", "value_10"), list(value_1 = value_1 *
i.value_1, value_2 = value_2 * i.value_2, value_3 = value_3 *
i.value_3, value_4 = value_4 * i.value_4, value_5 = value_5 *
i.value_5, value_6 = value_6 * i.value_6, value_7 = value_7 *
i.value_7, value_8 = value_8 * i.value_8, value_9 = value_9 *
i.value_9, value_10 = value_10 * i.value_10))
system.time(
random[lookup, eval(batch.lookup(wps)), by = .EACHI])
user system elapsed
0.14 0.04 0.18
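To make the "specify the function being called" point concrete, here is a minimal sketch of the same call-construction idea with the operator passed in as an argument. The name batch.lookup.op and its op parameter are hypothetical and not part of the original answer, and the toy tables dt and lkp only stand in for random and lookup.

```r
library(data.table)

# Hypothetical generalisation of batch.lookup(): `op` names the binary
# function to apply to each column pair ("*", "/", ...).
batch.lookup.op <- function(cols, op = "*") {
  # One unevaluated call per column, e.g. value_1 / i.value_1
  rhs <- lapply(cols, function(x) call(op, as.name(x), as.name(paste0("i.", x))))
  # Assemble `:=`(cols, list(...)) as a single unevaluated call
  as.call(list(as.name(":="), cols, as.call(c(list(as.name("list")), rhs))))
}

# Toy data, keyed on V1 like the tables in the question
dt  <- data.table(V1 = c("a", "a", "b"), value_1 = c(10, 20, 30), value_2 = c(1, 2, 3))
lkp <- data.table(V1 = c("a", "b"), value_1 = c(2, 10), value_2 = c(4, 5))
setkeyv(dt, "V1")
setkeyv(lkp, "V1")

j_expr <- batch.lookup.op(c("value_1", "value_2"), op = "/")
print(j_expr)                          # inspect the J expression first
dt[lkp, eval(j_expr), by = .EACHI]     # divide each value column by its lookup factor
```

Because the expression is built before the join runs, it can be printed or checked once and then reused inside a larger loop without rebuilding it.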