Halting drake plan makes it rebuild targets it already had built previously
I'm currently using drake to run a set of more than 1,000 simulations. I've estimated that the complete set would take about two days to run, but I also expect my computer to crash at some point during that period because, well, it has before.
Apparently, stopping the plan discards any targets that were already built, which essentially means I can't use drake for its intended purpose.
I suppose I could write a function that edits the R file where the plan is specified, so that drake adds targets to its cache sequentially, but that seems utterly hackish.
Any ideas on how to deal with this?
EDIT: The actual problem seems to come from using set.seed inside my data-generating functions. I was aware that drake already sets seeds for the user in a way that ensures reproducibility, but I figured that leaving my functions the way they were wouldn't change anything, since the random seed I chose would always end up being the same. Guess not. Since I removed that step, things are caching fine, so the issue is solved.
To bring onlookers up to speed, I will try to spell out the problem. @zipzapboing, please correct me if my description is off-target.
Let's say you have a script that generates a drake plan and executes it.
library(drake)
simulate_data <- function(seed){
  set.seed(seed)
  rnorm(100)
}
seed_grid <- data.frame(
  id = paste0("target_", 1:3),
  seed = sample.int(1e6, 3)
)
print(seed_grid)
#> id seed
#> 1 target_1 581687
#> 2 target_2 700363
#> 3 target_3 914982
plan <- map_plan(seed_grid, simulate_data)
print(plan)
#> # A tibble: 3 x 2
#> target command
#> <chr> <chr>
#> 1 target_1 simulate_data(seed = 581687L)
#> 2 target_2 simulate_data(seed = 700363L)
#> 3 target_3 simulate_data(seed = 914982L)
make(plan)
#> target target_1
#> target target_2
#> target target_3
make(plan)
#> All targets are already up to date.
Created on 2018-11-12 by the reprex package (v0.2.1)
The second make() worked just fine, right? But if you were to run the same script in a different session, you would end up with a different plan. The randomly generated seed arguments to simulate_data() would be different, so the commands would no longer match the cached ones and all your targets would build from scratch.
library(drake)
simulate_data <- function(seed){
  set.seed(seed)
  rnorm(100)
}
seed_grid <- data.frame(
  id = paste0("target_", 1:3),
  seed = sample.int(1e6, 3)
)
print(seed_grid)
#> id seed
#> 1 target_1 654304
#> 2 target_2 252208
#> 3 target_3 781158
plan <- map_plan(seed_grid, simulate_data)
print(plan)
#> # A tibble: 3 x 2
#> target command
#> <chr> <chr>
#> 1 target_1 simulate_data(seed = 654304L)
#> 2 target_2 simulate_data(seed = 252208L)
#> 3 target_3 simulate_data(seed = 781158L)
make(plan)
#> target target_1
#> target target_2
#> target target_3
Created on 2018-11-12 by the reprex package (v0.2.1)
One solution is to be extra careful to hold onto the same plan. However, there is an even easier way: just let drake set the seeds for you. drake automatically gives each target its own reproducible random seed. These target-level seeds are generated deterministically from a root seed (the seed argument to make()) and the names of the targets.
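To make the idea of "root seed plus target name" concrete, here is a rough base-R sketch. It is not drake's actual implementation: derive_seed() and simulate_with_derived_seed() are hypothetical helpers, and the simple rolling hash is only an assumption chosen to keep the example dependency-free.

```r
# Hypothetical helper: map (root seed, target name) to an integer seed.
# This is NOT how drake computes its seeds; it only illustrates the principle.
derive_seed <- function(root_seed, target_name) {
  codes <- utf8ToInt(paste(root_seed, target_name))
  h <- 0
  for (code in codes) {
    # Polynomial rolling hash, kept below the maximum integer value.
    h <- (h * 31 + code) %% 2147483647
  }
  as.integer(h)
}

# Hypothetical helper: seed the RNG from the derived seed, then simulate.
simulate_with_derived_seed <- function(root_seed, target_name) {
  set.seed(derive_seed(root_seed, target_name))
  mean(rnorm(100))
}

first_run <- simulate_with_derived_seed(0, "target_1")
invisible(rnorm(1))  # Perturb the global RNG state, as a new session would.
second_run <- simulate_with_derived_seed(0, "target_1")

identical(first_run, second_run)  # Same root seed and name, same value.
derive_seed(0, "target_1") != derive_seed(0, "target_2")  # Names give distinct seeds.
```

Because the seed depends only on the root seed and the target name, nothing random needs to appear in the command itself, so the plan stays identical from one session to the next.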
library(digest)
library(drake)
library(magrittr) # defines %>%
simulate_data <- function(){
  mean(rnorm(100))
}
plan <- drake_plan(target = simulate_data()) %>%
  expand_plan(values = 1:3)
print(plan)
#> # A tibble: 3 x 2
#> target command
#> <chr> <chr>
#> 1 target_1 simulate_data()
#> 2 target_2 simulate_data()
#> 3 target_3 simulate_data()
tmp <- rnorm(1)
digest(.Random.seed) # Fingerprint of the current seed.
#> [1] "0bbddc33a4afe7cd1c1742223764661c"
make(plan)
#> target target_1
#> target target_2
#> target target_3
make(plan)
#> All targets are already up to date.
# The targets have different seeds and different values.
readd(target_1)
#> [1] -0.05530201
readd(target_2)
#> [1] 0.03698055
readd(target_3)
#> [1] 0.05990671
clean() # Destroy the targets.
tmp <- rnorm(1) # Change the global seed.
digest(.Random.seed) # The seed changed.
#> [1] "5993aa5cff4b72a0e14fa58dc5c5e3bf"
make(plan)
#> target target_1
#> target target_2
#> target target_3
# The targets were regenerated with the same values (same seeds).
readd(target_1)
#> [1] -0.05530201
readd(target_2)
#> [1] 0.03698055
readd(target_3)
#> [1] 0.05990671
# You can recover a target's seed from its metadata.
seed <- diagnose(target_1)$seed
print(seed)
#> [1] 1875584181
# And you can use that seed to reproduce
# the target's value outside make().
set.seed(seed)
mean(rnorm(100))
#> [1] -0.05530201
Created on 2018-11-12 by the reprex package (v0.2.1)
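If you really do need explicit seeds in the commands, the first option mentioned above (holding onto the same plan) could be sketched as follows, using base R only. The helper load_or_create_seed_grid() and the file name are hypothetical. The sketch writes to tempdir() so that it is safe to run as-is, but a real project would need a path that survives across sessions, such as a file in the project directory.

```r
# Hypothetical helper: draw the seed grid once, then reuse it from disk.
load_or_create_seed_grid <- function(path, n) {
  if (file.exists(path)) {
    return(readRDS(path))  # Later runs: reuse the saved seeds.
  }
  grid <- data.frame(
    id = paste0("target_", seq_len(n)),
    seed = sample.int(1e6, n),
    stringsAsFactors = FALSE
  )
  saveRDS(grid, path)  # First run: persist the seeds.
  grid
}

# tempdir() is per-session; substitute a persistent project path in practice.
seed_file <- file.path(tempdir(), "seed_grid.rds")

first_grid <- load_or_create_seed_grid(seed_file, 3)
second_grid <- load_or_create_seed_grid(seed_file, 3)
identical(first_grid, second_grid)  # The second call reads the saved grid.
```

Passing the saved grid to map_plan() would then produce the same commands every time, so previously built targets should stay up to date after a crash or restart.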
I really should write more in the manual about how seeds work in drake and highlight the original pitfall raised in this thread. I doubt you are the only one who has struggled with this issue.