Halting a drake plan makes it rebuild targets it had already built previously

I'm currently using drake to run a set of >1k simulations. I've estimated that it would take about two days to run the complete set, but I also expect my computer to crash at some point during that period because, well, it has before.

Apparently stopping the plan discards any targets that were already built, so essentially this means I can't use drake for its intended purpose.

I suppose I could write a function that edits the R file where the plan is specified, so that drake adds targets to its cache sequentially, but that seems utterly hackish.

Any ideas on how to deal with this?

EDIT: The actual problem seems to come from using set.seed inside my data generating functions. I was aware that drake already does this for the user in a way that ensures reproducibility, but I figured that leaving my functions the way they were wouldn't change anything, since the random seed I chose would always end up being the same. Guess not, but since I removed that step things are caching fine, so the issue is solved.

To bring onlookers up to speed, I will try to spell out the problem. @zipzapboing, please correct me if my description is off-target.

Let's say you have a script that generates a drake plan and executes it.

library(drake)

simulate_data <- function(seed){
  set.seed(seed)
  rnorm(100)
}

seed_grid <- data.frame(
  id = paste0("target_", 1:3),
  seed = sample.int(1e6, 3)
)

print(seed_grid)
#>         id   seed
#> 1 target_1 581687
#> 2 target_2 700363
#> 3 target_3 914982

plan <- map_plan(seed_grid, simulate_data)

print(plan)
#> # A tibble: 3 x 2
#>   target   command                      
#>   <chr>    <chr>                        
#> 1 target_1 simulate_data(seed = 581687L)
#> 2 target_2 simulate_data(seed = 700363L)
#> 3 target_3 simulate_data(seed = 914982L)

make(plan)
#> target target_1
#> target target_2
#> target target_3
make(plan)
#> All targets are already up to date.

Created on 2018-11-12 by the reprex package (v0.2.1)

The second make() worked just fine, right? But if you were to run the same script in a different session, you would end up with a different plan. The randomly generated seed arguments to simulate_data() would be different, so all your targets would build from scratch.

library(drake)

simulate_data <- function(seed){
  set.seed(seed)
  rnorm(100)
}

seed_grid <- data.frame(
  id = paste0("target_", 1:3),
  seed = sample.int(1e6, 3)
)

print(seed_grid)
#>         id   seed
#> 1 target_1 654304
#> 2 target_2 252208
#> 3 target_3 781158

plan <- map_plan(seed_grid, simulate_data)

print(plan)
#> # A tibble: 3 x 2
#>   target   command                      
#>   <chr>    <chr>                        
#> 1 target_1 simulate_data(seed = 654304L)
#> 2 target_2 simulate_data(seed = 252208L)
#> 3 target_3 simulate_data(seed = 781158L)

make(plan)
#> target target_1
#> target target_2
#> target target_3

Created on 2018-11-12 by the reprex package (v0.2.1)
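
The plan differs between sessions only because sample.int() draws fresh seeds on every run. As a minimal sketch of one way to keep the grid stable, assuming you are free to store it on disk, you could draw the seeds once, save them, and reload them in later sessions. The helper name get_seed_grid() and the file location are hypothetical and not part of drake.

```r
# Hypothetical helper (not part of drake): reuse a saved seed grid if one
# exists, otherwise draw the seeds once and save them for later sessions.
get_seed_grid <- function(path, n) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  grid <- data.frame(
    id = paste0("target_", seq_len(n)),
    seed = sample.int(1e6, n),
    stringsAsFactors = FALSE
  )
  saveRDS(grid, path)
  grid
}

# Assumed location for this demo only. A real project would use a path
# that survives across sessions, e.g. a file next to the script.
grid_file <- file.path(tempdir(), "seed_grid.rds")

seed_grid <- get_seed_grid(grid_file, 3)        # draws and saves (or loads)
seed_grid_again <- get_seed_grid(grid_file, 3)  # loads the saved grid
```

Because the second call reads the stored grid instead of sampling again, a plan built from it would contain the same commands, so previously built targets would stay up to date.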

One solution is to be extra careful to hold onto the same plan. However, there is an even easier way: just let drake set the seeds for you. drake automatically gives each target its own reproducible random seed. These target-level seeds are deterministically generated from a root seed (the seed argument to make()) and the names of the targets.

library(digest)
library(drake)
library(magrittr) # defines %>%

simulate_data <- function(){
  mean(rnorm(100))
}

plan <- drake_plan(target = simulate_data()) %>%
  expand_plan(values = 1:3)

print(plan)
#> # A tibble: 3 x 2
#>   target   command        
#>   <chr>    <chr>          
#> 1 target_1 simulate_data()
#> 2 target_2 simulate_data()
#> 3 target_3 simulate_data()

tmp <- rnorm(1)
digest(.Random.seed) # Fingerprint of the current seed.
#> [1] "0bbddc33a4afe7cd1c1742223764661c"

make(plan)
#> target target_1
#> target target_2
#> target target_3
make(plan)
#> All targets are already up to date.

# The targets have different seeds and different values.
readd(target_1)
#> [1] -0.05530201
readd(target_2)
#> [1] 0.03698055
readd(target_3)
#> [1] 0.05990671

clean() # Destroy the targets.
tmp <- rnorm(1) # Change the global seed.
digest(.Random.seed) # The seed changed.
#> [1] "5993aa5cff4b72a0e14fa58dc5c5e3bf"

make(plan)
#> target target_1
#> target target_2
#> target target_3

# The targets were regenerated with the same values (same seeds).
readd(target_1)
#> [1] -0.05530201
readd(target_2)
#> [1] 0.03698055
readd(target_3)
#> [1] 0.05990671

# You can recover a target's seed from its metadata.
seed <- diagnose(target_1)$seed
print(seed)
#> [1] 1875584181

# And you can use that seed to reproduce
# the target's value outside make().
set.seed(seed)
mean(rnorm(100))
#> [1] -0.05530201

Created on 2018-11-12 by the reprex package (v0.2.1)
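
To make the idea of name-based seeds more concrete, here is a rough base R sketch. It is not drake's actual algorithm: target_seed() and simulate_with_seed() are hypothetical helpers that simply fold a root seed and a target name into one integer, to illustrate why the result does not depend on the state of the global RNG.

```r
# Illustrative only: this is NOT how drake computes its seeds.
# Fold a root seed and a target name into a single integer seed.
target_seed <- function(root_seed, target_name) {
  modulus <- 2147483647  # keeps the result inside R's integer range
  h <- root_seed %% modulus
  for (code in utf8ToInt(target_name)) {
    h <- (h * 31 + code) %% modulus
  }
  as.integer(h)
}

# Hypothetical wrapper: seed the RNG from the target's name, then simulate.
simulate_with_seed <- function(root_seed, target_name) {
  set.seed(target_seed(root_seed, target_name))
  mean(rnorm(100))
}

first_run <- simulate_with_seed(0, "target_1")
tmp <- rnorm(1)  # Disturb the global RNG state.
second_run <- simulate_with_seed(0, "target_1")

# Same root seed and same name give the same value both times.
identical(first_run, second_run)
```

The point of the design is that the seed depends only on fixed inputs (a root seed and a name), so the plan itself never has to carry randomly drawn seed arguments.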

I really should write more in the manual about how seeds work in drake and highlight the original pitfall raised in this thread. I doubt you are the only one who has struggled with this issue.
