Problem reproducing results from parallelSVM in R

I am not able to set a seed value to get reproducible results from parallelSVM().

library(e1071)
library(parallelSVM)

data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

set.seed(1)
model <- parallelSVM(x, y)
parallelPredictions <- predict(model, x)

set.seed(1)
model2 <- parallelSVM(x, y)
parallelPredictions2 <- predict(model2, x)

all.equal(parallelPredictions, parallelPredictions2)

I know that this is not the right way to set a seed value for multicore operations, but I have no clue what to do instead.

I know there is an option for this when using mclapply, but that does not help in my situation.


Edit:
I have found a solution by modifying the function trainSample() within parallelSVM via trace() and using the doRNG package to seed the foreach loop.

Does anybody know a better solution?

In short, there is no method implemented in parallelSVM to handle this issue. However, the package uses the foreach and doParallel packages to handle its parallel operations, and after digging hard enough on Stack Overflow, a solution is possible!

Credit goes to this answer on the usage of the doRNG package, and to this answer for giving me the idea for a simpler, self-contained solution.

Solution:

In the parallelSVM package the parallelization is set up by the parallelSVM::registerCores function. This function simply calls doParallel::registerDoParallel with the number of cores and no further arguments. My idea is simply to change the parallelSVM::registerCores function so that it automatically sets the seed after creating a new cluster.

When performing a parallel computation that needs a parallel seed, there are 2 things you need to ensure:

  1. The seed needs to be given to each node in the cluster.
  2. The generator needs to be one that is asymptotically random across clusters.
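As a minimal sketch of these two requirements using only the base parallel package (not parallelSVM itself): parallel::clusterSetRNGStream() hands every node its own stream of the L'Ecuyer-CMRG generator, derived from a single seed. The helper name run_once and the choice of 2 workers are assumptions made for illustration.

```r
library(parallel)

# Hypothetical helper: start a fresh cluster, seed every node, draw one
# random number per node, and shut the cluster down again.
run_once <- function(seed, workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl))
  # Requirement 1: the seed reaches each node.
  # Requirement 2: each node gets its own L'Ecuyer-CMRG stream.
  clusterSetRNGStream(cl, iseed = seed)
  parSapply(cl, seq_len(workers), function(i) runif(1))
}

a <- run_once(1)
b <- run_once(1)
identical(a, b)  # same seed, same per-node streams
```

Calling set.seed() only in the master session would not have this effect, because the freshly started worker processes do not inherit the master's RNG state.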

Luckily, the doRNG package handles the first point and uses a generator that is fine with respect to the second. Using a combination of unlockBinding and assign we can overwrite parallelSVM::registerCores so that it includes a call to doRNG::registerDoRNG with the appropriate seed (the function is at the end of this answer). Doing this, we actually get proper reproducibility, as illustrated below:

library(parallelSVM)
library(e1071)
data(magicData)
set.seed.parallelSWM(1) #<=== set seed as we would normally.
#Example from help(parallelSVM)
system.time(parallelSvm1 <- parallelSVM(V11 ~ ., data = trainData[,-1],
                                       numberCores = 4, samplingSize = 0.2, 
                                       probability = TRUE, gamma=0.1, cost = 10))
system.time(parallelSvm2 <- parallelSVM(V11 ~ ., data = trainData[,-1],
                                       numberCores = 4, samplingSize = 0.2, 
                                       probability = TRUE, gamma=0.1, cost = 10))
pred1 <- predict(parallelSvm1)
pred2 <- predict(parallelSvm2)
all.equal(pred1, pred2)
[1] TRUE
identical(parallelSvm1, parallelSvm2)
[1] FALSE

Note that identical is not suited to comparing the objects returned by parallelSVM::parallelSVM, so comparing the predictions is a better way to check whether the models are the same.

For safety, let's check whether this also holds for the reproducible example in the question:

x <- subset(iris, select = -Species)
y <- iris$Species
set.seed.parallelSWM(1) #<=== set seed as we would normally (not necessary if above example has been run).
model <- parallelSVM(x, y)
model2 <- parallelSVM(x, y)
parallelPredictions <- predict(model, x)
parallelPredictions2 <- predict(model2, x)
all.equal(parallelPredictions, parallelPredictions2)
[1] TRUE

Phew..

Last, if we are done, or if we want random seeds once again, we can reset the seed by executing

set.seed.parallelSWM() #<=== set seed to random each execution (standard).
#check:
model <- parallelSVM(x, y)
model2 <- parallelSVM(x, y)
parallelPredictions <- predict(model, x)
parallelPredictions2 <- predict(model2, x)
all.equal(parallelPredictions, parallelPredictions2)
[1] "3 string mismatches"

(the output will vary, as the RNG seed is not set)

set.seed.parallelSWM function

Credit goes to this answer. Note that we might not have to double up on the assignment, but here I simply replicated the answer without checking whether the code could be reduced further.

set.seed.parallelSWM <- function(seed, once = TRUE){
    if(missing(seed) || is.character(seed)){
        # No usable seed: use the plain, unseeded registration.
        out <- function (numberCores) 
        {
            cluster <- parallel::makeCluster(numberCores)
            doParallel::registerDoParallel(cluster)
        }
    }else{
        # Fail early if doRNG is not installed.
        if(!requireNamespace("doRNG", quietly = TRUE))
            stop("package 'doRNG' is required to set a parallel seed")
        # Seeded registration: doRNG makes the foreach loops reproducible.
        out <- function(numberCores){
            cluster <- parallel::makeCluster(numberCores)
            doParallel::registerDoParallel(cluster)
            doRNG::registerDoRNG(seed = seed, once = once)
        }
    }
    # Overwrite registerCores in the attached package environment ...
    unlockBinding("registerCores", as.environment("package:parallelSVM"))
    assign("registerCores", out, "package:parallelSVM")
    lockBinding("registerCores", as.environment("package:parallelSVM"))
    # ... and in the package namespace.
    unlockBinding("registerCores", getNamespace("parallelSVM"))
    assign("registerCores", out, getNamespace("parallelSVM"))
    lockBinding("registerCores", getNamespace("parallelSVM"))
    invisible()
}
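To see in isolation what the doRNG::registerDoRNG call contributes, here is a minimal sketch that is independent of parallelSVM. It assumes the doParallel and doRNG packages are installed; the helper name draw, the 2 workers and the 4 iterations are made up for illustration.

```r
library(doParallel)  # also attaches foreach
library(doRNG)

# Hypothetical helper: register a fresh cluster, seed it through doRNG,
# and draw one random number per foreach iteration.
draw <- function(seed) {
  cl <- parallel::makeCluster(2)
  on.exit(parallel::stopCluster(cl))
  doParallel::registerDoParallel(cl)
  # After this call, %dopar% loops are run reproducibly by doRNG.
  doRNG::registerDoRNG(seed = seed)
  foreach(i = 1:4, .combine = c) %dopar% runif(1)
}

r1 <- draw(1)
r2 <- draw(1)
# Compare the plain values; the results also carry RNG attributes.
all.equal(as.numeric(r1), as.numeric(r2))
```

The same pattern is what the replacement registerCores applies each time it creates and registers a new cluster.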
