
Problem reproducing results from parallelSVM in R

I am not able to set a seed value that gives reproducible results from parallelSVM().

 library(e1071)
 library(parallelSVM)

 data(iris)
 x <- subset(iris, select = -Species)
 y <- iris$Species

set.seed(1)
model       <- parallelSVM(x, y)
parallelPredictions <- predict(model, x)

set.seed(1)
model2       <- parallelSVM(x, y)
parallelPredictions2 <- predict(model2, x)

all.equal(parallelPredictions,parallelPredictions2) 

I know that calling set.seed() like this is not the right way to seed multicore operations, but I have no idea what to do instead.

I know there is an option for this when using mclapply, but that does not help in my situation.


Edit:
I have found a solution by modifying the function trainSample() within parallelSVM using a trace, together with the doRNG package to seed the foreach loop.

Does anybody know a better solution?

In short, parallelSVM has no built-in way to handle this. However, the package uses the foreach and doParallel packages for its parallel operations, and after digging through Stack Overflow a solution is possible.

Credits to this answer, on the usage of the doRNG package, and this answer, which gave me the idea for a simpler, self-contained solution.

Solution:

In the parallelSVM package the parallelization is set up by the parallelSVM::registerCores function. This function simply calls doParallel::registerDoParallel with the number of cores and no further arguments. My idea is to replace parallelSVM::registerCores so that it automatically sets the seed right after creating a new cluster.
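As a rough illustration of the pattern this relies on, outside of parallelSVM: register the doParallel backend first, then call doRNG::registerDoRNG with a seed so that subsequent %dopar% loops draw from seeded streams. This is a minimal sketch that assumes the foreach, doParallel and doRNG packages are installed; the two-worker cluster and the runif() loop body are placeholders chosen for the example.

```r
library(foreach)
library(doParallel)
library(doRNG)

cl <- parallel::makeCluster(2)       # two local worker processes (arbitrary choice)
doParallel::registerDoParallel(cl)   # register the parallel backend first

doRNG::registerDoRNG(seed = 1)       # then make %dopar% loops use seeded RNG streams
r1 <- foreach(i = 1:4, .combine = c) %dopar% runif(1)

doRNG::registerDoRNG(seed = 1)       # re-register to start again from the same seed
r2 <- foreach(i = 1:4, .combine = c) %dopar% runif(1)

parallel::stopCluster(cl)
identical(r1, r2)
```

Re-registering before each loop mirrors what happens when registerCores is called again on every parallelSVM() call.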

When performing a parallel computation that needs a reproducible seed, there are two things you need to ensure:

  1. The seed needs to be given to each node in the cluster.
  2. The generator needs to be one that is asymptotically random across the nodes of the cluster, i.e. the random streams on different nodes should not overlap or be correlated.
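Both requirements can be seen in isolation with base R's parallel package, which is not what parallelSVM uses internally but shows the same idea. In this sketch, clusterSetRNGStream() switches every worker to the L'Ecuyer-CMRG generator and hands each one its own stream derived from a single seed. The helper name draw_on_workers and the two-worker cluster are assumptions made for the example.

```r
library(parallel)

# Hypothetical helper: start a cluster, seed it, and draw random numbers on the workers
draw_on_workers <- function(seed, n_workers = 2) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  # Requirement 1: every node receives a seed derived from `seed`
  # Requirement 2: L'Ecuyer-CMRG gives each node a separate stream
  clusterSetRNGStream(cl, iseed = seed)
  # parLapply splits the tasks statically, so each task always runs on the same node
  parLapply(cl, 1:4, function(i) runif(2))
}

a <- draw_on_workers(1)
b <- draw_on_workers(1)
identical(a, b)            # same seed, same task allocation
identical(a[[1]], a[[3]])  # tasks 1 and 3 run on different nodes and streams
```

Reproducibility here depends on the tasks being allocated to workers deterministically, which is why parLapply is used rather than a load-balancing variant.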

Luckily the doRNG package handles the first point and uses a generator that is acceptable on the second. Using a combination of unlockBinding and assign we can overwrite parallelSVM::registerCores so that it includes a call to doRNG::registerDoRNG with the appropriate seed (the function is at the end of this answer). Doing this, we actually get proper reproducibility, as illustrated below:

library(parallelSVM)
library(e1071)
data(magicData)
set.seed.parallelSWM(1) #<=== set seed as we would normally.
#Example from help(parallelSVM)
system.time(parallelSvm1 <- parallelSVM(V11 ~ ., data = trainData[,-1],
                                       numberCores = 4, samplingSize = 0.2, 
                                       probability = TRUE, gamma=0.1, cost = 10))
system.time(parallelSvm2 <- parallelSVM(V11 ~ ., data = trainData[,-1],
                                       numberCores = 4, samplingSize = 0.2, 
                                       probability = TRUE, gamma=0.1, cost = 10))
pred1 <- predict(parallelSvm1)
pred2 <- predict(parallelSvm2)
all.equal(pred1, pred2)
[1] TRUE
identical(parallelSvm1, parallelSvm2)
[1] FALSE

Note that identical is not well suited to comparing the objects returned by parallelSVM::parallelSVM, so comparing the predictions is a better way to check whether the models are the same.

For safety, let's check whether this also holds for the reproducible example in the question:
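Which component of the parallelSVM objects makes identical return FALSE is not established here. As a general illustration only, the sketch below shows one way two objects with the same numeric content can still fail identical: they each hold a separately created environment, and environments are compared by reference. The make_model helper and its fields are hypothetical stand-ins, not parallelSVM internals.

```r
# Hypothetical stand-in for a fitted model: same numbers, but a fresh environment each time
make_model <- function() {
  list(coef = c(1.5, -2), scratch = new.env())
}

m1 <- make_model()
m2 <- make_model()

identical(m1, m2)            # FALSE: the two environments are distinct objects
identical(m1$coef, m2$coef)  # TRUE: the numeric content matches
```

Comparing the outputs that matter (here the coefficients, in the answer the predictions) sidesteps such differences.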

x <- subset(iris, select = -Species)
y <- iris$Species
set.seed.parallelSWM(1) #<=== set seed as we would normally (not necessary if above example has been run).
model       <- parallelSVM(x, y)
model2       <- parallelSVM(x, y)
parallelPredicitions <- predict(model, x)
parallelPredicitions2 <- predict(model2, x)
all.equal(parallelPredicitions, parallelPredicitions2)
[1] TRUE

Phew..

Last, if we are done, or if we wanted random seeds once again, we can reset the seed by executing

set.seed.parallelSWM() #<=== set seed to random each execution (standard).
#check:
model       <- parallelSVM(x, y)
model2       <- parallelSVM(x, y)
parallelPredicitions <- predict(model, x)
parallelPredicitions2 <- predict(model2, x)
all.equal(parallelPredicitions, parallelPredicitions2)
[1] "3 string mismatches"

(the output will vary, as the RNG seed is not set)

set.seed.parallelSWM function

Credits to this answer. Note that we might not have to double up on the assignment, but here I simply replicated that answer without checking whether the code could be reduced further.

set.seed.parallelSWM <- function(seed, once = TRUE){
    if(missing(seed) || is.character(seed)){
        # No (numeric) seed given: restore the unseeded behaviour
        out <- function (numberCores) 
        {
            cluster <- parallel::makeCluster(numberCores)
            doParallel::registerDoParallel(cluster)
        }
    }else{
        # Fail early with a clear message if doRNG is not installed
        if(!require("doRNG", quietly = TRUE, character.only = TRUE))
            stop("package 'doRNG' is required to set a parallel seed")
        # Seeded replacement: register the backend, then seed the foreach loops
        out <- function(numberCores){
            cluster <- parallel::makeCluster(numberCores)
            doParallel::registerDoParallel(cluster)
            doRNG::registerDoRNG(seed = seed, once = once)
        }
    }
    # Overwrite the copy visible on the search path (attached package) ...
    unlockBinding("registerCores", as.environment("package:parallelSVM"))
    assign("registerCores", out, "package:parallelSVM")
    lockBinding("registerCores", as.environment("package:parallelSVM"))
    # ... and the copy in the namespace, which parallelSVM itself calls
    unlockBinding("registerCores", getNamespace("parallelSVM"))
    assign("registerCores", out, getNamespace("parallelSVM"))
    lockBinding("registerCores", getNamespace("parallelSVM"))
    invisible()
}
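The unlock, assign, lock sequence used in the function can be tried safely on a toy environment instead of a real package namespace. This is a minimal sketch using only base R; the environment ns and the two dummy functions are made up for the example and merely stand in for the parallelSVM namespace and registerCores.

```r
# Toy stand-in for a package namespace holding a locked function
ns <- new.env()
assign("registerCores", function(n) "original", envir = ns)
lockBinding("registerCores", ns)

# While the binding is locked, a direct assignment raises an error
blocked <- inherits(
  try(assign("registerCores", function(n) "patched", envir = ns), silent = TRUE),
  "try-error"
)

# Unlock, replace, and lock again, as done for parallelSVM::registerCores
unlockBinding("registerCores", ns)
assign("registerCores", function(n) "patched", envir = ns)
lockBinding("registerCores", ns)

blocked                 # TRUE: the first assignment was refused
ns$registerCores(1)     # "patched"
```

Patching a package this way only lasts for the current R session and can break if the package changes its internals, so it is best treated as a workaround.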
