
Setting up an H2O cluster while running several R programs (close to 20) that need access to the cluster

I have to run the same R script in parallel (via batches) with different parameters. The R script builds and scores an H2O model. In this case, should I:

  1. Set up an individual cluster for each batch run of the R script?

(OR)

  2. Create a common cluster and set the scripts to use it?

I would prefer the latter solution, but I am not sure how to automate initializing and shutting down the H2O cluster for so many batches. The first batch would have to create the cluster (h2o.init()) and the last batch would have to shut it down.

Setting up an individual h2o cluster per R session is ideal.

When initiating an h2o cluster with h2o::h2o.init(), make sure you specify these differently for each R session (each script running its own R session):

  • ip / port (a port on localhost that is not already taken)
  • name (so you can check its progress/usage in a terminal via top/htop)

Change other options as required. Each R session knows which h2o cluster it is attached to, so h2o::h2o.shutdown() will only shut down that specific cluster.
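As a minimal sketch of what each script might do, assuming the standard h2o R package; the port, cluster name, and resource limits below are placeholders that each batch would supply differently:

```r
library(h2o)

# Hypothetical: each batch passes its own port and cluster name,
# e.g. via commandArgs(); the values here are placeholders.
args <- commandArgs(trailingOnly = TRUE)
port <- as.integer(args[1])      # e.g. 54321, 54323, ... (h2o also uses port + 1 internally)
cluster_name <- args[2]          # e.g. "batch_01"

h2o.init(ip = "localhost",
         port = port,
         name = cluster_name,    # cloud name, handy when inspecting the java processes in top/htop
         nthreads = 2,           # limit threads so ~20 clusters don't oversubscribe the machine
         max_mem_size = "2G")    # per-cluster memory cap

# ... build and score the H2O model here ...

# Shuts down only the cluster this R session is attached to.
h2o.shutdown(prompt = FALSE)
```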

Setting up a single cluster and having all scripts use it is the recommended approach, because it is more efficient. There is memory overhead for each cluster, so your 20 separate clusters would be wasteful (even more so if there are any static data tables all your scripts need to use). You would also have to guess the correct amount of memory to give each one.

On the other hand, if your 20 scripts each refer to a specific table by name, e.g. loading it with their own data, and generally assume they are the only script running, you will have a problem: you either need to modify the scripts to be well-behaved or run each on its own ip/port.
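If they do share one cluster, one way to keep the scripts well-behaved is to give every frame a batch-specific key. A minimal sketch, assuming the h2o R package's destination_frame argument; batch_id and the file path are placeholders:

```r
library(h2o)

# Connect to the shared cluster (assumed to be already running on this port).
h2o.init(ip = "localhost", port = 54321, startH2O = FALSE)

# Hypothetical batch identifier, e.g. passed in via commandArgs().
batch_id <- commandArgs(trailingOnly = TRUE)[1]

# A batch-specific key prevents scripts from clobbering each other's frames.
train <- h2o.importFile("train.csv",
                        destination_frame = paste0("train_", batch_id))
```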

I am not sure how to automate initializing and shutting down the H2O cluster for so many batches. The first batch has to create the cluster (h2o.init()) and the last batch has to shut it down.

Start H2O from the command line before running the first script, and manually kill it after all the scripts have completed. Done this way, each script will discover that the cluster is already running when it calls h2o.init().
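A minimal sketch of that setup; the java launch line, jar path, memory, port, and cluster name are placeholders, and startH2O = FALSE tells the h2o R package to connect to an existing cluster rather than launch its own:

```r
library(h2o)

# The cluster is assumed to have been started beforehand from a shell, e.g.:
#   java -Xmx8g -jar h2o.jar -port 54321 -name shared_cluster
# (jar path, memory, port, and name above are placeholders).

# Connect only; fail if no cluster is found at this address.
h2o.init(ip = "localhost", port = 54321, startH2O = FALSE)

# ... script body: build and score the model ...

# Do NOT call h2o.shutdown() here; the shared cluster stays up
# until every batch has finished and it is killed manually.
```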

If it has to be fully automatic, make sure the launch command runs first, and you'll need some kind of watcher script to notice when all the other processes have completed. (I tend to run a combination of ps and grep in cron jobs; there are more sophisticated ways, of course.)
