
Using SLURM arrays to divide an R script into sub-jobs?

I have an R script that I would like to divide into several jobs, each one running on a node of the cluster.

res <- foreach(i = seq_len(nrow(combs))) %dopar% {
  G1 <- split[[combs[i, 1]]]
  G2 <- split[[combs[i, 2]]]
  bind <- cbind(data[, G1], data[, G2])
  rho.i <- cor_rho(bind)  # cor_rho is a function I wrote
}

This is the code I would like to parallelize. I divide a big matrix into submatrices, and I compute the correlations between each pair of these submatrices:

submatrix 1 vs submatrix 2: node 1
submatrix 1 vs submatrix 3: node 2
etc.

I tried something like this (assuming, for example, that I have 10 combinations to compute); I won't show the whole SLURM script:

#SBATCH --array=1-10

Rscript my_R_script > my_output

It creates 10 array tasks, but I wonder whether each task performs one computation. In other words, does one array task = one node = one comparison between two submatrices?

Best regards

Edit :

This is what combs looks like:

> combs
      [,1] [,2]
 [1,]    1    2
 [2,]    1    3
 [3,]    1    4
 [4,]    1    5
 [5,]    2    3
 [6,]    2    4
 [7,]    2    5
 [8,]    3    4
 [9,]    3    5
[10,]    4    5
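For reference, a pairs matrix like this can be generated with base R's combn; a minimal sketch, assuming 5 submatrices:

```r
# All unordered pairs of 5 submatrices: the same 10 x 2 matrix as above
combs <- t(combn(5, 2))
nrow(combs)   # 10
combs[3, ]    # 1 4
```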


combs <- combs[opt$subset, ]  # opt$subset holds the SLURM_ARRAY_TASK_ID

#The loop which computes each combination

res <- foreach(i = seq_len(nrow(combs))) %dopar% {
  G1 <- split[[combs[i, 1]]]
  G2 <- split[[combs[i, 2]]]
  dat.i <- cbind(data[, G1], data[, G2])
  rho <- cor_rho(dat.i)
}

#I fill my final matrix

resMAT <- matrix(0, ncol(data), ncol(data))

for (i in 1:nrow(combs)) {
  batch1 <- split[[combs[i, 1]]]
  batch2 <- split[[combs[i, 2]]]
  patch.i <- c(batch1, batch2)
  resMAT[patch.i, patch.i] <- res[[i]]
}

Then, my SLURM script:

#!/bin/bash
#SBATCH -o slurmjob-%A-%a.out
#SBATCH --job-name=parallel_nodes
#SBATCH --partition=normal
#SBATCH --time=1-00:00:00
#SBATCH --array=1-10

#Set up whatever package we need to run with

module load gcc/8.1.0 openblas/0.3.3 R

# SET UP DIRECTORIES

OUTPUT="$HOME"/PROJET_M2/data/$(date +"%Y%m%d")_parallel_nodes
mkdir -p "$OUTPUT"

echo $SLURM_ARRAY_TASK_ID

subset=$((SLURM_ARRAY_TASK_ID))

Rscript my_R_code > "$OUTPUT"/"$SLURM_ARRAY_TASK_ID"

I execute this script with :

sbatch --partition normal --array 1-10 RHO_COR.sh

And I get an error message:

Error in combs[i, 1] : index out of bounds

I wonder whether each task performs one computation.

Each run of the array can run one script (or potentially several).

In other words, does one array task = one node = one comparison between two submatrices?

Yes, you can do it like this, though you probably want to specify which one of the comparisons each task should compute.

I really don't know how I could specify which comparison to compute in which array task.

There are many ways to specify which comparison should be computed by which array task. For example, you could use the array task number as an argument for the selection: with a list of n comparisons and n array tasks, task number k picks the comparison at position k in the list. NB: you probably also want to name your outputs appropriately; otherwise you will try to create n different output files all with the same name, which can cause trouble if they are in the same location.
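Concretely, one way to do both (a sketch, assuming the R script is named my_R_script.R and reads its first command-line argument; both names are illustrative):

```shell
#!/bin/bash
#SBATCH -o slurmjob-%A-%a.out
#SBATCH --array=1-10

# Each task reads its own ID; default to 1 so the script can be
# dry-run outside of SLURM.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
OUTFILE="output_${TASK_ID}.txt"

# Per task, you would launch:
#   Rscript my_R_script.R "$TASK_ID" > "$OUTFILE"
# and inside the R script read the ID back with:
#   subset <- as.integer(commandArgs(trailingOnly = TRUE)[1])
echo "task $TASK_ID -> $OUTFILE"
```

Because the output file name contains the task ID, the 10 tasks no longer compete for a single file.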

And I get an error message: Error in combs[i, 1] : index out of bounds

This is caused by a mismatch between the dimensions of combs after subsetting and your indices, i.e. you are trying to access a position in combs that doesn't exist.
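One common way this happens in R: subsetting a matrix with a single row index drops the result to a plain vector, so combs[i, 1] then fails. Adding drop = FALSE to the subsetting step keeps the one-row matrix shape; a minimal sketch with the 10-row combs from the question:

```r
combs <- t(combn(5, 2))             # the 10 x 2 matrix of pairs

bad  <- combs[3, ]                  # dropped to a vector: bad[1, 1] errors
good <- combs[3, , drop = FALSE]    # still a 1 x 2 matrix

is.matrix(bad)   # FALSE
good[1, 1]       # 1
good[1, 2]       # 4
```

So in the script above, combs <- combs[opt$subset, , drop = FALSE] would keep combs a matrix even when opt$subset selects a single row.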
