Parallelizing an Rscript using a job array in Slurm

Question

I want to run an R script, Rscript.R, as a Slurm array job with tasks 1-10, where each task passes its task id to the R script, which then writes a file named "'task id'.out" containing the task id in its body. This has proven more challenging than I anticipated. I am trying the following:

~/bash_test.sh looks like:

#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test 
R CMD BATCH --no-save --no-restore ~/Rscript_test.R $SLURM_ARRAY_TASK_ID

~/Rscript_test.R looks like:

#!/usr/bin/env Rscript
taskid = commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
taskid <- as.data.frame(taskid)
# print task number
print(paste0("the number processed was... ", taskid))
write.table(taskid, paste0("~/test/",taskid,".out"),quote=FALSE, row.names=FALSE, col.names=FALSE)

After I submit my job ( sbatch bash_test.sh ), it looks like R is not really seeing SLURM_ARRAY_TASK_ID . The script is generating 10 files (1, 2, ..., 10 - just numbers - probably corresponding to the task ids), but it's not writing the files with the extension ".out": the script wrote an empty "integer(0).out" file.

What I wanted was to populate the folder ~/test/ with 10 files, 1.out, 2.out, ..., 10.out, each containing its task id (simply the number 1, 2, ..., or 10, respectively).

PS: Note that I tried playing with Sys.getenv() too, but I don't think I was able to set that up properly. That option generates 10 files, and one 1.out file, containing number 10.

PS2: This is Slurm 19.05.5. I am running R within a conda environment.

Answer 1

You should avoid using "R CMD BATCH". It doesn't handle arguments the way most command-line tools do. "Rscript" has been the recommended option for a while now. By calling "R CMD BATCH" you are basically ignoring the "#!/usr/bin/env Rscript" part of your script.

So change your script file to

#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test 
Rscript ~/Rscript_test.R $SLURM_ARRAY_TASK_ID

And then be careful in your script that you aren't using the same variable as both a string and a data.frame. You can't easily paste a data.frame into a file path, for example. So

taskid <- commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID')  # This should also work

print(paste0("the number processed was... ", taskid))

outdata <- as.data.frame(taskid)
outfile <- paste0("~/test/", taskid, ".out")

write.table(outdata, outfile, quote=FALSE, row.names=FALSE, col.names=FALSE)
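Once the array has finished, a quick way to confirm the result is to check that every expected file exists and holds its own task id. The following is a minimal sketch, not part of Slurm or R: check_task_outputs is a hypothetical helper name, and it assumes each file contains only the task id on a single line, which is what the write.table call above should produce.

```bash
#!/bin/bash
# Sketch: report expected task outputs that are missing or have the wrong content.
# check_task_outputs is a hypothetical helper, not part of Slurm or R.
check_task_outputs() {
    dir="$1"     # directory holding the .out files, e.g. ~/test
    last="$2"    # highest array index, e.g. 10
    rc=0
    i=1
    while [ "$i" -le "$last" ]; do
        file="${dir}/${i}.out"
        if [ ! -f "$file" ]; then
            echo "missing: ${i}.out"
            rc=1
        elif [ "$(cat "$file")" != "$i" ]; then
            # Assumes the file body is exactly the task id
            echo "unexpected content: ${i}.out"
            rc=1
        fi
        i=$((i + 1))
    done
    # Non-zero exit status if any file was missing or wrong
    return "$rc"
}

# Example call after all tasks finish:
# check_task_outputs ~/test 10
```

The function prints nothing and returns 0 when all files look right, so it can be chained after the job in a follow-up script.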

The extra files with just the array number were created because the usage of R CMD BATCH is

R CMD BATCH [options] infile [outfile]

So the $SLURM_ARRAY_TASK_ID value you were passing at the command line was treated as the outfile name, which left commandArgs(trailingOnly=TRUE) empty inside the script. Instead, that value needed to be passed as part of the options. But again, it's better to use Rscript, which has more standard argument conventions.
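For completeness, if R CMD BATCH has to stay, the task id can be placed inside a quoted "--args ..." option before the infile. The sketch below wraps that in a hypothetical function, run_task_with_batch, and names the outfile "<task id>.Rout"; both the function name and the outfile name are assumptions for illustration, and this has not been tested on Slurm 19.05.5.

```bash
#!/bin/bash
# Sketch: pass the array task id to R CMD BATCH through --args.
# run_task_with_batch is a hypothetical wrapper, not an R or Slurm command.
run_task_with_batch() {
    # $1 = task id, $2 = path to the R script
    # Whatever follows --args inside the quoted option is what
    # commandArgs(trailingOnly=TRUE) returns in R.
    # The final argument is the outfile (R's console log), named per task here.
    R CMD BATCH --no-save --no-restore "--args $1" "$2" "$1.Rout"
}

# In the job script this would be called as:
# run_task_with_batch "$SLURM_ARRAY_TASK_ID" ~/Rscript_test.R
```

With this form the console log goes to 1.Rout, 2.Rout, and so on, instead of to files named with the bare task id.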

solution1
3 ACCEPTED 2021-02-11 18:25:23
