I'm running an R script on a Slurm cluster with the following ".sh" file:
#!/bin/bash
#SBATCH --partition=p_parallel
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --workdir=/work/uder2/ODE/lancio/
module load statistics/r-3.6.1
srun Rscript TEST.R
The R script is quite simple, something like
DIRbase = "/work/uder2/ODE/"
DIRdata = paste(DIRbase,"data/",sep="")
list.files(DIRdata)
load(paste(DIRdata,"Data.Rdata",sep=""))
NAME = "PriorU"
ialg = 3
nG = 500
LimEta = 40
LimMu2 = 15
LimMin = 500
LimMu = 0.1
LimSpike = 10
LimSigma2 = (8)^2/(-2*log(LimMu))*1.2
NAME = paste(NAME,"_ng",nG, sep="")
### ### ### ### ### ### ### ###
### MODELS
### ### ### ### ### ### ### ###
DATA = allGenesData
nrowData = nrow(DATA$premature)
sd1 = as.numeric(apply(DATA$premature,1,var))
sd2 = as.numeric(apply(DATA$mature,1,var))
sd3 = as.numeric(apply(DATA$nascent,1,var))
epsi = 0.000001
App = c(which(sd1<=epsi),which(sd2<=epsi),which(sd3<=epsi))
App2 = c(which(sd1>50),which(sd2>100000),which(sd3>1500))
minep = 0.1
xy1 = as.numeric(apply(DATA$premature,1,min))
xy2 = as.numeric(apply(DATA$mature,1,min))
xy3 = as.numeric(apply(DATA$nascent,1,min))
App3 = c(which(xy1<=minep),which(xy2<=minep),which(xy3<=minep))
The actual code is much longer, but I don't think its content matters here.
What happens is that, sometimes, the code is not read properly. For example, instead of
App3 = c(which(xy1<=minep),which(xy2<=minep),which(xy3<=minep))
what gets read is
App3 which(xy1<=minep),which(xy2<=minep),which(xy3<=minep))
Then, without touching the code, if I submit the ".sh" file again, the code is read properly. This happens "randomly", and never in the same section of the code.
It seems to be related to the length of the code.
Any help?
Thanks
EDIT 1:
As an example, the output in the Slurm log file is
[1] "Data.Rdata"
Loading required package: MASS
##
## Markov Chain Monte Carlo Package (MCMCpack)
## Copyright (C) 2003-2020 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
##
## Support provided by the U.S. National Science Foundation
## (Grants SES-0350646 and SES-0350613)
##
Loading required package: stats4
null device
1
Error: unexpected symbol in:
" Beta0 = rep(-4,3),
Betagonale Psi"
Execution halted
srun: error: node02: task 0: Exited with exit code 1
and the code is
priors = list(
Beta0 = list(
type = "Normal",
Par1 = rep(-4,3),
Par2 = rep(10,3)
),
Beta1 = list(
type = "Normal",
Par1 = rep(1.8,3),
Par2 = rep(10,3)
),
VarK = list(
type = "TruncatedNormal",
Par1 = rep(0,3),
Par2 = rep(100,3),
Par3 = rep(0.0000000,3),
Par4 = rep(LimSigma2,3),
Par5 = rep(2,3)
#Par5 = rep(2,3)
),
RegCoef = list(
type = "Normal",
Par1 = c(0,0,0,0,0), ## (1 o stessa dimension)
Par2 = rep(100,5)
),
sigmaMat = list(
type = "InverseWishart",
Par1 = rep(10,3),
Par2 = c(diag(1,5)) ## diagonale Psi
),
DPpar = list(
type = "Gamma",
Par1 = 1,
Par2 = 1 ## diagonale Psi
)
)
The symptom described here, a file stored on an NFS server appearing corrupt when read, is most often associated with a race condition on the file. Typically the file is open for writing on one NFS client (the login node) and open for reading on another client (a compute node). Since NFS has no global locking mechanism, the client reading the file does not know that the file is being written. With advanced editors that support auto-save, the file can end up on disk in an inconsistent state, for instance in the middle of a copy/paste operation.
One option in that scenario is to avoid modifying the file at all while jobs are queued or running, or at least to deactivate auto-save in your editor.
Another option is to make a copy of the file before the job is submitted, so that later edits cannot affect the copy the job reads.
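The copy-before-submit idea can be automated with a small wrapper around sbatch. This is only a sketch: the snapshot() helper and its naming scheme are illustrative, not part of the original setup.

```shell
#!/bin/bash
# Hypothetical helper: copy the script to a unique, job-private name before
# submitting, so later edits on the login node cannot corrupt what the job reads.
snapshot() {
    local src="$1"
    # unique name: original stem + epoch seconds + this shell's PID
    local snap="${src%.R}_$(date +%s)_$$.R"
    cp -- "$src" "$snap" && echo "$snap"
}

# In a submission wrapper you would then run, for example:
#   SNAP=$(snapshot TEST.R)
#   sbatch --wrap "module load statistics/r-3.6.1; srun Rscript $SNAP"
```

The key point is that the job references the snapshot, never the file you keep open in your editor, so an auto-save mid-edit can no longer be seen by the compute node.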