
R code on a Slurm cluster is not read properly

I'm running an R script on a Slurm cluster with the following ".sh" file:

#!/bin/bash
#SBATCH --partition=p_parallel
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --workdir=/work/uder2/ODE/lancio/
module load statistics/r-3.6.1
srun Rscript   TEST.R

The R code is quite simple. Something like

DIRbase     = "/work/uder2/ODE/"
DIRdata     = paste(DIRbase,"data/",sep="")
list.files(DIRdata)
load(paste(DIRdata,"Data.Rdata",sep=""))


NAME = "PriorU" 
ialg = 3

nG  = 500  
LimEta = 40  

LimMu2  = 15 
LimMin = 500

LimMu = 0.1
LimSpike = 10
LimSigma2 = (8)^2/(-2*log(LimMu))*1.2


NAME = paste(NAME,"_ng",nG, sep="")

### ### ### ### ### ### ### ### 
### MODELS
### ### ### ### ### ### ### ### 

DATA = allGenesData
nrowData = nrow(DATA$premature)


sd1 = as.numeric(apply(DATA$premature,1,var))
sd2 = as.numeric(apply(DATA$mature,1,var))
sd3 = as.numeric(apply(DATA$nascent,1,var))

epsi = 0.000001
App = c(which(sd1<=epsi),which(sd2<=epsi),which(sd3<=epsi))
App2 = c(which(sd1>50),which(sd2>100000),which(sd3>1500))

minep = 0.1
xy1 = as.numeric(apply(DATA$premature,1,min))
xy2 = as.numeric(apply(DATA$mature,1,min))
xy3 = as.numeric(apply(DATA$nascent,1,min))
App3 = c(which(xy1<=minep),which(xy2<=minep),which(xy3<=minep))

In actuality, the code is much longer, but I don't think the content of the file is important.

What is happening is that, sometimes, the code is not read properly. For example, instead of

App3 = c(which(xy1<=minep),which(xy2<=minep),which(xy3<=minep))

R reads

App3  which(xy1<=minep),which(xy2<=minep),which(xy3<=minep))

Then, if I launch the ".sh" file again without touching the code, the code is read properly. This happens "randomly", and never in the same section of the code.

It seems to be related to the length of the code.

Any help?

Thanks

EDIT 1:

As an example, the content of one Slurm output file is

[1] "Data.Rdata"
Loading required package: MASS
##
## Markov Chain Monte Carlo Package (MCMCpack)
## Copyright (C) 2003-2020 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
##
## Support provided by the U.S. National Science Foundation
## (Grants SES-0350646 and SES-0350613)
##
Loading required package: stats4
null device 
          1 
Error: unexpected symbol in:
"      Beta0   = rep(-4,3),
      Betagonale Psi"
Execution halted
srun: error: node02: task 0: Exited with exit code 1

and the code is

priors  = list(
     Beta0 = list(
         type        = "Normal",
         Par1        = rep(-4,3),
         Par2        = rep(10,3)
       ),
       Beta1 = list(
         type        = "Normal",
         Par1        = rep(1.8,3), 
         Par2        = rep(10,3)
       ),
      VarK   = list(
        type        = "TruncatedNormal",
        Par1        = rep(0,3),
        Par2        = rep(100,3),
        Par3        = rep(0.0000000,3),
        Par4        = rep(LimSigma2,3), 
        Par5        = rep(2,3)
        #Par5        = rep(2,3)
      ), 
      RegCoef = list(
          type        = "Normal",
          Par1        = c(0,0,0,0,0), ## (1 o stessa dimension)
          Par2        = rep(100,5)
      ),
      sigmaMat = list(
          type        = "InverseWishart",
          Par1        = rep(10,3), 
          Par2        = c(diag(1,5)) ## diagonale Psi
      ),

      DPpar = list(
          type        = "Gamma",
          Par1        = 1, 
          Par2        = 1 ## diagonale Psi
      )
    ) 

The symptom described here, a file stored on an NFS server that appears corrupted when read, is most often caused by a race condition on the file. Typically the file is open for writing on one NFS client (the login node) and open for reading on another client (a compute node). As NFS applies no global lock to ordinary reads and writes, the client that is reading the file does not know that the file is being written. With advanced editors that support auto-save, the file can sometimes be written to disk in an inconsistent state, for instance in the middle of a copy/paste operation.
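One way to check whether this is what happens is to compare a checksum of the script taken on the login node at submission time with one computed on the compute node just before Rscript runs. The following is only a minimal sketch: the function name verify_checksum is hypothetical, and it assumes sha256sum (GNU coreutils) is available on both nodes.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or R): succeed only if the file's
# current SHA-256 digest equals the expected digest given as second argument.
verify_checksum() {
    local file="$1"
    local expected="$2"
    local actual
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual="$(sha256sum -- "$file" | cut -d' ' -f1)" || return 1
    [ "$actual" = "$expected" ]
}
```

In the job script this could be called before the srun line, for example as verify_checksum TEST.R "$EXPECTED_SUM" || echo "TEST.R changed since submission" >&2, where EXPECTED_SUM is an assumed variable holding the digest recorded at submission. A mismatch only shows that the compute node sees different content than the login node did; it does not prevent the race by itself.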

One option in that scenario is to avoid modifying the file at all while jobs are pending or running, or at least to disable auto-save.

Another option is to make a copy of the file before the job is submitted and have the job run that copy, so that the script the job reads is not updated afterwards.
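The copy approach could look roughly like the sketch below. The function name snapshot_script is hypothetical, and the sketch assumes the copy is made after the editor has finished saving the file.

```shell
#!/bin/bash
# Hypothetical helper: copy a script to a uniquely named snapshot file and
# print the snapshot's path. By default the snapshot is created next to the
# original; an optional second argument selects another directory.
snapshot_script() {
    local src="$1"
    local dest_dir="${2:-$(dirname "$src")}"
    local base snap
    base="$(basename "$src" .R)"
    # mktemp creates an empty, uniquely named file, so concurrent
    # submissions do not overwrite each other's snapshots.
    snap="$(mktemp "${dest_dir}/${base}.XXXXXX")" || return 1
    cp -- "$src" "$snap" || return 1
    printf '%s\n' "$snap"
}
```

A possible use, assuming the job script is changed to run srun Rscript "$R_SCRIPT" instead of a fixed file name (R_SCRIPT and job.sh are names chosen here for illustration): snap="$(snapshot_script TEST.R)" followed by sbatch --export=ALL,R_SCRIPT="$snap" job.sh. The original TEST.R can then be edited freely while the job is queued or running.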
