Slurm job array fails to run Rscript with shapefiles
I would like to run a job array via Slurm on an HPC cluster, intersecting individual circle shapefiles with a large shapefile of Census blocks, then saving the resulting intersection shapefile. I will then combine these individual shapefiles into one large one on my own machine. This is a way to avoid the parallelization problems I describe in an earlier question: mapply error on list from sf (simple features) object in R
However, when running the job array, I receive the following error:
sbatch: error: Batch job submission failed: Invalid job array specification
Here is a link to the R script, .sh file, and filename csv I am using on my HPC cluster: https://github.com/msghankinson/slurm_job_array
The R code relies on 3 files:
I've run the R code on specific, individual buffer and lihtc shapefiles and the function works. So my main focus is the .sh file launching the job array ("lihtc_array_example.sh").
Here, I am trying to run my R script on each "buffer" shapefile, using the task ID and "master_example.csv" (also in the reprex) to define which files are loaded into R. Each row of master_example.csv contains the buffer filename and the lihtc filename I need. These filenames need to be passed to the R script and used to load the correct files for each intersection. E.g., task 1 loads the files listed in row 1 of master_example.csv. The code I found tries to pull these names in the .sh file via:
shp_filename=$( echo "$line_N" | cut -d "," -f 2 )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 )
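This extraction can be sanity-checked outside Slurm by setting the task ID by hand against a throwaway CSV (the two rows below are made-up examples, not the real filenames):

```shell
#!/bin/bash
# Dry run of the task-ID -> CSV-row lookup, with SLURM_ARRAY_TASK_ID set
# by hand instead of by Slurm. The rows here are invented for illustration.
printf '1,buffer_1.shp,lihtc_1.shp\n2,buffer_2.shp,lihtc_2.shp\n' > master_example.csv

SLURM_ARRAY_TASK_ID=2
line_N=$( awk "NR==$SLURM_ARRAY_TASK_ID" master_example.csv )
shp_filename=$( echo "$line_N" | cut -d "," -f 2 )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 )
echo "$shp_filename $lihtc_filename"   # expect: buffer_2.shp lihtc_2.shp
```

If this prints the wrong fields, the problem is the CSV layout (column order, or a header row shifting NR by one), not Slurm.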
While I understand that it is difficult to run the reprex, I would like to know if there are any clear breakdowns in the pipeline between the .sh file, the csv of names, and the R script. I am happy to provide any additional information which may be helpful.
Full .sh file, for ease of access:
#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -o jobArrayScript_%A_%a.out
#SBATCH -e jobArrayScript_%A_%a.err
#SBATCH -a 1-3086%1000
line_N=$( awk "NR==$SLURM_ARRAY_TASK_ID" master_example.csv ) # NR means row-# in Awk
shp_filename=$( echo "$line_N" | cut -d "," -f 2 )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 )
module load R/4.1.1
module load libudunits2/2.2.28
module load gdal/3.5.0
module load proj/6.3.0
module load geos/3.10.3
Rscript slurm_job_array.R $shp_filename $lihtc_filename
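One fragile spot in that last line: $shp_filename and $lihtc_filename are expanded unquoted, so a filename containing a space would be word-split into two arguments before Rscript ever sees it. A small sketch of the difference (the filename here is hypothetical):

```shell
#!/bin/bash
# How word splitting changes the argument count an unquoted expansion produces.
shp_filename="buffer 1.shp"        # hypothetical filename containing a space

set -- $shp_filename               # unquoted: split on the space
unquoted_argc=$#                   # 2 arguments

set -- "$shp_filename"             # quoted: one intact argument
quoted_argc=$#                     # 1 argument

echo "$unquoted_argc $quoted_argc" # expect: 2 1
```

Quoting the expansions in the submission line, i.e. Rscript slurm_job_array.R "$shp_filename" "$lihtc_filename", guards against this.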
For reference:
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.9 ggmap_3.0.0 ggplot2_3.3.6 sf_1.0-7
loaded via a namespace (and not attached):
[1] xfun_0.28 tidyselect_1.1.2 purrr_0.3.4 lattice_0.20-45 colorspace_2.0-3 vctrs_0.4.1 generics_0.1.2
[8] htmltools_0.5.2 s2_1.0.7 utf8_1.2.2 rlang_1.0.2 e1071_1.7-9 pillar_1.7.0 glue_1.6.2
[15] withr_2.5.0 DBI_1.1.1 sp_1.4-6 wk_0.5.0 jpeg_0.1-9 lifecycle_1.0.1 plyr_1.8.7
[22] stringr_1.4.0 munsell_0.5.0 gtable_0.3.0 RgoogleMaps_1.4.5.3 evaluate_0.15 knitr_1.36 fastmap_1.1.0
[29] curl_4.3.2 class_7.3-19 fansi_1.0.3 highr_0.9 Rcpp_1.0.8.3 KernSmooth_2.23-20 scales_1.2.0
[36] classInt_0.4-3 farver_2.1.0 rjson_0.2.20 png_0.1-7 digest_0.6.29 stringi_1.7.6 grid_4.1.0
[43] cli_3.3.0 tools_4.1.0 bitops_1.0-7 magrittr_2.0.3 proxy_0.4-26 tibble_3.1.7 crayon_1.5.1
[50] tidyr_1.2.0 pkgconfig_2.0.3 ellipsis_0.3.2 assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2 rstudioapi_0.13
[57] R6_2.5.1 units_0.7-2 compiler_4.1.0
3 problems identified and now solved:
1. Max array size refers to the entire array; the throttle just sets how many jobs get scheduled at one time. So I needed to break my 3,086-job task into 4 separate batches. This can be done in the .sh file as:
#SBATCH -a 1-999
for job 1,
#SBATCH -a 1000-1999
for job 2, and so on.
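The batch ranges can also be computed rather than written out by hand. A sketch, assuming a per-submission cap of 999 indices (the actual limit is whatever MaxArraySize reports in `scontrol show config` on the cluster); each printed range would go into its own sbatch submission:

```shell
#!/bin/bash
# Compute the --array ranges needed to cover 3,086 tasks in chunks of
# at most 999 indices. Each printed range belongs in its own sbatch job.
total=3086
step=999
start=1
ranges=""
while [ "$start" -le "$total" ]; do
  end=$(( start + step - 1 ))
  if [ "$end" -gt "$total" ]; then end=$total; fi
  echo "${start}-${end}"                 # e.g. pass to: sbatch --array=...
  ranges="$ranges ${start}-${end}"
  start=$(( end + 1 ))
done
# expect: 1-999  1000-1998  1999-2997  2998-3086
```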
2. The R script needs to catch the arguments from the command line. The script now begins:
args = commandArgs(trailingOnly = TRUE)
shp_filename <- args[1]
lihtc_filename <- args[2]
3. The submission file was sending arguments with quotation marks, which was preventing paste0 from creating usable file names. Neither noquote() nor print(x, quote = FALSE) was able to remove these quotes. However, gsub('"', '', x) worked.
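An alternative (not what was done here, just a sketch) is to strip the stray quotes on the shell side, so the quoted fields never reach R and the gsub workaround becomes unnecessary. The CSV row below is made up for illustration:

```shell
#!/bin/bash
# Strip literal double quotes from the CSV fields in the .sh file itself,
# for the case where the CSV was written with quoted fields.
line_N='2,"buffer_2.shp","lihtc_2.shp"'   # invented example row
shp_filename=$( echo "$line_N" | cut -d "," -f 2 | tr -d '"' )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 | tr -d '"' )
echo "$shp_filename $lihtc_filename"      # expect: buffer_2.shp lihtc_2.shp
```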
An inelegant/lazy parallelization on my part, but it works. Case closed.