Debugging R code when using Slurm

Question

I am running simulations in R on a cluster. Each R file contains 100 models, and each model analyses a different data set. The cluster commands are in a slurm file, shown below.

A small percentage of the models apparently do not converge well enough to estimate the Hessian, and an error is generated for these models. The errors are written to an error log file. However, I cannot determine from the parameter estimates, the error log file and the output log file which of the 100 models are generating the errors.

Here is an example of an error message:

Error in chol.default(fit$hessian) : 
  the leading minor of order 3 is not positive definite
Calls: chol2inv -> chol -> chol.default

Parameter estimates are returned despite these errors. Some SEs are huge, but I think the SEs can sometimes be large even when no error message is returned.

Is it possible to include an additional line in my slurm file below that will generate a log file containing both the errors and the rest of the output, with the errors shown in their original location (for example, the location in which they are shown on my Windows laptop)? That way I would be able to determine quickly which models were generating the errors by looking at the log file. I have been trying to think of a work-around, but have not been able to come up with anything so far.

Here is a slurm file:

#!/bin/bash
#SBATCH -J JS_N200_301_400_Oct31_17c.R
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 2000
#SBATCH -p community.q
#SBATCH -o JS_N200_301_400_Oct31_17c.out
#SBATCH -e JS_N200_301_400_Oct31_17c.err
#SBATCH --mail-user markwm@myuniversity.edu
#SBATCH --mail-type ALL
Rscript JS_N200_301_400_Oct31_17c.R

Answer 1

Not sure if this is what you want, but R's error option lets you control what happens on errors that you do not otherwise catch. For instance, setting

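options(error = function() {
  # (a) print the call stack that led to the error
  traceback(2L)
  # (b) save the frames to last.dump.rda; the utils:: prefix is needed in
  #     .Rprofile, where only the base package is loaded
  utils::dump.frames(dumpto = "last.dump", to.file = TRUE)
})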

at the beginning of your *.R script, or in a .Rprofile startup script, will (a) output the traceback if there is an error and, more importantly, (b) dump the call stack to the file last.dump.rda, which you can load in a fresh R session as:

dump <- get(load("last.dump.rda"))

Note that get(load(...)) is not a mistake: load() returns the name of the restored object as a string, and get() then retrieves the object by that name. Here dump is an object of class dump.frames, which lets you inspect the call stack and its contents.
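As a rough illustration of that round trip, the sketch below dumps the frames of a failing call to a file and reads them back. The functions f and g and the name demo.dump are made up for the example, and the error is wrapped in try() only so that the sketch runs to the end in one session; it is not the asker's code.

```r
# Hypothetical functions standing in for a model fit that fails.
g <- function(x) stop("model did not converge")
f <- function(x) g(x)

# Dump the frames at the moment of the error, then let try() swallow the
# error so the script keeps running. "demo.dump" is an illustrative name;
# the file demo.dump.rda is written to the current working directory.
try(
  withCallingHandlers(
    f(1),
    error = function(e) utils::dump.frames(dumpto = "demo.dump", to.file = TRUE)
  ),
  silent = TRUE
)

# Later, possibly in a fresh session: load() returns the object's name,
# get() fetches the object itself.
dump <- get(load("demo.dump.rda"))
class(dump)   # "dump.frames"
names(dump)   # one label per call that was on the stack
```

In an interactive session, utils::debugger(dump) can then be used to browse the saved frames.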

You can of course customize the error option to do other things.
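One possible customization, sketched here under the assumption that the script records which model it is currently fitting in a global variable, is a handler that names that model before printing the traceback. Both current_model and report_model_error are hypothetical names introduced for this example.

```r
# Hypothetical global that the simulation loop would update before each fit,
# e.g. current_model <- i
current_model <- NA_integer_

# Error handler: say which model was being fitted, then show the call stack.
# cat() writes to standard output, so the line ends up in the Slurm .out file.
report_model_error <- function() {
  cat("Error occurred while fitting model", current_model, "\n")
  traceback(2L)
}

options(error = report_model_error)
```

The handler only reports; whether the job should stop after the first failure or carry on with the remaining models is a separate decision.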

Answer 2

I learned from the IT person in charge of the cluster that I can have the error messages added to the output log simply by removing the reference to the error log from the slurm file (without -e, Slurm sends standard error to the same file as standard output). See below. It seems to be good enough.

I also plan to print the model number to the log at the beginning and end of each model's output for added clarity (which I should have been doing from the start).

#!/bin/bash
#SBATCH -J JS_N200_301_400_Oct31_17c.R
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 2000
#SBATCH -p community.q
#SBATCH -o JS_N200_301_400_Oct31_17c.out
#SBATCH --mail-user markwm@myuniversity.edu
#SBATCH --mail-type ALL
Rscript JS_N200_301_400_Oct31_17c.R
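
The plan of printing the model number around each model's output could look roughly like the R sketch below. It is only an illustration: invert_hessian and the two small matrices are stand-ins for the real model fits, and tryCatch() is an addition that is not part of the original answer. It catches the chol() error, prints it next to the model number, and lets the loop continue.

```r
# Hypothetical stand-in for one model fit: inverts a Hessian via its
# Cholesky factor, as in the chol2inv -> chol call chain of the error above.
invert_hessian <- function(hessian) chol2inv(chol(hessian))

# Illustrative inputs, not real simulation output.
hessians <- list(
  diag(2),                          # positive definite: succeeds
  matrix(c(1, 2, 2, 1), nrow = 2)   # not positive definite: chol() fails
)

failed <- integer(0)
for (i in seq_along(hessians)) {
  cat("=== model", i, "start ===\n")
  result <- tryCatch(
    invert_hessian(hessians[[i]]),
    error = function(e) {
      # Print the error on standard output, right where it happened.
      cat("ERROR in model", i, ":", conditionMessage(e), "\n")
      NULL
    }
  )
  if (is.null(result)) failed <- c(failed, i)
  cat("=== model", i, "end ===\n")
}
cat("Models with errors:", failed, "\n")
```

Because every message goes through cat(), the errors appear in the .out file between the start and end markers of the model that produced them, and failed collects the indices of the models that need a closer look.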
