
What does it mean for a slurm job to crash with `bus error`?

When running a Python script inside a shell started with `srun --pty bash` under slurm, I get a cryptic error message: `Bus error: core dumped`.

I searched the slurm documentation and it doesn't mention this error type.

What's going on and how can I fix it?

I found this general information on bus errors, but it doesn't explain how and why one happens in a SLURM environment or what can be done to avoid it: What is a bus error? Is it different from a segmentation fault?

In at least one case, this was probably due to my job requiring too much memory and thus getting killed by SLURM.

When I ran the same job that crashed with a bus error directly on a worker node, it got killed after claiming >30GB.
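To check how much memory a script really needs, one option is to log its peak resident set size from inside Python. This is a minimal sketch using the standard-library `resource` module; the helper name `peak_rss_mib` and the list allocation are illustrative stand-ins, not part of the original job.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KiB, macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Stand-in for the real workload (hypothetical allocation).
    data = [0] * 5_000_000
    print(f"peak RSS: {peak_rss_mib():.0f} MiB")
```

Calling the helper at the end of a run that completes (for example on a smaller input) gives a rough lower bound for the memory to request from the scheduler.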

Helpful answer from Ben Evans on the Yale cluster Discourse that may apply more generally to other clusters:

On the Yale clusters, a bus error usually means your job ran out of memory (RAM). If you cannot reduce the memory usage of your code, you can request additional memory for your job using the `--mem-per-cpu` or `--mem` Slurm flags.

More details: Your program can run into this fault because of the way we manage memory with cgroups so that many jobs can be run on the same physical machine without interfering with one another. If a process inside a job tries to access memory “outside” what was allocated to that job, e.g. more than what you requested, the operating system tells your program that address is invalid with the fault Bus Error, aka SIGBUS, exit(10). A similar fault you might be more familiar with is a Segmentation Fault, aka SIGSEGV, exit(11) which usually results from a program incorrectly trying to access a valid memory address.

https://ask.cyberinfrastructure.org/t/what-does-it-mean-when-i-get-a-bus-error-in-my-job/1101/2
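To confirm which signal actually ended a process, the exit status can be mapped back to a signal name. The sketch below is an illustration, not part of the quoted answer: the helper name `describe_exit` is hypothetical, and the child process that sends itself SIGBUS merely stands in for the crashing job. Signal numbers are platform-dependent (SIGBUS is 7 on x86 Linux and 10 on macOS and the BSDs), so the code looks the name up instead of hard-coding a number.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Return the name of the signal that ended a process, or None."""
    if returncode < 0:
        # subprocess convention: -N means "killed by signal N".
        signum = -returncode
    elif returncode > 128:
        # Shell convention: an exit status of 128 + N means signal N.
        signum = returncode - 128
    else:
        return None
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Not a known signal number on this platform.
        return None


if __name__ == "__main__":
    # Hypothetical child that sends itself SIGBUS, standing in for the job.
    child = "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"
    result = subprocess.run([sys.executable, "-c", child])
    print(result.returncode, describe_exit(result.returncode))
```

The same helper can be applied to the exit status a shell reports after the crash (`echo $?`), which uses the 128 + N convention.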
