Multiple instances of Python running simultaneously limited to 35

I am running a Python 3.6 script as multiple separate processes on different processors of a parallel computing cluster. Up to 35 processes run simultaneously with no problem, but the 36th (and any beyond it) crashes with a segmentation fault on the second line, which is import pandas as pd. Interestingly, the first line, import os, does not cause an issue. The full error message is:

OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
Traceback (most recent call last):
  File "/home/.../myscript.py", line 32, in <module>
    import pandas as pd
  File "/home/.../python_venv2/lib/python3.6/site-packages/pandas/__init__.py", line 13, in <module>
    __import__(dependency)
  File "/home/.../python_venv2/lib/python3.6/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
  File "/home/.../python_venv2/lib/python3.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/home/.../python_venv2/lib/python3.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/home/.../python_venv2/lib/python3.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/home/.../python_venv2/lib/python3.6/site-packages/numpy/core/__init__.py", line 16, in <module>
    from . import multiarray
SystemError: initialization of multiarray raised unreported exception
/var/spool/slurmd/job04590/slurm_script: line 11: 26963 Segmentation fault      python /home/.../myscript.py -x 38

Pandas and a few other packages are installed in a virtual environment. I have duplicated the virtual environment so that no more than 24 processes run in each venv. For example, the error message above came from a script running in the virtual environment called python_venv2.

The problem occurs on the 36th process every time, regardless of how many of the processes are importing from a particular instance of Pandas. (I am not even making a dent in the capacity of the parallel computing cluster.)

So, if it is not a restriction on the number of processes accessing Pandas, is it a restriction on the number of processes running Python? Why is 35 the limit?

Is it possible to install multiple copies of Python on the machine (in separate virtual environments?) so that I can run more than 35 processes?

Decomposing the Error Message

Your error message includes the following hint:

OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max

RLIMIT_NPROC limits the total number of processes a user can have; on Linux, threads count toward it as well. More specifically, it is a per-process setting: when a process calls fork(), clone(), vfork(), etc., the RLIMIT_NPROC value of that process is compared with the total count for the user who owns that process. If the new process or thread would exceed that value, the call fails, which is what you've experienced.
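To see the limit that applies to a given process, one option is Python's standard resource module (available on Unix-like systems). The sketch below is only an illustration; the helper names nproc_limits and describe are made up for this example. It prints the soft and hard values in the same style as the OpenBLAS message.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft, hard


def describe(value):
    # RLIM_INFINITY means that no limit is enforced.
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print(f"RLIMIT_NPROC {describe(soft)} current, {describe(hard)} max")
```

Running this inside the batch job itself, rather than on a login node, shows the values the job actually sees, since the two can differ.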

The error message indicates that OpenBLAS was unable to create additional threads because your user had already used up all the threads that RLIMIT_NPROC allows.

Since you're running on a cluster, it's unlikely that your user is running many other threads (unlike, say, on a personal machine where you'd be browsing the web, playing music, etc.), so it's reasonable to conclude that OpenBLAS itself is trying to start many threads.

How OpenBLAS Uses Threads

OpenBLAS can use multiple threads to accelerate linear algebra. You may want many threads for solving a single, larger problem quickly. You may want fewer threads for solving many smaller problems simultaneously.

OpenBLAS has several ways to limit the number of threads it uses. These are controlled via environment variables:

export OPENBLAS_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OMP_NUM_THREADS=4

The priorities are OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS. (I think this means that OPENBLAS_NUM_THREADS overrides OMP_NUM_THREADS; however, OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.)

If none of the foregoing variables is set, OpenBLAS will run using a number of threads equal to the number of cores on your machine (32 in your case).
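The precedence described above can be written down as a small model. This is only a sketch of the rule as stated here, not OpenBLAS's actual code: the function name expected_blas_threads is hypothetical, and it deliberately ignores the USE_OPENMP=1 case as well as any upper bound compiled into a particular OpenBLAS build.

```python
import os

# Precedence as stated in the text: the first variable that is set wins.
PRIORITY = ("OPENBLAS_NUM_THREADS", "GOTO_NUM_THREADS", "OMP_NUM_THREADS")


def expected_blas_threads(env, n_cores):
    """Thread count the text's rule predicts for a given environment.

    env is any mapping of variable names to strings (e.g. os.environ);
    n_cores is the fallback used when none of the variables is set.
    """
    for name in PRIORITY:
        value = env.get(name, "").strip()
        if value.isdigit() and int(value) > 0:
            return int(value)
    return n_cores


if __name__ == "__main__":
    print(expected_blas_threads(os.environ, os.cpu_count() or 1))
```

With an empty environment and 32 cores the model falls back to 32; setting only OMP_NUM_THREADS=4 gives 4, and additionally setting OPENBLAS_NUM_THREADS=1 gives 1.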

Your Situation

Your cluster has 32-core CPUs. You're trying to run 36 instances of Python. Each instance requires 1 thread for Python + 32 threads for OpenBLAS. You'll also need 1 thread for your SSH connection and 1 thread for your shell. That means you need 36*(32+1)+2=1190 threads, which is more than the 1024 that RLIMIT_NPROC currently allows.

The nuclear option for fixing the problem is to use:

export OPENBLAS_NUM_THREADS=1

which should bring you down to 36*(1+1)+2=74 threads.
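The two thread counts above follow from the same formula, so they are easy to reproduce. The sketch below just restates that arithmetic; threads_needed is a hypothetical helper name, and SOFT_LIMIT is the "1024 current" value taken from the error message.

```python
def threads_needed(n_python, blas_threads, overhead=2):
    # Each Python instance: 1 interpreter thread + blas_threads OpenBLAS threads.
    # overhead covers the SSH connection and the shell, as counted in the text.
    return n_python * (blas_threads + 1) + overhead


SOFT_LIMIT = 1024  # "RLIMIT_NPROC 1024 current" from the error message

for blas_threads in (32, 1):
    total = threads_needed(36, blas_threads)
    verdict = "over" if total > SOFT_LIMIT else "within"
    print(f"{blas_threads:>2} BLAS threads: {total} total, {verdict} the limit")
```

This prints 1190 (over the limit) for 32 OpenBLAS threads and 74 (within the limit) for 1.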

Since you have spare capacity, you could adjust OPENBLAS_NUM_THREADS to a higher value, but then the OpenBLAS instances owned by your separate Python processes will interfere with each other. So there's a trade-off between how fast you get one solution and how fast you can get many solutions. Ideally, you can resolve this trade-off by running fewer Pythons per node and using more nodes.
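If you would rather set the variable from inside the script than in the job file, one possible approach is sketched below. The helper names and the example of 8 processes per node are assumptions made for illustration; the even split of cores is just one simple policy, not a recommendation from OpenBLAS. The important detail is ordering: the variable has to be set before pandas or numpy is first imported, because the thread pool is created when the library loads (that is where blas_thread_init appears in your error output).

```python
import os


def blas_threads_per_process(n_cores, n_processes):
    """Split the cores evenly, never going below one thread per process."""
    return max(1, n_cores // n_processes)


def configure_openblas(n_cores, n_processes):
    """Set OPENBLAS_NUM_THREADS for this process and return the chosen value."""
    n = blas_threads_per_process(n_cores, n_processes)
    # Must run before numpy/pandas are first imported in this process.
    os.environ["OPENBLAS_NUM_THREADS"] = str(n)
    return n


# Hypothetical example: 8 Python processes sharing one 32-core node.
configure_openblas(n_cores=32, n_processes=8)
# import pandas as pd  # import only after the variable has been set
```

With 36 processes on a 32-core node the same helper falls back to one thread each, which matches the export OPENBLAS_NUM_THREADS=1 setting above.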
