
zmq.error.ZMQError: Address already in use, when running multiprocessing with multiple notebooks using papermill

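I am using the papermill library to run multiple notebooks simultaneously with multiprocessing.

This is happening on Python 3.6.6, Red Hat 4.8.2-15, inside a Docker container.

However, when I run the Python script, about 5% of my notebooks fail immediately (no Jupyter Notebook cells run) because I receive this error: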

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-124>", line 2, in initialize
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 469, in initialize
    self.init_sockets()
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

Along with:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "main.py", line 77, in run_papermill
    pm.execute_notebook(notebook, output_path, parameters=config)
  File "/opt/conda/lib/python3.6/site-packages/papermill/execute.py", line 104, in execute_notebook
    **engine_kwargs
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 304, in execute_notebook
    nb = cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 372, in execute_managed_notebook
    preprocessor.preprocess(nb_man, safe_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/preprocess.py", line 20, in preprocess
    with self.setup_preprocessor(nb_man.nb, resources, km=km):
  File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
    self.km, self.kc = self.start_new_kernel(**kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
    kc.wait_for_ready(timeout=self.startup_timeout)
  File "/opt/conda/lib/python3.6/site-packages/jupyter_client/blocking/client.py", line 104, in wait_for_ready
    raise RuntimeError('Kernel died before replying to kernel_info')
RuntimeError: Kernel died before replying to kernel_info

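Please help me with this problem. I have scoured the web and tried different solutions, but none has worked for my case so far.

The 5% error rate occurs regardless of the number of notebooks I run simultaneously or the number of cores on my machine, which makes it all the more curious.

I have tried changing the start method and updating the libraries, but to no avail.

The versions of my libraries are: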

papermill==1.2.1
ipython==7.14.0
jupyter-client==6.1.3

Thanks!

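The clear problem is that ZeroMQ was unable to successfully .bind()

The error message zmq.error.ZMQError: Address already in use is the easier one to explain. ZeroMQ access points can, for obvious reasons, freely .connect() to many counterparts, but only one of them can .bind() to a particular transport-class address target.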
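The one-owner rule for .bind() can be reproduced without ZeroMQ. The following is a minimal sketch, assuming that a tcp:// endpoint ends up as an ordinary TCP socket underneath, so plain standard-library sockets on the loopback interface hit the same EADDRINUSE. The helper name second_bind_errno is made up for this illustration.

```python
import errno
import socket


def second_bind_errno():
    """Bind one TCP socket, then try to bind a second one to the same address.

    Returns the errno of the second bind's failure, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        first.listen(1)
        address = first.getsockname()  # the (host, port) that was actually bound
        try:
            second.bind(address)       # same address, second would-be owner
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    # Compare against the symbolic constant rather than a platform-specific number.
    print(second_bind_errno() == errno.EADDRINUSE)
```

The first socket owns the address for as long as it stays open. That is the situation a kernel finds itself in when its shell port has already been taken by something else.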

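There are three potential reasons why this happens:

1) Some code (whose internals you may not know) gets called accidentally through { multiprocessing.Process | joblib.Parallel | Docker-wrapped | ... }-spawned replicas, each of which tries to take ownership of the same ZeroMQ transport-class address. For obvious reasons, every attempt after the first successful one will fail.

2) A rather fatal situation in which some "previously" run process did not manage to release such a transport-class-specific address for further use (keep in mind that ZeroMQ may be just one of several candidates interested in that address, which would be a configuration-management flaw), or in which a previous run failed to terminate its resource usage gracefully and left a Context() instance still waiting (in some cases indefinitely, until an O/S reboot) to hear something that will never arrive.

3) Genuinely bad engineering practice in the module's software design: not handling the EADDRINUSE error/exception documented in the ZeroMQ API, and instead letting the whole circus crash (at whatever cost that brings).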
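Point 3 can be illustrated with a small sketch of what handling EADDRINUSE, rather than crashing on it, might look like. This is not how ipykernel binds its sockets. It uses plain standard-library sockets, and the helper name bind_first_free and the candidate-port list are hypothetical. With pyzmq the analogous check would, as far as I know, compare the errno attribute of the raised ZMQError against zmq.EADDRINUSE.

```python
import errno
import socket


def bind_first_free(sock, host, candidate_ports):
    """Bind sock to the first candidate port that is not already taken.

    Only EADDRINUSE means "try the next candidate"; any other error
    propagates unchanged. Returns the port that was actually bound.
    """
    for port in candidate_ports:
        try:
            sock.bind((host, port))
            return sock.getsockname()[1]
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise
    raise OSError(errno.EADDRINUSE, "all candidate ports are in use")


if __name__ == "__main__":
    # Occupy one port, then offer it first so the fallback has to kick in.
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))
    holder.listen(1)
    taken = holder.getsockname()[1]

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    chosen = bind_first_free(sock, "127.0.0.1", [taken, 0])
    print(chosen != taken)
    sock.close()
    holder.close()
```

Catching only the one documented error code keeps genuine faults, such as a permission error or an invalid address, visible instead of silently retrying them.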


The other error message, RuntimeError: Kernel died before replying to kernel_info, relates to a state in which the notebook's kernel spent so long trying to establish all internal connections with its own components (pool peers) that the wait exceeded a configured or hardcoded timeout. The kernel process simply stopped waiting and threw itself into the otherwise unhandled exception that you observed and reported.

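Solution

First check for any hanging address owners, and reboot all nodes if in doubt about that. Next, verify that there are no colliding attempts "hidden" inside your own code / { multiprocessing.Process() | joblib.Parallel() | ... the likes } calls that may try to .bind() to the same target once distributed. If none of these steps resolves the trouble within your domain of control, ask the module's user support to analyse the case and help you refactor and validate the use case that still collides.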
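The first check can be approximated from Python. This is a rough probe, assuming you already know which host and port you suspect. The function name port_is_free is invented for this sketch, and it only tests whether a TCP bind would succeed at this moment.

```python
import errno
import socket


def port_is_free(host, port):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False  # some other socket still owns the address
        raise
    finally:
        probe.close()     # release the address again in either case


if __name__ == "__main__":
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))
    holder.listen(1)
    port = holder.getsockname()[1]
    print(port_is_free("127.0.0.1", port))  # probed while holder still owns it
    holder.close()
```

The answer is only a snapshot: another process can claim the port right after the probe closes. It helps to find a lingering owner, but it does not guarantee that a later .bind() will succeed.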

