
Hung cells: running multiple jupyter notebooks in parallel with papermill

I am trying to run Jupyter notebooks in parallel by starting them from another notebook. I'm using papermill to save the output from the notebooks.

In my scheduler.ipynb I'm using multiprocessing, which is what some people have had success with. I create processes from a base notebook, and this seems to always work the first time it's run: I can run 3 notebooks with sleep 10 in 13 seconds. If I have a subsequent cell that attempts to run the exact same thing, the processes it spawns (multiple notebooks) hang indefinitely. I've tried adding code to make sure the spawned processes have exit codes and have completed, even calling terminate on them once they are done. No luck: my second attempt never completes.

If I do:

sean@server:~$ ps aux | grep ipython 
root      2129  0.1  0.2 1117652 176904 ?      Ssl  19:39   0:05 /opt/conda/anaconda/bin/python /opt/conda/anaconda/bin/ipython kernel -f /root/.local/share/jupyter/runtime/kernel-eee374ff-0760-4490-8ed0-db03fed84f0c.json
root      3418  0.1  0.2 1042076 173652 ?      Ssl  19:42   0:03 /opt/conda/anaconda/bin/python /opt/conda/anaconda/bin/ipython kernel -f /root/.local/share/jupyter/runtime/kernel-3e2f09e8-969f-41c9-81cc-acd2ec4e3d54.json
root      4332  0.1  0.2 1042796 174896 ?      Ssl  19:44   0:04 /opt/conda/anaconda/bin/python /opt/conda/anaconda/bin/ipython kernel -f /root/.local/share/jupyter/runtime/kernel-bbd4575c-109a-4fb3-b6ed-372beb27effd.json
root     17183  0.2  0.2 995344 145872 ?       Ssl  20:26   0:02 /opt/conda/anaconda/bin/python /opt/conda/anaconda/bin/ipython kernel -f /root/.local/share/jupyter/runtime/kernel-27c48eb1-16b4-4442-9574-058283e48536.json

I see what appear to be 4 running kernels (4 processes). When I view the running notebooks, I see there are 6 running notebooks. This seems to be supported by the docs, which say a few kernels can service multiple notebooks: "A kernel process can be connected to more than one frontend simultaneously."

But since the ipython kernels continue to run, I suspect something bad is happening and the spawned processes aren't being reaped. Some say it's not possible using multiprocessing. Others have described the same problem.

import re
import os
import multiprocessing

from os.path import isfile
from datetime import datetime

import papermill as pm
import nbformat

# avoid "RuntimeError: This event loop is already running"
# it seems papermill used to support this but it is now undocumented: 
#  papermill.execute_notebook(nest_asyncio=True)
import nest_asyncio
nest_asyncio.apply()

import company.config


# # Supporting Functions

# In[ ]:


def get_papermill_parameters(notebook,
                             notebook_prefix='/mnt/jupyter',
                             notebook_suffix='.ipynb'):
  if isinstance(notebook, list):
    notebook_path = notebook[0]
    parameters = notebook[1]
    tag = '_' + notebook[2] if notebook[2] is not None else ''
  else:
    notebook_path = notebook
    parameters = None
    tag = ''
    
  basename = os.path.basename(notebook_path)
  dirpath = re.sub(basename + '$', '', notebook_path)
  this_notebook_suffix = notebook_suffix if not re.search(notebook_suffix + '$', basename) else ''
  
  input_notebook = notebook_prefix + notebook_path + this_notebook_suffix
  scheduler_notebook_dir = notebook_prefix + dirpath + 'scheduler/'
  
  if not os.path.exists(scheduler_notebook_dir):
    os.makedirs(scheduler_notebook_dir)
    
  output_notebook = scheduler_notebook_dir + basename 
  
  return input_notebook, output_notebook, this_notebook_suffix, parameters, tag


# In[ ]:


def add_additional_imports(input_notebook, output_notebook, current_datetime):          
  notebook_name = os.path.basename(output_notebook) 
  notebook_dir = re.sub(notebook_name, '', output_notebook)
  temp_dir = notebook_dir + current_datetime + '/temp/'
  results_dir = notebook_dir + current_datetime + '/'
  
  if not os.path.exists(temp_dir):
    os.makedirs(temp_dir)
  if not os.path.exists(results_dir):
    os.makedirs(results_dir) 
    
  updated_notebook = temp_dir + notebook_name 
  first_cell = nbformat.v4.new_code_cell("""
    import import_ipynb
    import sys
    sys.path.append('/mnt/jupyter/lib')""")
        
  metadata = {"kernelspec": {"display_name": "PySpark", "language": "python", "name": "pyspark"}}
  existing_nb = nbformat.read(input_notebook, nbformat.current_nbformat)
  cells = existing_nb.cells
  cells.insert(0, first_cell)
  new_nb = nbformat.v4.new_notebook(cells = cells, metadata = metadata)
  nbformat.write(new_nb, updated_notebook, nbformat.current_nbformat)
  output_notebook = results_dir + notebook_name
  
  return updated_notebook, output_notebook


# In[ ]:


# define this function so it is easily passed to multiprocessing
def run_papermill(input_notebook, output_notebook, parameters):
  pm.execute_notebook(input_notebook, output_notebook, parameters, log_output=True)


# # Run All of the Notebooks

# In[ ]:


def run(notebooks, run_hour_utc=10, scheduler=True, additional_imports=False,
        parallel=False, notebook_prefix='/mnt/jupyter'):
  """
  Run provided list of notebooks on a schedule or on demand.

  Args:
    notebooks (list): a list of notebooks to run
    run_hour_utc (int): hour to run notebooks at
    scheduler (boolean): when set to True (default value) notebooks will run at run_hour_utc.
                         when set to False notebooks will run on demand.
    additional_imports (boolean): set to True if you need to add additional imports into your notebook
    parallel (boolean): whether to run the notebooks in parallel
    notebook_prefix (str): path to jupyter notebooks
  """
  if not scheduler or datetime.now().hour == run_hour_utc:  # Only run once a day on an hourly cron job.
    now = datetime.today().strftime('%Y-%m-%d_%H%M%S')
    procs = []
    notebooks_base_url = company.config.cluster['resources']['daedalus']['notebook'] + '/notebooks'
    
    if parallel and len(notebooks) > 10:
      raise Exception(f"You are trying to run {len(notebooks)} notebooks. We recommend a maximum of 10 be run at once.")

    for notebook in notebooks:
      input_notebook, output_notebook, this_notebook_suffix, parameters, tag = get_papermill_parameters(notebook, notebook_prefix)
      
      if is_interactive_notebook(input_notebook):
        print(f"Not running Notebook '{input_notebook}' because it's marked interactive-only.")
        continue
      
      if additional_imports:
        input_notebook, output_notebook = add_additional_imports(input_notebook, output_notebook, now)
      else:
        output_notebook = output_notebook + tag + '_' + now + this_notebook_suffix
      
      print(f"Running Notebook: '{input_notebook}'")
      print(" - Parameters: " + str(parameters))
      print(f"Saving Results to: '{output_notebook}'")
      print("Link: " + re.sub(notebook_prefix, notebooks_base_url, output_notebook))
    
      if not os.path.isfile(input_notebook):
        print(f"ERROR! Notebook file does not exist: '{input_notebook}'")
      else:
        try:
          if parameters is not None:
            parameters.update({'input_notebook':input_notebook, 'output_notebook':output_notebook})
          if parallel:
            # trailing comma in args is in documentation for multiprocessing- it seems to matter
            proc = multiprocessing.Process(target=run_papermill, args=(input_notebook, output_notebook, parameters,))
            print("starting process")
            proc.start()
            procs.append(proc)
            
          else:
            run_papermill(input_notebook, output_notebook, parameters)
            
        except Exception as ex:
          print(ex)
          print(f"ERROR! See full error in: '{output_notebook}'\n\n")
          
      if additional_imports:
        temp_dir = re.sub(os.path.basename(input_notebook), '', input_notebook)
        if os.path.exists(temp_dir):
          os.system(f"rm -rf '{temp_dir}'")
    
    if procs:
      print("joining")
      for proc in procs:
        proc.join()
    
    if procs:
      print("terminating")
      for proc in procs:
        print(proc.is_alive())
        print(proc.exitcode)
        proc.terminate()
    
    print(f"Done: Processed all {len(notebooks)} notebooks.")
    
  else:
    print(f"Waiting until {run_hour_utc}:00:00 UTC to run.")

I'm using python==3.6.12, papermill==2.2.2

jupyter core     : 4.7.0
jupyter-notebook : 5.5.0
ipython          : 7.16.1
ipykernel        : 5.3.4
jupyter client   : 6.1.7
ipywidgets       : 7.2.1

Have you tried using the subprocess module? It seems like a better option for you than multiprocessing. It lets you asynchronously spawn sub-processes that run in parallel, and it can be used to invoke commands and programs as if you were using the shell. I find it really useful for writing Python scripts instead of bash scripts.

So you could use your main notebook to run your other notebooks as independent sub-processes in parallel with subprocess.run (or subprocess.Popen), as in the sketch below.
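A minimal sketch of that approach, assuming the papermill CLI is available on PATH (the notebook paths are placeholders): each notebook runs through the command line in its own OS process, so nothing is shared with the scheduler notebook's kernel or event loop.

import subprocess

notebooks = [
    ('input_a.ipynb', 'output_a.ipynb'),
    ('input_b.ipynb', 'output_b.ipynb'),
]

# Popen starts each papermill run without blocking, so they execute in parallel.
procs = [subprocess.Popen(['papermill', in_nb, out_nb]) for in_nb, out_nb in notebooks]

# wait() reaps each child process and returns its exit code.
for (in_nb, out_nb), proc in zip(notebooks, procs):
    print(out_nb, 'finished with exit code', proc.wait())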

The solution

I implemented a parallel Jupyter notebook executor using a ProcessPoolExecutor (which uses multiprocessing under the hood). If you want to adapt it to your code, here's the implementation. This is a general executor, so there are a bunch of things that do not apply to your use case.
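For illustration, here's a bare-bones sketch of the same idea with concurrent.futures.ProcessPoolExecutor. This is not the Ploomber implementation itself; the paths, worker count, and kernel name are placeholders, and it's written to be run as a standalone script.

from concurrent.futures import ProcessPoolExecutor

import papermill as pm


def run_one(input_nb, output_nb):
    # each worker process executes one notebook via papermill
    pm.execute_notebook(input_nb, output_nb, kernel_name='python3')


jobs = [
    ('input.ipynb', 'output-1.ipynb'),
    ('input.ipynb', 'output-2.ipynb'),
]

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_one, i, o) for i, o in jobs]
        for future in futures:
            future.result()  # re-raises any exception from the worker process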

If you want to use the library, here's a snippet you can use:

from pathlib import Path
from ploomber import DAG
from ploomber.tasks import NotebookRunner
from ploomber.products import File
from ploomber.executors import Parallel

dag = DAG(executor=Parallel())


engine = None

NotebookRunner(
    Path('input.ipynb'),
    File('output-1.ipynb'),
    dag=dag,
    name='one',
    papermill_params={'engine': engine})


NotebookRunner(
    Path('input.ipynb'),
    File('output-2.ipynb'),
    dag=dag,
    name='two',
    papermill_params={'engine': engine})

Note: as of Ploomber 0.20, notebooks must have a "parameters" cell (you can add an empty one). See instructions here.
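If your notebooks don't have one yet, you can add an empty cell tagged "parameters" programmatically. A small sketch with nbformat, assuming the file name is a placeholder (papermill identifies the cell by the "parameters" cell tag):

import nbformat

nb = nbformat.read('input.ipynb', as_version=4)

# insert an empty code cell tagged "parameters" at the top of the notebook
param_cell = nbformat.v4.new_code_cell('', metadata={'tags': ['parameters']})
nb.cells.insert(0, param_cell)

nbformat.write(nb, 'input.ipynb')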

The explanation

These issues with executing notebooks in parallel (or notebooks inside notebooks) come from the way papermill executes them. It spins up a kernel, and the kernel process is the one running your code; the papermill process only sends messages and waits for responses.

This became a problem in a recent project (I need to monitor resource usage), so I wrote a custom papermill engine that executes notebooks in the same process. This is another option you can try:

pip install papermill ploomber-engine
papermill input.ipynb output.ipynb --engine profiling

Or from Python:

import papermill as pm

pm.execute_notebook('input.ipynb', 'output.ipynb', engine='profiling')

(or, you can change engine=None to engine='profiling' in the first example)
