Multiprocessing deadlocks during large computation using Pool().apply_async

I have an issue in Python 3.7.3 where my multiprocessing operation (using Queue, Pool, and apply_async) deadlocks when handling large computational tasks.

For small computations, this multiprocessing task works just fine. However, for larger computations the task stalls, or deadlocks, altogether without ever exiting the process. I read that this will happen if you "grow your queue without bounds, and you are joining up to a subprocess that is waiting for room in the queue [...] your main process is stalled waiting for that one to complete, and it never will" (Process.join() and queue don't work with large numbers).
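
As I understand it, the failure mode that quote describes can be reproduced with a toy sketch like this (an illustration only, not my actual code): the child's queue feeder thread cannot flush its buffered items into the full pipe, so the child never exits and join() waits forever.

import multiprocessing as mp

def producer(q):
    for i in range(10**6):
        q.put(i)  # items are handed to a background feeder thread to flush into a pipe

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=producer, args=(q,))
    p.start()
    p.join()  # deadlock: the child won't exit until its feeder thread has flushed
              # every item into the (full) pipe, and nothing reads the pipe
              # until join() returns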

I am having trouble converting this concept into code. I would greatly appreciate guidance on refactoring the code I have written below:

import multiprocessing as mp

def listener(q, d):  # task to queue information into a manager dictionary
    while True:
        item_to_write = q.get()
        if item_to_write == 'kill':
            break
        foo = d['region']
        foo.add(item_to_write) 
        d['region'] = foo  # add items and set to manager dictionary


def main():
    manager = mp.Manager()
    q = manager.Queue()
    d = manager.dict()
    d['region'] = set()

    pool = mp.Pool(mp.cpu_count() + 2)
    watcher = pool.apply_async(listener, (q, d))
    jobs = []
    for i in range(24):
        job = pool.apply_async(execute_search, (q, d))  # task for multiprocessing
        jobs.append(job)
    for job in jobs:
        job.get()  # block until each job finishes and returns its result
    q.put('kill')  # signal the listener to exit (see listener function)
    pool.close()
    pool.join()

    print('process complete')


if __name__ == '__main__':
    main()
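
One detail worth noting in listener above: a Manager dict proxy only registers assignments made through the proxy itself, so mutating a nested set in place would be lost. That is why the set is fetched, modified, and reassigned:

d['region'].add(item)  # lost: this mutates a local copy; the manager never sees it
foo = d['region']      # fetch a copy of the set through the proxy
foo.add(item)
d['region'] = foo      # reassign so the manager propagates the update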

Ultimately, I would like to prevent deadlocking altogether to facilitate a multiprocessing task that could operate indefinitely until completion.


BELOW IS THE TRACEBACK WHEN EXITING THE DEADLOCK (CTRL+C) IN BASH:

^CTraceback (most recent call last):
  File "multithread_search_cl_gamma.py", line 260, in <module>
    main(GEOTAG)
  File "multithread_search_cl_gamma.py", line 248, in main
    job.get()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 651, in get
Process ForkPoolWorker-28:
Process ForkPoolWorker-31:
Process ForkPoolWorker-30:
Process ForkPoolWorker-27:
Process ForkPoolWorker-29:
Process ForkPoolWorker-26:
    self.wait(timeout)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 648, in wait
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
    with self._rlock:
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
     self._event.wait(timeout)
  File "/Users/Ira/anaconda3/lib/python3.7/threading.py", line 552, in wait
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 352, in get
    res = self._reader.recv_bytes()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
    with self._rlock:
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
    signaled = self._cond.wait(timeout)
  File "/Users/Ira/anaconda3/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt
   with self._rlock:
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
    with self._rlock:
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
    with self._rlock:
  File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt

Below is the updated script:

import multiprocessing as mp
import queue
import time

def listener(q, d, stop_event):
    while not stop_event.is_set():
        try:
            while True:
                item_to_write = q.get(False)  # non-blocking get
                if item_to_write == 'kill':
                    return
                foo = d['region']
                foo.add(item_to_write)
                d['region'] = foo
        except queue.Empty:
            pass
        time.sleep(0.5)  # back off briefly instead of busy-waiting


def main():
    manager = mp.Manager()
    stop_event = manager.Event()
    q = manager.Queue()
    d = manager.dict()
    d['region'] = set()
    pool = mp.get_context("spawn").Pool(mp.cpu_count() + 2)
    watcher = pool.apply_async(listener, (q, d, stop_event))
    jobs = []
    for i in range(24):
        job = pool.apply_async(execute_search, (q, d))
        jobs.append(job)
    for job in jobs:
        job.get()
    q.put('kill')
    stop_event.set()  # signal the listener to stop once all jobs are done
    pool.close()
    pool.join()
    print('process complete')


if __name__ == '__main__':
    main()

UPDATE:

execute_search runs several processes necessary for the search, so below is the code around where q.put() is called.

On its own, the script takes > 72 hours to finish. No single process completes the entire task; instead, each works individually and references a manager.dict() to avoid repeating work (a sketch of that check appears below, after the method). These tasks run until every tuple has been processed and recorded in the manager.dict().

def area(self, tup, housing_dict, q):
    state, reg, sub_reg = tup[0], tup[1], tup[2]
    for cat in housing_dict:
        """
        computationally expensive, takes > 72 hours
        for a list of 512 tup(s)
        """
        result = self.search_geotag(state, reg, cat, area=sub_reg)
    q.put(tup)

The tup placed on the queue by q.put(tup) is ultimately consumed by the listener function, which adds it to the manager.dict().
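
For illustration, here is a hypothetical sketch of how a worker can consult the shared set before doing the expensive search (the tuples parameter and the skip check are assumptions, not code from the question):

def execute_search(q, d, tuples):
    for tup in tuples:
        if tup in d['region']:  # already handled by another worker; skip it
            continue
        # ... expensive search for this tuple ...
        q.put(tup)  # let the listener record the tuple as done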

Since listener and execute_search share the same queue object, there could be a race where execute_search gets 'kill' from the queue before listener does; listener would then be stuck in a blocking get() forever, since no more items arrive.

For that case you can use an Event object to signal all processes to stop:

import multiprocessing as mp
import queue

def listener(q, d, stop_event):
    while not stop_event.is_set():
        try:
            item_to_write = q.get(timeout=0.1)  # short timeout so the event is re-checked
            foo = d['region']
            foo.add(item_to_write)
            d['region'] = foo
        except queue.Empty:
            pass
    print("Listener process stopped")

def main():
    manager = mp.Manager()
    stop_event = manager.Event()
    q = manager.Queue()
    d = manager.dict()
    d['region'] = set()
    pool = mp.get_context("spawn").Pool(mp.cpu_count() + 2)
    watcher = pool.apply_async(listener, (q, d, stop_event))
    jobs = []
    for i in range(24):
        job = pool.apply_async(execute_search, (q, d))
        jobs.append(job)
    try:
        for job in jobs:
            job.get(300)  # get the result, or raise TimeoutError after 300 seconds
    except mp.TimeoutError:
        pool.terminate()
    stop_event.set()  # stop the listener process
    pool.close()
    pool.join()
    print('process complete')


if __name__ == '__main__':
    main()
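
If execute_search is itself a long-running loop, it can watch the same event so that a terminated run also stops the workers promptly. A hypothetical sketch (the question never shows execute_search's body, and the extra stop_event and tuples parameters are assumptions):

def execute_search(q, d, stop_event, tuples):
    for tup in tuples:
        if stop_event.is_set():  # the main process asked everyone to stop
            return
        # ... expensive search for this tuple ...
        q.put(tup)

The call site would then pass the event along as well: pool.apply_async(execute_search, (q, d, stop_event, tuples)).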
