Python Multiprocessing weird behavior when NOT USING time.sleep()

Question

This is the exact code from Python.org. If you comment out the time.sleep() , it crashes with a long exception traceback. I would like to know why.

And, I do understand why Python.org included it in their example code. But artificially creating "working time" via time.sleep() shouldn't break the code when it's removed. It seems to me that the time.sleep() is affording some sort of spin up time. But as I said, I'd like to know from people who might actually know the answer.

A user comment asked me to fill in more details on the environment this was happening in. It was on OSX Big Sur 11.4. Using a clean install of Python 3.95 from Python.org (no Homebrew, etc). Run from within Pycharm inside a venv. I hope that helps add to understanding the situation.

import time
import random

from multiprocessing import Process, Queue, current_process, freeze_support

#
# Function run by worker processes
#

def worker(input, output):
    for func, args in iter(input.get, 'STOP'):
        result = calculate(func, args)
        output.put(result)

#
# Function used to calculate result
#

def calculate(func, args):
    result = func(*args)
    return '%s says that %s%s = %s' % \
        (current_process().name, func.__name__, args, result)

#
# Functions referenced by tasks
#

def mul(a, b):
    #time.sleep(0.5*random.random())   # <--- time.sleep() commented out
    return a * b

def plus(a, b):
    #time.sleep(0.5*random.random()).  # <--- time.sleep() commented out
    return a + b

#
#
#

def test():
    NUMBER_OF_PROCESSES = 4
    TASKS1 = [(mul, (i, 7)) for i in range(20)]
    TASKS2 = [(plus, (i, 8)) for i in range(10)]

    # Create queues
    task_queue = Queue()
    done_queue = Queue()

    # Submit tasks
    for task in TASKS1:
        task_queue.put(task)

    # Start worker processes
    for i in range(NUMBER_OF_PROCESSES):
        Process(target=worker, args=(task_queue, done_queue)).start()

    # Get and print results
    print('Unordered results:')
    for i in range(len(TASKS1)):
        print('\t', done_queue.get())

    # Add more tasks using `put()`
    for task in TASKS2:
        task_queue.put(task)

    # Get and print some more results
    for i in range(len(TASKS2)):
        print('\t', done_queue.get())

    # Tell child processes to stop
    for i in range(NUMBER_OF_PROCESSES):
        task_queue.put('STOP')


if __name__ == '__main__':
    freeze_support()
    test()

This is the traceback if it helps anyone:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

Answer 1

Here's a technical breakdown.

This is a race condition where the main process finishes, and exits before some of the children have a chance to fully start up. As long as a child fully starts, there are mechanisms in-place to ensure they shut down smoothly, but there's an unsafe in-between time. Race conditions can be very system dependent, as it is up to the OS and the hardware to schedule the different threads, as well as how fast they chew through their work.

Here's what's going on when a process is started... Early on in the creation of a child process, it registers itself in the main process so that it will be either join ed or terminate d when the main process exits depending on if it's daemonic ( multiprocessing.util._exit_function ). This exit function was registered with the atexit module on import of multiprocessing.

Also during creation of the child process, a pair of Pipe s are opened which will be used to pass the Process object to the child interpreter (which includes what function you want to execute and its arguments). This requires 2 file handles to be shared with the child, and these file handles are also registered to be closed using atexit .

The problem arises when the main process exits before the child has a chance to read all the necessary data from the pipe (un-pickling the Process object) during the startup phase. If the main process first closes the pipe, then waits for the child to join , then we have a problem. The child will continue spinning up the new python instance until it gets to the point when it needs to read in the Process object containing your function and arguments it should run. It will try to read from a pipe which has already been closed, which is an error.

If all the children get a chance to fully start-up you won't see this ever, because that pipe is only used for startup. Putting in a delay which will in some way guarantee that all the children have some time to fully start up is what solves this problem. Manually calling join will provide this delay by waiting for the children before any of the atexit handlers are called. Additionally, any amount of processing delay means that q.get in the main thread will have to wait a while which also gives the children time to start up before closing. I was never able to reproduce the problem you encountered, but presumably you saw the output from all the TASKS (" Process-1 says that mul(19, 7) = 133 " ). Only one or two of the child processes ended up doing all the work, allowing the main process to get all the results, and finish up before the other children finished startup.

EDIT:

The error is unambiguous as to what's happening, but I still can't figure how it happens... As far as I can tell, the file handles should be closed when calling _run_finalizers() in _exit_function after joining or terminating all active_children rather than before via _run_finalizers(0)

EDIT2:

_run_finalizers will seemingly actually never call Popen.finalizer to close the pipes, because exitpriority is None . I'm very confused as to what's going on here, and I think I need to sleep on it...

Answer 2

Apparently @user2357112supportsMonica was on the right track. It totally solves the problem if you join the processes before exiting the program. Also @Aaron's answer has the deep knowledge as to why this fixes the issue!

I added the following bits of code as was suggested and it totally fixed the need to have time.sleep() in there.

First I gathered all the processes when they were started:

processes: list[Process] = []
# Start worker processes
for i in range(NUMBER_OF_PROCESSES):
    p = Process(target=worker, args=(task_queue, done_queue))
    p.start()
    processes.append(p)

Then at the end of the program I joined them as follows:

# Join the processes
for p in processes:
    p.join()

Totally solved the issues. Thanks for the advice.

Python Multiprocessing weird behavior when NOT USING time.sleep()

Question

2 answers

solution1
2 ACCPTED 2021-06-10 05:46:43

solution2
1 2021-06-10 03:44:41

Python Multiprocessing weird behavior when NOT USING time.sleep()

Question

2 answers

solution1 2 ACCPTED 2021-06-10 05:46:43

solution2 1 2021-06-10 03:44:41

solution1
2 ACCPTED 2021-06-10 05:46:43

solution2
1 2021-06-10 03:44:41