Preventing thread duplication in a multiprocessed Python app

I have a Python app that runs multiple jobs in sub-processes launched via multiprocessing.Process. The parent app also launches a thread to report progress to a database. However, I've noticed that if any of the jobs launch sub-processes of their own, they duplicate this thread, causing data corruption in the database. For example, when a grand-child subprocess completes, its thread marks the parent job as complete in the database, because it thinks it is the parent process, even though the parent process is still running.

How would I use multiprocessing.Process so that it doesn't copy any currently running threads? Is the simplest option to record the original PID in my thread, and to exit immediately if the "current" PID doesn't match that value?

I saw this similar question posted last year, but it seems to have been ignored.

Your description of the problem suggests that a background thread in the parent process continues to exist and execute in the child process. That's not possible; at least, it isn't possible on POSIX systems. What's happening in your case is something else. I'll offer some speculation about that below, followed by a suggestion of how to avoid the issue. Taking these points in turn...

1. Only one thread survives forking.

Only the thread that calls fork() is still alive after forking. Here's a small example demonstrating that other threads don't continue execution in the child process:

import os
import threading
import time

def output():
    time.sleep(3)
    print("Thread executing in process: %d" % os.getpid())

thread = threading.Thread(target=output)
thread.start()
os.fork()
print("Pid: %d" % os.getpid())

You'll see both parent and child print their pid to stdout, but the second thread will produce output only in the parent.

So having the monitor thread check its pid or in some other way condition on which process it's running in won't make a difference; that thread is only ever executing in one process.

2. Some ways in which forking can cause problems like what you're seeing.

Forking can cause corruption of program state in various ways. For example:

  • Objects referenced in a thread that dies as a result of forking can go out of scope and thus have their finalizers invoked. This can cause problems if, say, such an object represents a network resource and invocation of its __del__ method causes one end of a connection to be unexpectedly closed (see the sketch just after this list).
  • Any unflushed buffered IO will cause problems, because the buffers are duplicated in the child process, and both copies will eventually be flushed.
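
Here's a minimal sketch of that first hazard. The Connection class is purely hypothetical, standing in for any object whose __del__ tears down a shared resource, and the trigger here is process exit rather than thread death, but the underlying problem is the same: after a fork, every duplicated object gets finalized once per process:

import os

class Connection:
    # Hypothetical stand-in for an object wrapping a network resource.
    def __del__(self):
        # A real client might send a protocol-level "goodbye" here,
        # tearing down a session the other process still needs.
        print("Finalizer running in process: %d" % os.getpid())

conn = Connection()
os.fork()
# Both parent and child now hold a copy of `conn`; each copy is
# finalized when its interpreter shuts down, so whichever process
# exits first can close a resource the other is still using.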

Note that the second point doesn't even require threading. Consider the following:

f = open("testfile", "w", 1024)
f.write("a")
os.fork()

We wrote one character to testfile, and we did so in the parent before forking. But we forked while that content remained unflushed, and so:

alp:~ $ wc -c testfile
      2 testfile

The file contains two characters because the output buffer was copied to the child, and both parent and child eventually flushed their buffers.
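
If you must fork while buffered files are open, you can sidestep this particular hazard by emptying the buffers first. A minimal sketch, re-running the demo with a flush added before the fork:

import os

f = open("testfile", "w", 1024)
f.write("a")
f.flush()  # empty the buffer before forking...
os.fork()  # ...so the child inherits no pending output

With that change, wc -c reports a single character, since neither process has leftover data to flush at exit.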

I suspect your problems are caused by something like this second issue (although I happily admit that this is pure speculation).

3. Re-architecting to avoid such problems.

You mentioned in a comment that you can't start the monitor thread after spawning your workers because you need to repeatedly create new workers. It might be easier than you think to restructure your design to avoid that. Instead of spawning a process for each new unit of work, create a set of long-lived workers managed by a controlling process: the controller feeds a queue with specifications of the jobs that need to be processed, and it does so at its leisure. Each worker loops indefinitely, drawing jobs from the queue as they arrive and executing them. (The queue implementation in multiprocessing guarantees that each job description is drawn by only one worker.) You then only need to spawn workers once, early on, and can create the monitor thread after all forking is complete.

Here's a schematic example of that kind of organization:

from multiprocessing import Process, Queue

def do_something_with(job):
    # Hypothetical stand-in for your real job-processing code.
    print("Processing job: %s" % job)

def work(q):
    while True:
        job = q.get()
        if job is None:
            # We've been signaled to stop.
            break
        do_something_with(job)

if __name__ == "__main__":
    queue = Queue()
    NUM_WORKERS = 3
    NUM_JOBS = 20

    # Start workers.
    workers = []
    for _ in range(NUM_WORKERS):
        p = Process(target=work, args=(queue,))
        p.start()
        workers.append(p)

    # Create your monitor thread here.

    # Put work in the queue.  This continues as long as you want.
    for i in range(NUM_JOBS):
        queue.put(i)

    # When there's no more work, put sentinel values in the queue so
    # workers know to exit gracefully.
    for _ in range(NUM_WORKERS):
        queue.put(None)

    # Wait for the workers to drain the queue and exit.
    for p in workers:
        p.join()
