
Python multiprocessing efficiently on Windows

Say we split a number's workload into separate domains, e.g. 100 split into [0, 25], [25, 50], [50, 75], [75, 100]. Then we send each of those 4 ranges out to one of 4 separate processes to be computed, and recombine the results into a single answer for the number 100. We repeat this many times in a row, with the processes needing to act as one unit, for thousands of numbers split into separate domains like [0, 25], [25, 50], [50, 75], [75, 100]. An efficiency problem occurs if we have to close the processes just to make them act as a single group that was processed for an answer. Since Windows handles processes far worse than Unix, we are forced to use the "spawn" start method instead of fork, and spawn is slow at creating processes. So I figured: why not just keep the processes open and pass data to and from them, without having to open and close them for every iterated group of parallel work? The example code below does this. It keeps the processes open as Consumer objects whose run() method loops forever (in a while loop), requesting a next_task with .get() from a JoinableQueue:

import multiprocessing


class Consumer(multiprocessing.Process):

    def __init__(self, task_queue, result_queue):
        multiprocessing.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        while True:
            next_task = self.task_queue.get()
            if next_task is None:
                # Poison pill shutdown of .get() loop with break
                self.task_queue.task_done()
                break
            answer = next_task()
            self.task_queue.task_done()
            self.result_queue.put(answer)
        return


class Task(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __call__(self):
        for i in range(self.b):
            if self.a % i == 0:
                return 0
        return 1


if __name__ == '__main__':
    # Establish communication queues
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()

    # Number of consumers equal to system cpu_count
    num_consumers = multiprocessing.cpu_count() 
    
    # Make a list of Consumer processes, ready to be started.
    consumers = [ Consumer(tasks, results) for i in range(num_consumers) ]

    for w in consumers:
        w.start()

    # Enqueue jobs for the Consumer processes' run() while-loop to .get() a workload:
    num_jobs = 10
    for i in range(num_jobs):
        tasks.put(Task(i, 100)) # Similar jobs would be reiterated before poison pill.

    # We start to .get() the results in a different loop-
    for _ in range(num_jobs):  # -so the above loop enqueues all jobs without- 
        result = results.get() # -waiting for the previous .put() to .get() first.
   
    # Add a poison pill for each consumer
    for i in range(num_consumers): # We only do this when all computation is done.
        tasks.put(None) # Here we break the while-loops of all the open Consumer processes.
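
For illustration, the domain splitting described at the top might look something like this (split_domains is a hypothetical helper, not part of the code above):

def split_domains(n, parts):
    # Split the range [0, n] into `parts` roughly equal sub-ranges.
    step = n // parts
    return [(i * step, (i + 1) * step if i < parts - 1 else n)
            for i in range(parts)]

# split_domains(100, 4) -> [(0, 25), (25, 50), (50, 75), (75, 100)]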

This code is just an example. In other variations of this code, with many more iterations of tasks.put() and results.get(), I need a way to make an enqueued Task object return early on an outside signal, before it finishes calculating its answer on its own. That would free up resources when one of the other processes working on the same split-number group has already produced the answer. The __call__ method has to be present for the Task object to be callable like a function after tasks.put(Task(i, 100)). I have spent the last 2 weeks trying to figure out an efficient way of doing this. Do I need to take a whole different approach? Don't misunderstand my dilemma: I am working with code that works, just not as efficiently as I would like on Microsoft Windows. Any help would be greatly appreciated.

Doesn't a Task object exist in the same process as the Consumer that dequeued it? If so, couldn't I tell every Consumer's run() loop to abandon its currently running Task, without breaking out of the while loop (via the poison pill), so it can instantly accept another Task without having to close and reopen the process? The overhead really adds up when you are opening and closing thousands of processes for an iterated calculation. I have tried Events, Managers, and other Queues. There doesn't seem to be an efficient way to externally make a Task return instantly to its parent Consumer, so that it stops wasting resources once another Consumer has returned an answer that makes the remaining Tasks irrelevant, since they are all working together on a single number split into groups.

What you have done is implement your own multiprocessing pool, but why? Were you not aware of the existence of the concurrent.futures.ProcessPoolExecutor and multiprocessing.pool.Pool classes? The latter is actually the one better suited to your particular problem.

Both classes implement multiprocessing pools and various methods for submitting tasks to the pool and getting results back from those tasks. But in your particular case the tasks you are submitting all attempt to solve the same problem, you are only interested in the first available result, and once you have it you are done, so you need to be able to terminate any remaining running tasks. Only multiprocessing.pool.Pool allows you to do that.

The following code uses the method Pool.apply_async to submit a task. This call does not block; rather, it returns an AsyncResult instance with a blocking get method that you can call to retrieve the result of the submitted task. But since in general you might be submitting many tasks, we don't know which of these instances to call get on. So the solution is instead to use the callback argument of apply_async, specifying a function that will be called asynchronously with the return value of each task as it becomes available. Then the problem becomes communicating that result back to the main code. There are two ways, shown after the short sketch below:
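
First, a minimal sketch of the plain AsyncResult approach described above, illustrating why we don't simply call get: it blocks on whichever AsyncResult you happen to wait on, not on whichever task finishes first (worker1 and worker2 are the same demo functions used in the two methods):

from multiprocessing import Pool
import time


def worker1(x):
    time.sleep(3) # emulate working on the problem
    return 9 # the solution

def worker2(x):
    time.sleep(1) # emulate working on the problem
    return 9 # the solution


if __name__ == '__main__':
    with Pool(2) as pool:
        r1 = pool.apply_async(worker1, args=(1,))
        r2 = pool.apply_async(worker2, args=(2,))
        # r1.get() blocks for ~3 seconds even though worker2 finished after ~1 second,
        # so we cannot simply wait on the "first" AsyncResult without knowing which that is.
        print(r1.get())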

Method 1: via global variable

from multiprocessing import Pool
import time


def worker1(x):
    time.sleep(3) # emulate working on the problem
    return 9 # the solution

def worker2(x):
    time.sleep(1) # emulate working on the problem
    return 9 # the solution

def callback(answer):
    global solution
    # receives the returned result of each submitted task;
    # since we are only interested in the first result, save it in a global and stop the pool:
    solution = answer
    pool.terminate() # kill all tasks


if __name__ == '__main__':
    t = time.time()
    pool = Pool(2) # just two processes in the pool for demo purposes
    # submit two tasks:
    pool.apply_async(worker1, args=(1,), callback=callback)
    pool.apply_async(worker2, args=(2,), callback=callback)
    # wait for all tasks to terminate:
    pool.close()
    pool.join()
    print(solution)
    print('Total elapsed time:', time.time() - t)

Prints:

9
Total elapsed time: 1.1378364562988281

Method 2: via a Queue

from multiprocessing import Pool
from queue import Queue  # a plain thread-safe queue suffices: the callback runs in a thread of the main process
import time


def worker1(x):
    time.sleep(3) # emulate working on the problem
    return 9 # the solution

def worker2(x):
    time.sleep(1) # emulate working on the problem
    return 9 # the solution

def callback(solution):
    # gets all the returned results from submitted tasks
    # since we are just interested in the first returned result, write it to the queue:
    q.put_nowait(solution)


if __name__ == '__main__':
    t = time.time()
    q = Queue()
    pool = Pool(2) # just two processes in the pool for demo purposes
    # submit two tasks:
    pool.apply_async(worker1, args=(1,), callback=callback)
    pool.apply_async(worker2, args=(2,), callback=callback)
    # wait for first returned result from callback:
    solution = q.get()
    print(solution)
    pool.terminate() # kill all tasks in the pool
    print('Total elapsed time:', time.time() - t)

Prints:

9
Total elapsed time: 1.1355643272399902

Update

Even under Windows, the time to create and re-create the pool may be relatively insignificant compared to the time the tasks require to complete, especially for later iterations, i.e. larger values of n. If you are calling the same worker function, then a third way is to use the pool method imap_unordered. I also include code that measures, on my desktop, the overhead of starting a new pool instance:

from multiprocessing import Pool
import time


def worker(x):
    time.sleep(x) # emulate working on the problem
    return 9 # the solution


if __name__ == '__main__':
    # POOLSIZE = multiprocessing.cpu_count()
    POOLSIZE = 8 # on my desktop
    # how long does it take to start a pool of size 8?
    t1 = time.time()
    for i in range(16):
        pool = Pool(POOLSIZE)
        pool.terminate()
    t2 = time.time()
    print('Average pool creation time: ', (t2 - t1) / 16)

    # POOLSIZE number of calls:
    arguments = [7, 6, 1, 3, 4, 2, 9, 6]
    pool = Pool(POOLSIZE)
    t1 = time.time()
    results = pool.imap_unordered(worker, arguments)
    it = iter(results)
    first_solution = next(it)
    t2 = time.time()
    pool.terminate()
    print('Total elapsed time:', t2 - t1)
    print(first_solution)

Prints:

Average pool creation time:  0.053139880299568176
Total elapsed time: 1.169790506362915
9
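
Putting the two ideas of this update together, here is a rough sketch of re-creating a pool per number and stopping at the first useful result via imap_unordered. The check_range worker and the way the divisor range is split are assumptions for illustration, not your actual code:

from multiprocessing import Pool


def check_range(args):
    # Hypothetical worker: test one sub-range of divisors for `number`.
    number, start, stop = args
    for i in range(max(start, 2), stop): # skip 0 and 1, which are not useful divisors
        if number % i == 0:
            return 0
    return 1


if __name__ == '__main__':
    POOLSIZE = 4
    for number in range(3, 100): # the iterated outer loop of numbers
        # split the divisor range [0, number) into POOLSIZE domains:
        step = number // POOLSIZE + 1
        domains = [(number, i * step, min((i + 1) * step, number)) for i in range(POOLSIZE)]
        is_prime = True
        with Pool(POOLSIZE) as pool:
            for r in pool.imap_unordered(check_range, domains):
                if r == 0:
                    is_prime = False
                    pool.terminate() # a divisor was found; the remaining tasks are irrelevant
                    break
        if is_prime:
            print(number, 'is prime')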

Update 2

Here is the dilemma: you have multiple processes working on pieces of a puzzle. As soon as one process discovers, for example, that a number is divisible by one of the numbers in its assigned range, there is no point in the other processes, which are testing different ranges, completing their testing. You can do one of three things. You can do nothing and let the processes finish before starting the next iteration, but that delays the next iteration. I have already suggested that you terminate the processes, which frees up the processors, but that requires you to create new processes, which you find unsatisfactory.

I can only think of one other possibility, which I present below using your approach to multiprocessing. A shared-memory variable named stop, created with multiprocessing.Value, is made available to each process as a global variable and set to 0 before each iteration. When a task determines that it will return a value of 0, so there is no point in tasks running in other processes continuing, it sets stop to 1. This means tasks must periodically inspect the value of stop and return if it has been set to 1, which of course adds some extra cycles to the processing. In the demo below I actually have 100 tasks queued up for 8 processors, but the last 92 tasks should discover immediately that stop has been set and return on their first loop iteration.

Just as an aside: the original code used a multiprocessing.JoinableQueue instance for queueing the tasks rather than a multiprocessing.Queue, and task_done was called on this instance as messages were taken off the queue. Yet join was never called on this queue (which would tell you when all the messages had been processed), defeating the whole purpose of having such a queue. In fact, there is no need for a JoinableQueue, since the main process has submitted num_jobs jobs, expects num_jobs messages on the results queue, and can simply loop and pull the expected number of results from the results queue. I've substituted a simple Queue for the JoinableQueue, leaving the original code in place but commented out. Also, the Consumer processes can be created as daemon processes (with argument daemon=True); they then terminate automatically when all non-daemon processes, i.e. the main process, terminate, which obviates the need for the special "poison-pill" None task messages. I have made that change and again left the original code intact but commented out for comparison.
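
Before the revised code, here is a minimal sketch (hypothetical names, separate from your program) of how JoinableQueue is normally paired with task_done and join: the producer blocks in join until every item it put has been marked done, which is exactly the feature the original code never used:

import multiprocessing


def consumer(task_queue):
    # runs forever as a daemon; the main process simply exits when its own work is done
    while True:
        item = task_queue.get()
        try:
            pass  # ... process the item here ...
        finally:
            task_queue.task_done()  # mark this item as processed


if __name__ == '__main__':
    tasks = multiprocessing.JoinableQueue()
    worker = multiprocessing.Process(target=consumer, args=(tasks,), daemon=True)
    worker.start()
    for i in range(10):
        tasks.put(i)
    tasks.join()  # blocks until task_done() has been called once for every put()

The revised version of your code follows.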

import multiprocessing


class Consumer(multiprocessing.Process):

    def __init__(self, task_queue, result_queue, stop):
        # make ourself a daemon process:
        multiprocessing.Process.__init__(self, daemon=True)
        self.task_queue = task_queue
        self.result_queue = result_queue
        self.stop = stop

    def run(self):
        global stop
        stop = self.stop
        while True:
            next_task = self.task_queue.get()
            """
            if next_task is None:
                # Poison pill shutdown of .get() loop with break
                #self.task_queue.task_done()
                break
            """
            answer = next_task()
            #self.task_queue.task_done()
            self.result_queue.put(answer)
        # return


class Task(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __call__(self):
        global stop
        # start the range from 1 to avoid dividing by 0:
        for i in range(1, self.b):
            # how frequently should this check be made?
            if stop.value == 1:
                return 0
            if self.a % i == 0:
                stop.value = 1
                return 0
        return 1


if __name__ == '__main__':
    # Establish communication queues
    #tasks = multiprocessing.JoinableQueue()
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()

    # Number of consumers equal to system cpu_count
    num_consumers = multiprocessing.cpu_count()

    # Make a list of Consumer processes, ready to be started.
    stop = multiprocessing.Value('i', 0)
    consumers = [ Consumer(tasks, results, stop) for i in range(num_consumers) ]

    for w in consumers:
        w.start()

    # Enqueue jobs for the Consumer processes' run() while-loop to .get() a workload:
    # many more jobs than processes, but they will stop almost immediately once they check stop.value:
    num_jobs = 100
    stop.value = 0 # make sure it is 0 before an iteration
    for i in range(num_jobs):
        tasks.put(Task(i, 100)) # Similar jobs would be reiterated before poison pill.

    # We .get() the results, one per submitted job:
    results = [results.get() for _ in range(num_jobs)]
    print(results)
    print(0 in results)

    """
    # Add a poison pill for each consumer
    for i in range(num_consumers): # We only do this when all computation is done.
        tasks.put(None) # Here we break the while-loops of all the open Consumer processes.
    """

Prints:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
True

I finally figured out a solution!

import multiprocessing


class Consumer(multiprocessing.Process):

    def __init__(self, task_queue, result_queue, state):
        multiprocessing.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue
        self.state = state


    def run(self):
        while True:
            next_task = self.task_queue.get()
            if next_task is None:
                self.task_queue.task_done()
                break
            # answer = next_task() is where the Task object gets called.
            # Python executes line by line, so it stays on this line until "answer" is assigned.
            # Putting the conditional on the same line means we skip calling the Task if state.is_set():
            answer = next_task() if not self.state.is_set() else 0
            self.task_queue.task_done()
            self.result_queue.put(answer)
        return


class Task(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __call__(self):
        for i in range(1, self.b): # start at 1 to avoid dividing by zero
            if self.a % i == 0:
                return 0
        return 1


def initialize(n_list, tasks, results, states):
    # `number` and `cpu_cnt` are assumed to be defined in the enclosing program
    # (`number` is the upper bound passed to each Task, `cpu_cnt` the number of consumers).
    sum_list = []
    for i in range(cpu_cnt):
        tasks.put(Task(n_list[i], number))
    for _ in range(cpu_cnt):
        sum_list.append(int(results.get()))
        if 0 in sum_list:
            states.set()
    if 0 in sum_list:
        states.clear()
        return None
    else:
        states.clear()
        return number


if __name__ == '__main__':
    states = multiprocessing.Event() # ADD THIS BOOLEAN FLAG EVENT!
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()

    cpu_cnt = multiprocessing.cpu_count() 

    # Add the states Event to the Consumer argument list:
    consumers = [ Consumer(tasks, results, states) for i in range(cpu_cnt) ]

    for w in consumers:
        w.start()

    n_list = [x for x in range(1000)]
    iter_list = []
    for _ in range(1000):
        iter_list.append(initialize(n_list, tasks, results, states))

    # initialize() already drained the results queue for each iteration,
    # so all that remains is to shut the consumers down with poison pills:
    for i in range(cpu_cnt):
        tasks.put(None)

If the open Consumer objects assign answer from the next_task() call with a conditional expression on a single line, then they skip calling the Task as soon as the state Event flag is set, because execution is held at that line until the variable "answer" is assigned. It's a great workaround. It effectively makes queued Task objects interruptible inside the while loop of the Consumer that runs them, via the "answer" assignment. After more than 2 weeks I found a solution that speeds things up! I tested it on a working version of the code and it is faster. With a method like this, it is possible to keep a number of processes open indefinitely, pass many different Task objects through the Consumers' JoinableQueue loop, and get large amounts of data processed in parallel at high speed. It's almost as if all cores function like one "mega core": every process stays open on its core and they work together on any desired iterative stream of work.
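
A related variant (just a sketch, not the code above): the same Event can also be polled inside the Task's own loop, similar to the shared stop flag in the earlier answer, so that a task which is already part-way through its work can return early too. Here the Consumer would hand its state Event into the call; the StoppableTask name and the changed call signature are assumptions:

class StoppableTask(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __call__(self, state):
        # state is the multiprocessing.Event() the Consumer was constructed with
        for i in range(1, self.b): # start at 1 to avoid dividing by zero
            if state.is_set(): # another worker already produced the answer
                return 0
            if self.a % i == 0:
                state.set() # tell the other workers their tasks are now irrelevant
                return 0
        return 1

# Inside Consumer.run() the assignment would then become something like:
#     answer = next_task(self.state) if not self.state.is_set() else 0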

Here is a sample output from one of my Python multiprocessing programs implementing this method on 8 hyperthreaded cores:

Enter prime number FUNCTION:n+n-1
Enter the number for 'n' START:1
Enter the number of ITERATIONS:100000
Progress: ########## 100%
Primes:
ƒ(2) = 3
ƒ(3) = 5
ƒ(4) = 7
ƒ(6) = 11
ƒ(7) = 13
ƒ(9) = 17

etc etc...

ƒ(99966) = 199931
ƒ(99967) = 199933
ƒ(99981) = 199961
ƒ(99984) = 199967
ƒ(100000) = 199999
Primes found: 17983
Prime at end of list has 6 digits.
Overall process took 1 minute and 2.5 seconds.

All 17,983 primes from 1 to 200,000 (except for 2) found by full modulus testing in about 1 minute.

On a 3990x 128 thread AMD Threadripper it would take ~8 seconds.

Here is another output on 8 hyperthreaded cores:

Enter prime number FUNCTION:((n*2)*(n**2)**2)+1
Enter the number for 'n' START:1
Enter the number of ITERATIONS:1000
Progress: ########## 100%
Primes:
ƒ(1) = 3
ƒ(3) = 487
ƒ(8) = 65537
    
etc... etc...
    
ƒ(800) = 655360000000001
ƒ(839) = 831457011176399
ƒ(840) = 836423884800001
ƒ(858) = 929964638281537
ƒ(861) = 946336852720603
ƒ(884) = 1079670712526849
ƒ(891) = 1123100229130903
ƒ(921) = 1325342566697203
ƒ(953) = 1572151878119987
ƒ(959) = 1622269605897599
ƒ(983) = 1835682572370287
Primes found: 76
Prime at end of list has 16 digits.
Overall process took 1 minute and 10.6 seconds.
