
python multiprocessing queue implementation

I'm having trouble understanding how to implement a queue in the multiprocessing example below. Basically, I want the code to:

1) spawn 2 processes (done)

2) split up my id_list into two portions (done)

3) have each process iterate over its portion of the list, printing out each item, and only close when it's done with the list. I know I have to implement some kind of queueing system and pass it to each worker, but I'm not sure how to do that. Any help would be much appreciated.

from time import sleep
from multiprocessing import Pool, Queue

id_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def mp_worker(record):
    try:
        print(record)
        sleep(1)
    except Exception:
        pass
    print("worker closed")

def mp_handler():
    p = Pool(processes=2)      # number of processes
    p.map(mp_worker, id_list)  # divides id_list between the 2 processes defined above
    p.close()
    p.join()

mp_handler()

Note: the code prints out "worker closed" 10 times. I'd like this statement to be printed only twice (once for each worker, after each worker prints out its 5 numbers from id_list).

This works for me (on Python 3). Instead of using a Pool, I spawn my own two processes:

from multiprocessing import Process, Queue
from time import sleep


id_list = [1,2,3,4,5,6,7,8,9,10]

queue = Queue()

def mp_worker(queue):

    while queue.qsize() > 0:
        record = queue.get()
        print(record)
        sleep(1)

    print("worker closed")

def mp_handler():

    # Spawn two processes, assigning the method to be executed 
    # and the input arguments (the queue)
    processes = [Process(target=mp_worker, args=(queue,)) for _ in range(2)]

    for process in processes:
        process.start()
        print('Process started')

    for process in processes:
        process.join()



if __name__ == '__main__':

    for id in id_list:
        queue.put(id)

    mp_handler()

The list of items to be processed is hardcoded here, but it could also be passed as a second input argument to the mp_worker method.
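One thing to be aware of with the loop above: checking queue.qsize() and then calling queue.get() is not atomic, so with two workers one of them can occasionally block on an empty queue. A common variation (a sketch of my own, not part of the answer above) is to push one None sentinel per worker, so each worker drains the queue and exits cleanly:

from multiprocessing import Process, Queue
from time import sleep

id_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def mp_worker(queue):
    # Keep pulling items until the sentinel (None) arrives, then stop.
    while True:
        record = queue.get()
        if record is None:
            break
        print(record)
        sleep(1)
    print("worker closed")

if __name__ == '__main__':
    num_workers = 2
    queue = Queue()
    for i in id_list:
        queue.put(i)
    for _ in range(num_workers):
        queue.put(None)  # one sentinel per worker

    processes = [Process(target=mp_worker, args=(queue,)) for _ in range(num_workers)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()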

One solution to this question using Pool and a Queue is:


    from time import sleep
    from multiprocessing import Pool, Manager

    id_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    def mp_worker(q):
        try:
            print(q.get())
            sleep(.1)
        except Exception:
            pass
        print("worker closed")

    if __name__ == "__main__":
        # A plain multiprocessing.Queue cannot be passed to pool workers as a
        # task argument, so use a managed queue, whose proxy can be pickled.
        q = Manager().Queue()
        p = Pool(processes=2)  # number of processes
        for x in id_list:
            q.put(x)
        p.map(mp_worker, [q] * len(id_list))  # each of the 10 tasks pulls one value off the queue
        p.close()
        p.join()


You must add values to the queue with put in the main section of your code, and read them back inside the worker function with get. Note that a plain Queue cannot be sent to pool workers as a task argument, which is why the version above uses a Manager().Queue() proxy.
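An alternative pattern (my own sketch, not from the answer above) is to hand a plain Queue to the pool workers through the initializer argument, which travels over the normal process start-up path rather than per-task pickling:

    from time import sleep
    from multiprocessing import Pool, Queue

    id_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    def init_worker(q):
        # Stash the queue in a module-level global so every task run in this
        # worker process can reach it.
        global queue
        queue = q

    def mp_worker(_):
        print(queue.get())
        sleep(.1)

    if __name__ == "__main__":
        q = Queue()
        for x in id_list:
            q.put(x)
        p = Pool(processes=2, initializer=init_worker, initargs=(q,))
        p.map(mp_worker, id_list)
        p.close()
        p.join()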

The print statement you have there is misleading you -- the worker process does not terminate at the end of the function. In fact, the worker processes stay alive until the pool is closed. Additionally, multiprocessing already takes care of breaking up the list into chunks and queueing up each task for you.
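To illustrate that point, here is a small sketch of my own (not from the answer) that tags each record with os.getpid(); with processes=2 only two distinct worker pids appear across all ten records, showing that the two workers stay alive and each handles several tasks:

from multiprocessing import Pool
from time import sleep
import os

id_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def mp_worker(record):
    # Show which long-lived worker process handled this record.
    print('record', record, 'handled by pid', os.getpid())
    sleep(1)

if __name__ == '__main__':
    p = Pool(processes=2)
    p.map(mp_worker, id_list)
    p.close()
    p.join()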

As for your other question, normally you would pass a callback to map_async if you wanted to trigger an asynchronous event upon the entire list being completed. Calling once per chunk takes some mucking about with the internals, but if you really want to you could:

def mapstar_custom(args):
    result = list(map(*args))
    print "Task completed"
    return result
...

pool._map_async(f, x, mapstar_custom, None, None, None).get()
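For comparison, the usual callback route mentioned above is much simpler. A minimal sketch (mine, assuming a trivial worker function): map_async fires the callback once, in the parent process, after the whole list has been processed.

from multiprocessing import Pool
from time import sleep

id_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def mp_worker(record):
    sleep(0.1)
    return record * record

def on_done(results):
    # Called once in the parent when every item in id_list has been processed.
    print('all tasks completed:', results)

if __name__ == '__main__':
    p = Pool(processes=2)
    p.map_async(mp_worker, id_list, callback=on_done)
    p.close()
    p.join()  # the callback has fired by the time join() returns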

Edit: we seem to be conflating terminology. When I say worker I mean the processes the pool spawns, whereas you seem to mean the processes Selenium spawns from those processes (which wasn't in your question). Opening the webdriver only once is easy enough: if you have pool.map(module.task, ...), then in module.py just do:

# ... selenium init here ...

def task(...):
    # ... use webdriver ...

The module is only imported once in each pool worker process, no matter how many times you dispatch that task, so the top-level init happens only once per worker.
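A small two-file sketch of that idea (the file name module.py and the dummy init are my own illustration, standing in for the Selenium setup). The top-level code runs at most once in each process that imports the module, however many tasks that process handles:

# module.py (hypothetical) -- the top-level init stands in for the webdriver setup
from multiprocessing import current_process

print('expensive init in', current_process().name)  # runs at most once per process

def task(item):
    # Every task dispatched to this worker reuses whatever the top-level init created.
    return item, current_process().name

And the driving script:

# main.py
from multiprocessing import Pool
import module

if __name__ == '__main__':
    p = Pool(processes=2)
    print(p.map(module.task, range(10)))
    p.close()
    p.join()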

Since this is the top Google result for Python Multiprocessing Queue implementation I'm going to post a slightly more generalized example.

Consider the following script:

import time
import math
import pprint

def main():
    print('\n' + 'starting . . .' + '\n')

    startTime = time.time()
    my_list = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
    result_list = []

    for num in my_list:
        result_list.append(squareNum(num))
    # end for

    elapsedTime = time.time() - startTime

    print('result_list: ')
    pprint.pprint(result_list)
    print('\n' + 'program took ' + '{:.2f}'.format(elapsedTime) + ' seconds' + '\n')
# end function

def squareNum(num: float) -> float:
    time.sleep(1.0)
    return math.pow(num, 2)
# end function

if __name__ == '__main__':
    main()

This script declares 10 floats, squares them (sleeping for 1 second upon each square to simulate some significant process), then collects the results in a new list. This takes about 10 seconds to run.

Here is a refactored version using multiprocessing's Process and Queue:

from multiprocessing import Process, Queue
import time
import math
from typing import List
import pprint


def main():
    print('\n' + 'starting . . .' + '\n')

    startTime = time.time()
    my_list = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
    result_list = []

    multiProcQueue = Queue()

    processes: List[Process] = []
    for num in my_list:
        processes.append(Process(target=squareNum, args=(num, multiProcQueue,)))
    # end for

    for process in processes:
        process.start()
    # end for

    for process in processes:
        process.join()
    # end for

    while not multiProcQueue.empty():
        result_list.append(multiProcQueue.get())
    # end while

    elapsedTime = time.time() - startTime

    print('result_list: ')
    pprint.pprint(result_list)
    print('\n' + 'program took ' + '{:.2f}'.format(elapsedTime) + ' seconds' + '\n')
# end function

def squareNum(num: float, multiProcQueue: Queue) -> None:
    time.sleep(1.0)
    result = math.pow(num, 2)
    multiProcQueue.put(result)
# end function

if __name__ == '__main__':
    main()

This script runs in about 1 second. To my knowledge this is the cleanest way of having multiple processes write results in parallel to the same data structure. I wish the documentation https://docs.python.org/3/library/multiprocessing.html had an example like this.

Note that the order of the result list will usually not match the order of the input list; a different approach would be needed if order had to be maintained.
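If order did matter, one option (a sketch of my own, not part of the answer above) is to put (index, result) pairs on the queue and sort by the index tag once everything has been collected:

from multiprocessing import Process, Queue
import math
import time

def squareNum(index: int, num: float, multiProcQueue: Queue) -> None:
    time.sleep(1.0)
    multiProcQueue.put((index, math.pow(num, 2)))  # tag each result with its input position

if __name__ == '__main__':
    my_list = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
    multiProcQueue = Queue()

    processes = [Process(target=squareNum, args=(i, num, multiProcQueue))
                 for i, num in enumerate(my_list)]
    for process in processes:
        process.start()

    # Drain exactly one result per input before joining, then restore input order.
    pairs = [multiProcQueue.get() for _ in my_list]

    for process in processes:
        process.join()

    result_list = [value for _, value in sorted(pairs)]
    print(result_list)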
