
Python multiprocessing: big data puts processes to sleep

I'm using Python 2.7.10. I read lots of files, store their contents in a big list, then call multiprocessing and pass the big list to the worker processes so that each one can access it and do some calculation.

I'm using Pool like this:

def read_match_wrapper(args):
    args2 = args[0] + (args[1],)
    read_match(*args2)

pool = multiprocessing.Pool(processes=10)
result = pool.map(read_match_wrapper,
                  itertools.izip(itertools.repeat((ped_list, chr_map, combined_id_to_id, chr)),
                                 range(10)))
pool.close()
pool.join()

Basically, I'm passing multiple variables to the 'read_match' function. In order to use pool.map, I wrote the 'read_match_wrapper' function. I don't need any results back from those processes; I just want them to run and finish.

The whole thing works when my data list 'ped_list' is quite small. When I load all the data, around 10 GB, all the worker processes it spawns show state 'S' (sleeping) and don't seem to be doing any work at all.

I don't know if there is a limit on how much data you can pass through a Pool. I really need help on this! Thanks!

From the multiprocessing Programming guidelines:

Avoid shared state

 As far as possible one should try to avoid shifting large amounts of data between processes. 

What you suffer from is a typical symptom of a full Pipe which does not get drained.

The multiprocessing.Pipe used by the Pool has a design flaw: it implements a message-oriented protocol on top of an OS pipe, which behaves more like a stream object.

The result is that if you send an object that is too large through the Pipe, the pipe gets clogged. The sender can't finish writing the message, and the receiver can't drain it, as it's blocked waiting for the end of the message.

The proof is that your workers are sleeping, waiting for that "fat" message which never arrives.
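To make the symptom concrete, here is a minimal, self-contained sketch (not the Pool's actual internals, and all names are made up for the example) of a process going to sleep inside send() because the other end never drains the pipe:

import multiprocessing

def sender(conn):
    payload = 'x' * (100 * 1024 * 1024)  # ~100 MB, far bigger than the OS pipe buffer
    conn.send(payload)                   # blocks once the pipe buffer is full
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    child = multiprocessing.Process(target=sender, args=(child_conn,))
    child.start()
    # The parent never calls parent_conn.recv(), so the child stays asleep
    # ('S' in ps/top) inside send(), its "fat" message never fully delivered.
    child.join(5)
    print('child still alive: %s' % child.is_alive())  # True: still blocked in send()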

Does ped_list contain the file names or the file contents?

If it's the latter, you should send the file names instead of the contents. The workers can read the contents themselves with a simple open().
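For example, a minimal sketch of that idea, with made-up names (process_file and the file paths are hypothetical, not the question's actual read_match or data): only short path strings are pickled through the Pool's pipe, and each worker opens its own file.

import multiprocessing

def process_file(path):
    # Hypothetical per-file worker: it reads its own input instead of
    # receiving the file contents through the Pool's internal pipe.
    with open(path) as handle:
        contents = handle.read()
    # ... do the per-file calculation on `contents` here ...

if __name__ == '__main__':
    ped_files = ['chr1.ped', 'chr2.ped', 'chr3.ped']  # assumed list of input paths
    pool = multiprocessing.Pool(processes=10)
    pool.map(process_file, ped_files)  # only the short strings travel to the workers
    pool.close()
    pool.join()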

Instead of working with pool.map I would rather use queues. You could spawn the desired number of processes and assign a queue for input:

n = 10  # number of worker processes
tasks = multiprocessing.Queue()

for i in range(n):  # spawn and start the workers
    multiprocessing.Process(target=read_match_wrapper, args=(tasks,)).start()
for element in ped_list:  # feed the queue
    tasks.put(element)

In this way, the queue is filled from one side and emptied from the other at the same time. You may need to put something in the queue before the processes are started: otherwise there is a chance the workers exit without doing anything because the queue is still empty, or raise a Queue.Empty exception. One way around that is a blocking get() with a sentinel value, as in the sketch below.
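Purely as an illustration, here is a minimal sketch of that pattern with a blocking get() and a sentinel so the workers know when to stop. The names read_match, ped_list, chr_map, combined_id_to_id and chr come from the question and are assumed to be defined there (on Linux the forked workers inherit them); SENTINEL and worker are made-up names, and how read_match consumes a single element is an assumption.

import multiprocessing

SENTINEL = None  # made-up marker telling each worker that no more work is coming

def worker(tasks):
    # Pull items until the sentinel arrives; Queue.get() blocks,
    # so an initially empty queue is not a problem for this loop.
    while True:
        element = tasks.get()
        if element is SENTINEL:
            break
        # Per-item work; the extra arguments come from the question's code.
        read_match(element, chr_map, combined_id_to_id, chr)

if __name__ == '__main__':
    n = 10
    tasks = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(tasks,)) for _ in range(n)]
    for p in workers:
        p.start()
    for element in ped_list:
        tasks.put(element)
    for _ in range(n):  # one sentinel per worker so every process stops
        tasks.put(SENTINEL)
    for p in workers:
        p.join()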
