
python generator: unpack entire generator in parallel

Suppose I have a generator whose __next__() function is somewhat expensive and I want to try to parallelize the calls. Where do I throw in the parallelization?

To be slightly more concrete, consider this example:

# fast, splitting a file for example
raw_blocks = (b for b in block_generator(fin))
# slow, reading blocks, checking values ...
parsed_blocks = (block_parser(b) for b in raw_blocks)
# get all parsed blocks into a data structure
data = parsedBlocksToOrderedDict(parsed_blocks)

The most basic thing is to change the 2nd line to something that does the parallelization. Is there some generator magic that allows one to unpack the generator (on the 3rd line) in parallel? Calling __next__() in parallel?

No. You must call next() sequentially because any non-trivial generator's next state is determined by its current state.

def gen(num):
    # j is state carried across yields: each value depends on
    # everything yielded before it
    j = 0
    for i in range(num):
        j += i
        yield j

There's no way to parallelize calls to the above generator without knowing its state at each point it yields a value. But if you knew that, you wouldn't need to run it.
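Running that generator makes the dependency visible: each value is the running sum of everything yielded before it, so the calls cannot be reordered or computed independently. A minimal self-contained demo:

```python
def gen(num):
    # j is mutable state carried across yields: the i-th value
    # depends on all previous iterations, so next() calls must
    # happen strictly in sequence
    j = 0
    for i in range(num):
        j += i
        yield j

print(list(gen(5)))  # [0, 1, 3, 6, 10]
```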

Assuming the calls to block_parser(b) are what you want to perform in parallel, you could try using a multiprocessing.Pool:

import multiprocessing as mp

pool = mp.Pool()

raw_blocks = block_generator(fin)
parsed_blocks = pool.imap(block_parser, raw_blocks)
data = parsedBlocksToOrderedDict(parsed_blocks)

Note that:

  • If you expect that list(parsed_blocks) can fit entirely in memory, then using pool.map can be much faster than pool.imap.
  • The items in raw_blocks and the return values from block_parser must be picklable, since mp.Pool transfers tasks and results through a mp.Queue.
