
Where to write parallelized program output to?

I have a program that is using pool.map() to get the values using ten parallel workers. I'm having trouble wrapping my head around how I am supposed to stitch the values back together to make use of them at the end.

What I have is structured like this:

initial_input = get_initial_values()
pool.map(function, initial_input)
pool.close()
pool.join()

# now how would I get the output?
send_ftp_of_output(output_data)

Would I write the function's output to a log file? If so, if there are (as a hypothetical) a million processes trying to write to the same file, would things overwrite each other?

pool.map(function,input) 

returns a list.

You can get the output by doing:

output_data = pool.map(function,input) 

pool.map simply runs the mapped function in parallel, but it still returns a single list of results. As long as the function you are mapping returns its result (rather than writing it out as a side effect, which you shouldn't), you get back a list, just as the built-in map() would give you, except the work is executed in parallel.
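Here is a minimal sketch of how that fits the structure in the question; the worker function, the input range, and the final print are stand-ins for your real code:

from multiprocessing import Pool

def function(x):
    # stand-in for the real work; return the result instead of writing it out
    return x * x

if __name__ == '__main__':
    initial_input = range(100)          # stand-in for get_initial_values()
    pool = Pool(10)                     # ten parallel workers
    output_data = pool.map(function, initial_input)  # a single list, in input order
    pool.close()
    pool.join()
    print(output_data)                  # now you can call send_ftp_of_output(output_data)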

In regards to the log file: yes, having multiple threads write to the same place would interleave entries within the log file. You could have each thread lock the file before writing, which would ensure that an entry isn't interrupted mid-write, but the entries would still be interleaved chronologically amongst all the threads. Locking the log file for each write would also significantly slow down logging due to the overhead involved.
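If you do want each worker to append to a shared file, the lock has to be shared with the workers. A minimal sketch, assuming a multiprocessing Pool where the lock is handed to each worker via an initializer (the file name and the "work" are placeholders):

from multiprocessing import Pool, Lock

lock = None

def init_worker(shared_lock):
    # runs once in each worker process to receive the shared lock
    global lock
    lock = shared_lock

def function(item):
    result = item * 2      # placeholder work
    with lock:             # serialize writes so no entry is cut off mid-line
        with open('progress.log', 'a') as f:
            f.write('processed %s -> %s\n' % (item, result))
    return result

if __name__ == '__main__':
    shared_lock = Lock()
    pool = Pool(10, initializer=init_worker, initargs=(shared_lock,))
    results = pool.map(function, range(100))
    pool.close()
    pool.join()

Entries from different workers are still interleaved in time; the lock only guarantees that each individual entry is written whole.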

You can also include, say, the thread number -- %(thread)d -- or some other identifying mark in the logging Formatter output to help differentiate the entries, but it can still be hard to follow, especially with a large number of threads.
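As a rough illustration, the identifying fields go straight into the Formatter string; %(thread)d, %(threadName)s, %(process)d and %(processName)s are all standard LogRecord attributes (the file name here is a placeholder):

import logging

logging.basicConfig(
    filename='workers.log',
    level=logging.INFO,
    # identify who wrote each line; use thread or process fields as appropriate
    format='%(asctime)s [pid=%(process)d %(threadName)s] %(message)s',
)

def function(item):
    logging.info('starting work on %r', item)
    # ... actual work ...
    logging.info('finished %r', item)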

Not sure whether this would work in your specific application, as its specifics may preclude it; however, I would strongly recommend considering GNU Parallel ( http://www.gnu.org/software/parallel/ ) to do the parallelized work. (You can call into it with, say, subprocess.check_output.)

The benefits of this are severalfold. Chief among them: you can easily vary the number of parallel workers -- up to one worker per core on the machine -- and Parallel will pipeline the items accordingly. The other main benefit, and the one more directly related to your question, is that it will stitch the output of all of these parallel workers together as if they had been invoked serially.

If your program doesn't lend itself to, say, piping a single command line from a file and parallelizing it, you could make your Python code single-worker, then generate the commands piped to parallel as a set of permutations of your Python command line, varying the target each time, and have Parallel collect the output.

I use GNU Parallel quite often in conjunction with Python, often to do things like running, say, 6 simultaneous Postgres queries using psql over a list of 50 items.
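A sketch of the shape of such a call, assuming GNU Parallel is installed; worker.py and the item list are placeholders for your own command and inputs:

import subprocess

items = ['a.txt', 'b.txt', 'c.txt']   # placeholder work items

# Feed one item per line to `parallel`, which runs up to 6 jobs at a time
# and stitches their stdout together as if they had been run serially.
output = subprocess.check_output(
    ['parallel', '-j', '6', 'python', 'worker.py', '{}'],
    input='\n'.join(items).encode(),
)
print(output.decode())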

Using Tritlo's suggestion, here is what worked for me:

from multiprocessing import Pool

NUM_IN_PARALLEL = 10  # number of worker processes

def run_updates(input_data):
    # do something with input_data
    return {data}

if __name__ == '__main__':

    item = iTunes()
    item.fetch_itunes_pulldowns_to_do()
    initial_input_data = item.fetched_update_info

    pool = Pool(NUM_IN_PARALLEL)
    result = pool.map(run_updates, initial_input_data)  # list of per-item return values
    pool.close()
    pool.join()
    print(result)

And this gives me a list of results
