
Single process code performs faster than Multiprocessing - MCVE

My attempt to speed up one of my applications with multiprocessing resulted in lower performance. I am sure it is a design flaw, but that is the point of this discussion: how should I approach this problem so that it actually benefits from multiprocessing?

My current results on a 1.4 GHz Atom:

  1. SP Version = 19 seconds
  2. MP Version = 24 seconds

Both versions of the code can be copied and pasted for review. The dataset is at the bottom and can be pasted as well. (I decided against using xrange in order to illustrate the problem.)

First the SP version:

*PASTE DATA HERE*    

def calc():
    # Walk every combination of the seven lists and sum the two numeric columns
    for valD1 in D1:
        for valD2 in D2:
            for valD3 in D3:
                for valD4 in D4:
                    for valD5 in D5:
                        for valD6 in D6:
                            for valD7 in D7:
                                sol1 = float(valD1[1]+valD2[1]+valD3[1]+valD4[1]+valD5[1]+valD6[1]+valD7[1])
                                sol2 = float(valD1[2]+valD2[2]+valD3[2]+valD4[2]+valD5[2]+valD6[2]+valD7[2])
    return None

print(calc())

Now the MP version:

import multiprocessing
import itertools

*PASTE DATA HERE*

def calculate(vals):
    # vals is one tuple from itertools.product: seven [col1, col2] pairs
    valD1, valD2, valD3, valD4, valD5, valD6, valD7 = vals
    sol1 = float(valD1[0]+valD2[0]+valD3[0]+valD4[0]+valD5[0]+valD6[0]+valD7[0])
    sol2 = float(valD1[1]+valD2[1]+valD3[1]+valD4[1]+valD5[1]+valD6[1]+valD7[1])
    return None

def process():
    pool = multiprocessing.Pool(processes=4)
    prod = itertools.product(([x[1],x[2]] for x in D1),
                             ([x[1],x[2]] for x in D2),
                             ([x[1],x[2]] for x in D3),
                             ([x[1],x[2]] for x in D4),
                             ([x[1],x[2]] for x in D5),
                             ([x[1],x[2]] for x in D6),
                             ([x[1],x[2]] for x in D7))
    result = pool.imap(calculate, prod, chunksize=2500)
    pool.close()
    pool.join()
    return result

if __name__ == "__main__":    
    print(process())

And the data for both:

D1 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D2 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D3 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D4 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D5 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D6 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D7 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]

And now the theory:

Since there is very little actual work per item (just summing 7 ints), the interprocess communication needed to ship every combination to a worker creates too much overhead for multiprocessing to be effective. This seems like a situation where I really need the ability to multithread, so at this point I am looking for suggestions before I try this in a different language because of the GIL.
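
To make that overhead argument concrete, here is a rough, self-contained benchmark sketch (my own illustration, not code from the post): it times the same trivial per-item work done in-process versus shipped through a pool. The pooled version is usually slower because pickling the items back and forth costs more than the additions themselves.

import time
import multiprocessing

def tiny(pair):
    # Trivially small amount of work per item, mirroring the sums above
    return pair[0] + pair[1]

if __name__ == "__main__":
    data = [(i, i + 1) for i in range(200000)]

    start = time.time()
    results_local = [tiny(p) for p in data]
    print("in-process loop: %s" % (time.time() - start))

    pool = multiprocessing.Pool(processes=4)
    start = time.time()
    results_pool = pool.map(tiny, data, chunksize=1000)
    pool.close()
    pool.join()
    print("pool.map:        %s" % (time.time() - start))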

********Debugging

File "calc.py", line 309, in <module>
    smart_calc()
  File "calc.py", line 290, in smart_calc
    results = pool.map(func, chunk_list)
  File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
TypeError: sequence index must be integer, not 'slice'

In this case, total_len = 108 and CHUNKS is set to 2. When CHUNKS is reduced to 1, it works.

OK, I think I've figured out how to actually get a speed boost from multiprocessing. Since your actual source lists aren't very long, it's reasonable to pass them in their entirety to the worker processes. So, if each worker process has copies of the same source lists, ideally we'd want all of them to iterate over different pieces of the lists in parallel, and just sum up that unique slice. Because we know the size of the input lists, we can accurately determine how long itertools.product(D1, D2, ...) will be (for the sample data, each list has 11 entries, so the product yields 11**7 = 19,487,171 combinations), which means we can also accurately determine how big each chunk should be to evenly distribute the work. So, we can provide each worker with the specific range of the itertools.product iterator that it should iterate over and sum:

import math
import itertools
import multiprocessing
import functools

def smart_calc(valD1, valD2, valD3, valD4, valD5, valD6, valD7, slices):
    # Build an iterator over the entire data set
    prod = itertools.product(([x[1],x[2]] for x in valD1), 
                             ([x[1],x[2]] for x in valD2), 
                             ([x[1],x[2]] for x in valD3), 
                             ([x[1],x[2]] for x in valD4), 
                             ([x[1],x[2]] for x in valD5), 
                             ([x[1],x[2]] for x in valD6), 
                             ([x[1],x[2]] for x in valD7))

    # But only iterate over our unique slice
    for subD1, subD2, subD3, subD4, subD5, subD6, subD7 in itertools.islice(prod, slices[0], slices[1]):
        sol1=float(subD1[0]+subD2[0]+subD3[0]+subD4[0]+subD5[0]+subD6[0]+subD7[0])
        sol2=float(subD1[1]+subD2[1]+subD3[1]+subD4[1]+subD5[1]+subD6[1]+subD7[1])
    return None

def smart_process():
    CHUNKS = multiprocessing.cpu_count()  # Number of pieces to break the list into.
    total_len = len(D1) ** 7  # The total length of itertools.product()
    # Figure out how big each chunk should be. Got this from 
    # multiprocessing.map()
    chunksize, extra = divmod(total_len, CHUNKS)
    if extra:
        chunksize += 1

    # Build a list that has the low index and high index for each
    # slice of the list. Each process will iterate over a unique
    # slice
    low = 0 
    high = chunksize
    chunk_list = []
    for _ in range(CHUNKS):
        chunk_list.append((low, high))
        low += chunksize
        high += chunksize

    pool = multiprocessing.Pool(processes=CHUNKS)
    # Use partial so we can pass all the lists to each worker
    # while using map (which only allows one arg to be passed)
    func = functools.partial(smart_calc, D1, D2, D3, D4, D5, D6, D7) 
    result = pool.map(func, chunk_list)
    pool.close()
    pool.join()
    return result
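
(A minimal driver along these lines, assuming the sequential calc() from the question and smart_process() are defined in the same module, would produce timings like those below; the exact harness isn't shown in the answer.)

import time

if __name__ == "__main__":
    start = time.time()
    calc()            # single-process version from the question
    print("sequential: %s" % (time.time() - start))

    start = time.time()
    smart_process()
    print("mp: %s" % (time.time() - start))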

Results:

sequential: 13.9547419548
mp: 4.0270690918

Success! Now, you do have to actually combine the results after you have them, which will add additional overhead to your real program. It might end up making this approach slower than sequential again, but it really depends on what you actually want to do with the data.
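
As a sketch of what that combining step might look like (my own illustration, assuming for the sake of example that you want the largest sol1 across all combinations), each worker can return a partial result for its slice and the parent process can reduce them:

import itertools

def smart_calc_best(valD1, valD2, valD3, valD4, valD5, valD6, valD7, slices):
    # Same slicing scheme as smart_calc(), but return a partial result
    # (here: the largest sol1 seen in this worker's slice).
    prod = itertools.product(([x[1], x[2]] for x in valD1),
                             ([x[1], x[2]] for x in valD2),
                             ([x[1], x[2]] for x in valD3),
                             ([x[1], x[2]] for x in valD4),
                             ([x[1], x[2]] for x in valD5),
                             ([x[1], x[2]] for x in valD6),
                             ([x[1], x[2]] for x in valD7))
    best = None
    for combo in itertools.islice(prod, slices[0], slices[1]):
        sol1 = float(sum(pair[0] for pair in combo))
        if best is None or sol1 > best:
            best = sol1
    return best

# In smart_process(), swap smart_calc for smart_calc_best and reduce:
#   results = pool.map(func, chunk_list)
#   overall_best = max(r for r in results if r is not None)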
