
python - multiprocessing is slower than sequential

This is my first multiprocessing implementation. When I run my code sequentially, it takes around 30 seconds to process 20 records. I then created a dictionary in which each key holds a set of records, and tried to apply the function to every key using pool.map. Now it takes more than 2 minutes, even though I am assigning one core to each process. Could someone help me optimize this?

import itertools
from multiprocessing import Pool

def f(values):
    # every unordered pair of records
    data1 = itertools.combinations(values,2)
    tuple_attr =('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country', 'Probability', 'Id')
    # for each pair, the names of the attributes on which the two records differ
    new = ((tuple_attr[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]) for pair in data1)
    skt = set(frozenset(temp) for temp in new)
    # keep only the minimal sets (those with no proper subset in skt)
    newset = set(s for s in skt if not any(p < s for p in skt))

    empty = frozenset(" ")
    tr_x = set(frozenset(i) for i in empty)
    tr = set(frozenset(i) for i in empty)
    for e in newset:
        tr.clear()
        tr = tr.union(tr_x)
        tr_x.clear()
        for x in tr:
            for a in e:
                if x == empty:
                    tmp = frozenset([a])
                    tr_x = tr_x.union([tmp])
                else:
                    tmp = frozenset([a]).union(x)
                    tr_x = tr_x.union([tmp])
        tr.clear()
        tr = tr.union(tr_x)
        tr = set(l for l in tr if not any(m < l for m in tr))

    return tr

def main(data):
    p = Pool(len(data))  # one worker process per dictionary key
    keys, values = zip(*data.items())  # keys and values in matching order
    processed_values = p.map(f, values)
    result = dict(zip(keys, processed_values))
    p.close()  # no more tasks
    p.join()   # wait for the workers to finish
    print(result)


if __name__ == '__main__':
    import csv
    dicchunk = {*****}  # my dictionary
    main(dicchunk)

I created a test program that runs this once with multiprocessing and once without:

import time
from multiprocessing import Pool

def main(data):
    p = Pool(len(data))  # one worker process per dictionary key
    keys, values = zip(*data.items())  # keys and values in matching order
    start = time.time()
    processed_values = p.map(f, values)
    result = dict(zip(keys, processed_values))
    print("multi: {}".format(time.time() - start))
    p.close()  # no more tasks
    p.join()   # wait for the workers to finish

    start = time.time()
    processed_values = map(f, values)
    result2 = dict(zip(keys, processed_values))
    print("non-multi: {}".format(time.time() - start))
    assert result == result2

Here's the output:

multi: 191.249588966
non-multi: 225.774535179

Multiprocessing is faster, but not by as much as you might expect. The reason is that some of the sub-lists take much longer (by several minutes) to process than others. With one worker per sub-list, the whole run can never be faster than the time it takes to process the slowest sub-list.
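
To make that bound concrete, here is a minimal sketch with made-up task durations. The numbers and the helper name `parallel_lower_bound` are illustrative assumptions, not measurements from this program, and the sketch ignores process start-up and data-transfer overhead:

```python
def parallel_lower_bound(durations):
    """Return (sequential_time, best_parallel_time) for one worker per task.

    Sequential time is the sum of all task durations. With one worker per
    task, the run cannot finish before the longest task does.
    """
    sequential = sum(durations)
    best_parallel = max(durations)
    return sequential, best_parallel


# Hypothetical durations in seconds: three short tasks and one long one.
durations = [1, 2, 40, 200]
sequential, best_parallel = parallel_lower_bound(durations)
print("sequential: {}s, best parallel: {}s".format(sequential, best_parallel))
print("largest possible saving: {}s".format(sequential - best_parallel))
```

However many workers you add, the saving here is capped by the three short tasks, because the long task still has to run to completion on a single worker.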

I added some tracing to the worker function to demonstrate this. I saved the time at the start of the worker, and then printed the elapsed time at the end. Here's the output:

<Process(PoolWorker-4, started daemon)> is done. Took 0.940237998962 seconds
<Process(PoolWorker-2, started daemon)> is done. Took 1.28068685532 seconds
<Process(PoolWorker-1, started daemon)> is done. Took 42.9250118732 seconds
<Process(PoolWorker-3, started daemon)> is done. Took 193.635578156 seconds

As you can see, the workers are doing very unequal amounts of work, so the most you can save over running sequentially is roughly the combined time of the three shorter workers, about 45 seconds.
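
The tracing itself is not shown here, so the following is only a sketch of how it might look, not the exact code that was used. `work` is a hypothetical stand-in for the real worker function `f`, and `traced_work` and the sample `data` are likewise assumptions made for illustration:

```python
import time
from multiprocessing import Pool, current_process


def work(values):
    # Hypothetical placeholder for the real computation (f in the question).
    return sum(values)


def traced_work(values):
    # Record the start time, run the real work, then report the elapsed time
    # together with the process that handled this task.
    start = time.time()
    result = work(values)
    print("{} is done. Took {} seconds".format(current_process(), time.time() - start))
    return result


if __name__ == '__main__':
    # Hypothetical input: two keys with very different amounts of work.
    data = {'small': [1, 2, 3], 'large': list(range(100000))}
    keys, values = zip(*data.items())
    pool = Pool(len(data))
    results = pool.map(traced_work, values)
    pool.close()
    pool.join()
    print(dict(zip(keys, results)))
```

Passing the traced wrapper to `pool.map` instead of the original function leaves the results unchanged and prints one line per task. The exact text printed for the process object differs between Python versions.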
