This is my first multiprocessing implementation. I first ran my code sequentially, and it took around 30 seconds to process 20 records. Then I created a dictionary with each key holding a set of records, and tried to apply the function to every key's records with pool.map. Now it is taking more than 2 minutes to process, even though I am assigning one core to each process. Could someone help me optimize this?
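The dictionary looks roughly like this (a hypothetical sketch, since the real keys and values are not shown here): each key maps to a collection of records, and every record is a tuple with one entry per attribute in tuple_attr below.

dicchunk = {
    'chunk1': [
        # 16-field record tuples (values invented for illustration)
        (39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married',
         'Adm-clerical', 'Not-in-family', 'White', 'Male', 2174, 0, 40,
         'United-States', 0.05, 1),
        # ... more records for this key
    ],
    'chunk2': [
        # ... records handled by another worker
    ],
}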
import itertools
from multiprocessing import Pool

def f(values):
    # all unordered pairs of records for this key
    data1 = itertools.combinations(values, 2)
    tuple_attr = ('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country', 'Probability', 'Id')
    # for each pair, the names of the attributes where the two records differ
    new = ((tuple_attr[i] for i, t in enumerate(zip(*pair)) if t[0] != t[1]) for pair in data1)
    skt = set(frozenset(temp) for temp in new)
    # keep only the minimal difference sets
    newset = set(s for s in skt if not any(p < s for p in skt))
    empty = frozenset(" ")
    tr_x = set(frozenset(i) for i in empty)
    tr = set(frozenset(i) for i in empty)
    for e in newset:
        tr.clear()
        tr = tr.union(tr_x)
        tr_x.clear()
        for x in tr:
            for a in e:
                if x == empty:
                    tmp = frozenset(frozenset([a]))
                    tr_x = tr_x.union([tmp])
                else:
                    tmp = frozenset(frozenset([a]).union(x))
                    tr_x = tr_x.union([tmp])
    tr.clear()
    tr = tr.union(tr_x)
    tr = set(l for l in tr if not any(m < l for m in tr))
    return tr
def main(data):
    p = Pool(len(data)) # number of processes = number of CPUs
    keys, values = zip(*data.items()) # ordered keys and values
    processed_values = p.map(f, values)
    result = dict(zip(keys, processed_values))
    p.close() # no more tasks
    p.join()  # wrap up current tasks
    print(result)

if __name__ == '__main__':
    import csv
    dicchunk = {*****} # my dictionary
    main(dicchunk)
I created a test program to run this once with multiprocessing, and once without:
import time
from multiprocessing import Pool

def main(data):
    p = Pool(len(data)) # number of processes = number of CPUs
    keys, values = zip(*data.items()) # ordered keys and values

    start = time.time()
    processed_values = p.map(f, values)
    result = dict(zip(keys, processed_values))
    print("multi: {}".format(time.time() - start))
    p.close() # no more tasks
    p.join()  # wrap up current tasks

    start = time.time()
    processed_values = map(f, values)
    result2 = dict(zip(keys, processed_values))
    print("non-multi: {}".format(time.time() - start))

    assert(result == result2)
Here's the output:
multi: 191.249588966
non-multi: 225.774535179
multiprocessing is faster, but not by as much as you might expect. The reason is that some of the sub-lists take much longer (several minutes) to finish than others, and you'll never be faster than however long it takes to process the largest sub-list.
I added some tracing to the worker function to demonstrate this: I save the time at the start of the worker and print the elapsed time at the end.
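The tracing looked roughly like this (a minimal sketch; the wrapper name and the use of multiprocessing.current_process() are my reconstruction, not necessarily the exact code), with p.map(traced_f, values) in place of p.map(f, values):

import time
from multiprocessing import current_process

def traced_f(values):
    # wrap the real worker so each pool process reports how long its chunk took
    start = time.time()
    result = f(values)
    print("{} is done. Took {} seconds".format(current_process(), time.time() - start))
    return result

Here's the output: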
<Process(PoolWorker-4, started daemon)> is done. Took 0.940237998962 seconds
<Process(PoolWorker-2, started daemon)> is done. Took 1.28068685532 seconds
<Process(PoolWorker-1, started daemon)> is done. Took 42.9250118732 seconds
<Process(PoolWorker-3, started daemon)> is done. Took 193.635578156 seconds
As you can see, the workers are doing very unequal amounts of work, so you're only saving about 34 seconds (225.8 s vs. 191.2 s) versus running sequentially.
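This isn't part of the original answer, but one way to attack that imbalance, assuming the pairwise comparison over itertools.combinations(values, 2) is where the time goes, is to hand the pool much smaller tasks so every core stays busy even when the keys are uneven. A rough sketch (the helper names and the batch size are arbitrary):

import itertools
from multiprocessing import Pool, cpu_count

tuple_attr = ('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country', 'Probability', 'Id')

def diff_sets(pairs):
    # the first step of f() for one batch of record pairs:
    # which attributes differ within each pair
    return set(frozenset(tuple_attr[i] for i, t in enumerate(zip(*pair)) if t[0] != t[1])
               for pair in pairs)

def batches(iterable, size):
    # split an iterator into lists of at most `size` items
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

def parallel_diff_sets(pool, values, size=1000):
    # distribute the pairwise comparisons for ONE key across all workers
    # usage: pool = Pool(cpu_count()); skt = parallel_diff_sets(pool, records_for_one_key)
    skt = set()
    for part in pool.imap_unordered(diff_sets, batches(itertools.combinations(values, 2), size)):
        skt |= part
    return skt

The rest of f() (the minimality filter and the transversal construction) would then run on the merged skt in the parent process; whether that part also needs parallelising depends on where the time is actually spent, so it's worth profiling before restructuring.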