python - multiprocessing is slower than sequential
This is my first multiprocessing implementation. I executed my code sequentially, and it took around 30 seconds to process 20 records. Then I created a dictionary with each key holding a set of records, and tried to apply the function using pool.map for every key. Now it is taking more than 2 minutes to process, even though I am assigning one core to each process. Could someone help me optimize this?
import itertools
from multiprocessing import Pool

def f(values):
    data1 = itertools.combinations(values, 2)
    tuple_attr = ('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country', 'Probability', 'Id')
    new = ((tuple_attr[i] for i, t in enumerate(zip(*pair)) if t[0] != t[1]) for pair in data1)
    skt = set(frozenset(temp) for temp in new)
    newset = set(s for s in skt if not any(p < s for p in skt))

    empty = frozenset(" ")
    tr_x = set(frozenset(i) for i in empty)
    tr = set(frozenset(i) for i in empty)
    for e in newset:
        tr.clear()
        tr = tr.union(tr_x)
        tr_x.clear()
        for x in tr:
            for a in e:
                if x == empty:
                    tmp = frozenset(frozenset([a]))
                    tr_x = tr_x.union([tmp])
                else:
                    tmp = frozenset(frozenset([a]).union(x))
                    tr_x = tr_x.union([tmp])
        tr.clear()
        tr = tr.union(tr_x)
        tr = set(l for l in tr if not any(m < l for m in tr))
    return tr

def main():
    p = Pool(len(data))  # number of processes = number of CPUs
    keys, values = zip(*data.items())  # ordered keys and values
    processed_values = p.map(f, values)
    result = dict(zip(keys, processed_values))
    p.close()  # no more tasks
    p.join()   # wrap up current tasks
    print(result)

if __name__ == '__main__':
    import csv
    dicchunk = {*****}  # my dictionary
    main()
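One thing worth checking: `Pool(len(data))` starts one worker per dictionary key, which oversubscribes the CPUs whenever there are more keys than cores, despite what the comment says. A minimal sketch of sizing the pool to the core count instead (the worker `square` here is a hypothetical stand-in for `f`):

```python
from multiprocessing import Pool, cpu_count

def square(values):
    # stand-in for the real worker f
    return [v * v for v in values]

def run(data):
    # cpu_count() workers: extra keys wait in the queue
    # instead of spawning extra processes that fight for cores
    with Pool(processes=cpu_count()) as p:
        keys, values = zip(*data.items())
        processed = p.map(square, values)
    return dict(zip(keys, processed))

if __name__ == '__main__':
    print(run({'a': [1, 2], 'b': [3, 4]}))
```

`Pool()` with no argument defaults to `cpu_count()` anyway, so the explicit argument is only needed when you want fewer workers.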
I created a test program to run this once with multiprocessing, and once without:
def main(data):
    p = Pool(len(data))  # number of processes = number of CPUs
    keys, values = zip(*data.items())  # ordered keys and values

    start = time.time()
    processed_values = p.map(f, values)
    result = dict(zip(keys, processed_values))
    print("multi: {}".format(time.time() - start))
    p.close()  # no more tasks
    p.join()   # wrap up current tasks

    start = time.time()
    processed_values = map(f, values)
    result2 = dict(zip(keys, processed_values))
    print("non-multi: {}".format(time.time() - start))
    assert(result == result2)
Here's the output:
multi: 191.249588966
non-multi: 225.774535179
multiprocessing is faster, but not by as much as you might expect. The reason is that some of the sub-lists take much longer (several minutes) to finish than others. You'll never be faster than however long it takes to process the largest sub-list.
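One common way around this imbalance (a sketch, not the poster's code) is to split the work into many small tasks and pass `chunksize=1` to `Pool.map`, so a worker that finishes early immediately picks up the next task instead of sitting idle while one worker grinds through a huge sub-list:

```python
import time
from multiprocessing import Pool

def slow_task(n):
    # simulate uneven work: cost grows with n
    time.sleep(0.01 * n)
    return n * n

if __name__ == '__main__':
    tasks = list(range(20))
    with Pool(4) as p:
        # chunksize=1: each worker grabs one task at a time,
        # so no single worker is stuck with all the expensive tasks
        results = p.map(slow_task, tasks, chunksize=1)
    print(results)
```

This only helps if the per-key work can actually be broken into smaller independent pieces; with one monolithic task per key, the slowest key still sets the floor.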
I added some tracing to the worker function to demonstrate this. I saved the time at the start of the worker and printed it out at the end. Here's the output:
<Process(PoolWorker-4, started daemon)> is done. Took 0.940237998962 seconds
<Process(PoolWorker-2, started daemon)> is done. Took 1.28068685532 seconds
<Process(PoolWorker-1, started daemon)> is done. Took 42.9250118732 seconds
<Process(PoolWorker-3, started daemon)> is done. Took 193.635578156 seconds
As you can see, the workers are doing very unequal amounts of work, so you're only saving about 34 seconds vs being sequential.
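For reference, per-task timing like the trace above can be captured with a small wrapper around the worker (a sketch; the original tracing code was not shown, and `traced` here wraps a stand-in worker rather than the real `f`):

```python
import time
from multiprocessing import Pool, current_process

def traced(values):
    start = time.time()
    result = sum(values)  # stand-in for the real worker f
    # current_process() identifies which pool worker ran this task
    print("{} is done. Took {:.6f} seconds".format(
        current_process(), time.time() - start))
    return result

if __name__ == '__main__':
    with Pool(2) as p:
        print(p.map(traced, [[1, 2], [3, 4]]))
```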