
Python multiprocessing isn't using full CPU cores

I used regex to check the sequence records of paired-end FASTQ files and write the matched sequences into the same files. I used multiprocessing to speed it up, but when I ran it with 20 processes, all 20 CPU cores sat at about 2% usage and the total time was the same as running on a single core. Does this mean the regex search is faster than writing the output to file, so the processes were waiting? Can you suggest how I can improve the multiprocessing? The code is attached.


from multiprocessing import Pool
from itertools import izip   # Python 2; use zip() on Python 3
from Bio import SeqIO
import regex

# fp, rp, outfile1-outfile4 and result (presumably the parsed command-line
# arguments) are defined earlier in the script.

def mycallback(x):
    SeqIO.write(x[0], outfile1, result.form)
    SeqIO.write(x[1], outfile2, result.form)
    SeqIO.write(x[2], outfile3, result.form)
    SeqIO.write(x[3], outfile4, result.form)

def check(x):
    # Fuzzy-match the forward/reverse primers (up to result.mm errors)
    # within the first 20 bases of each read.
    if regex.search(r'^.{0,20}(?:'+fp+'){e<='+str(result.mm)+'}', str(x[0].seq), flags=regex.I) and regex.search(r'^.{0,20}(?:'+rp+'){e<='+str(result.mm)+'}', str(x[1].seq), flags=regex.I):
        return (x[0], x[1], '', '')
    elif regex.search(r'^.{0,20}(?:'+fp+'){e<='+str(result.mm)+'}', str(x[1].seq), flags=regex.I) and regex.search(r'^.{0,20}(?:'+rp+'){e<='+str(result.mm)+'}', str(x[0].seq), flags=regex.I):
        return (x[1], x[0], '', '')
    else:
        return ('', '', x[0], x[1])

p = Pool(int(result.n))
for i in izip(SeqIO.parse(result.fseq, result.form), SeqIO.parse(result.rseq, result.form)):
    p.apply_async(check, args=(i,), callback=mycallback)

p.close()
p.join()

Python's implementation of pool.apply_async calls the callback function in a thread inside the main process and is thus limited by the GIL. You are therefore waiting on all your file writes sequentially.

Callbacks should complete immediately, since otherwise the thread which handles the results will get blocked.

I would imagine your regex executes faster than file writing, so you would benefit the most from sending the callbacks to their own threads (so multiple file writes can be queued at once). Python threads should release the GIL when waiting on IO (file writes), and are much lighter (faster to start up) than processes.
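As a minimal sketch of that idea (not the poster's actual code), you could hand each result to a dedicated writer thread through a queue, so the Pool's callback only enqueues and returns immediately; SeqIO, outfile1-outfile4 and result.form are assumed to be defined as in the question:

import threading
try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2, matching the izip usage above

write_queue = queue.Queue()

def writer():
    # Do the file IO off the Pool's result-handling thread.
    while True:
        item = write_queue.get()
        if item is None:      # sentinel tells the writer to stop
            break
        SeqIO.write(item[0], outfile1, result.form)
        SeqIO.write(item[1], outfile2, result.form)
        SeqIO.write(item[2], outfile3, result.form)
        SeqIO.write(item[3], outfile4, result.form)

writer_thread = threading.Thread(target=writer)
writer_thread.start()

def mycallback(x):
    # Completes immediately; the actual (GIL-releasing) writes happen in writer().
    write_queue.put(x)

# ... submit jobs with p.apply_async(check, args=(i,), callback=mycallback) as before,
# then after p.close() and p.join():
write_queue.put(None)
writer_thread.join()

A single writer thread keeps the writes to each output handle serialized; a pool of writer threads could overlap more IO, but you would then need a lock per output file to avoid interleaved records.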
