
Multiprocessing of nested loops in python

I need to write some files that take inputs from two different and very large lists. The following python code works, but due to the size of the lists and the other variables involved it takes a long time to run:

from Bio import SeqIO

# For each sequence in ugFA, write it to its locus file,
# followed by any contigs whose id matches.
for n, seq in enumerate(ugFA):
    with open("locusFASTAs/" + loci[n], 'a') as outFA:
        SeqIO.write(ugSeqs[seq.id], outFA, 'fasta')
        for m, i in enumerate(wantedContigs):
            if f[m].id == seq.id:
                SeqIO.write(MergeSeqs[i], outFA, 'fasta')

Data structures in the above code:

  • ugFA is a list
  • loci is a list
  • ugSeqs is a dictionary
  • wantedContigs is a list
  • f is a list
  • MergeSeqs is a dictionary

I have attempted to parallelise the code using multiprocessing. The following code does the job, but (i) doesn't run any quicker, (ii) doesn't seem to use more than 100% CPU, and (iii) spits out the error message shown below when finished, even though it completes the tasks in the loop:

import multiprocessing

def extractContigs(ugFA, loci, ugSeqs, wantedContigs, f, MergeSeqs):
    from Bio import SeqIO
    for n, seq in enumerate(ugFA):
        with open("locusFASTAs/" + loci[n], 'a') as outFA:
            SeqIO.write(ugSeqs[seq.id], outFA, 'fasta')
            for m, i in enumerate(wantedContigs):
                if f[m].id == seq.id:
                    SeqIO.write(MergeSeqs[i], outFA, 'fasta')

pool = multiprocessing.Pool(processes=p)
r = pool.map(extractContigs(ugFA, loci, ugSeqs, wantedContigs, MergeSeqs))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: map() takes at least 3 arguments (2 given)

Is there something I have done wrong in the construction of my code? How can I properly construct it to fully utilise the expediency of the multiprocessing module?

The problem is

r = pool.map(extractContigs(ugFA, loci, ugSeqs, wantedContigs, MergeSeqs))

is calling the function extractContigs (in the main thread, hence the 100% CPU) and then passing the results as an argument to pool.map. The correct signature is

pool.map(func, iterable)
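
For example, a trivially correct call looks like this (a toy illustration with a hypothetical square function, not from the original post):

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    print(pool.map(square, [1, 2, 3]))   # prints [1, 4, 9]
    pool.close()
    pool.join()

pool.map pickles each item of the iterable, ships it to a worker process, and collects the return values in order.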

For this to work in your case, you would need to rewrite the extractContigs function to take only one argument. From the looks of it, you'd need to significantly refactor your code to do this. Writing to the same file simultaneously might be a concern.
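
One way to get down to a single argument, for example, is to bind the shared data with functools.partial (a minimal sketch, assuming Python 3, where partial objects can be pickled; writeOneLocus is an illustrative name, not from the original code, and p is your process count):

from functools import partial
import multiprocessing

from Bio import SeqIO

def writeOneLocus(n, ugFA, loci, ugSeqs, wantedContigs, f, MergeSeqs):
    # Handle a single locus; n indexes into ugFA and loci.
    seq = ugFA[n]
    with open("locusFASTAs/" + loci[n], 'a') as outFA:
        SeqIO.write(ugSeqs[seq.id], outFA, 'fasta')
        for m, i in enumerate(wantedContigs):
            if f[m].id == seq.id:
                SeqIO.write(MergeSeqs[i], outFA, 'fasta')

if __name__ == '__main__':
    # partial binds the shared data, so the mapped callable takes one argument
    task = partial(writeOneLocus, ugFA=ugFA, loci=loci, ugSeqs=ugSeqs,
                   wantedContigs=wantedContigs, f=f, MergeSeqs=MergeSeqs)
    pool = multiprocessing.Pool(processes=p)
    pool.map(task, range(len(ugFA)))
    pool.close()
    pool.join()

Note that the bound data is pickled and sent to the workers, which can itself be slow for very large lists; the globals-based rewrite below avoids that on platforms that fork.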

Your previous version could be appropriately modified:

def writeLocus(n):
    # The worker takes a single index; ugFA, loci, ugSeqs, wantedContigs,
    # f and MergeSeqs are read from globals inherited by the worker processes.
    seq = ugFA[n]
    with open("locusFASTAs/" + loci[n], 'a') as outFA:
        SeqIO.write(ugSeqs[seq.id], outFA, 'fasta')
        for m, i in enumerate(wantedContigs):
            if f[m].id == seq.id:
                SeqIO.write(MergeSeqs[i], outFA, 'fasta')

pool.map(writeLocus, range(len(ugFA)))

Please make sure to verify that the output is not getting garbled by parallel writes to the same files. Ideally, it would be best to have each worker write to its own file(s) and merge them afterward, as sketched below.
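
A sketch of that per-worker pattern, reusing the globals approach above (the writeLocusPart helper, the .part naming scheme, and mergeParts are illustrative, not part of the original answer):

import glob
import os

from Bio import SeqIO

def writeLocusPart(n):
    # Each task writes its own .part file, so no two workers ever
    # append to the same file at the same time.
    seq = ugFA[n]
    part = "locusFASTAs/%s.part%d" % (loci[n], n)
    with open(part, 'w') as outFA:
        SeqIO.write(ugSeqs[seq.id], outFA, 'fasta')
        for m, i in enumerate(wantedContigs):
            if f[m].id == seq.id:
                SeqIO.write(MergeSeqs[i], outFA, 'fasta')

def mergeParts():
    # Once the pool has finished, concatenate the .part files for each
    # locus into the final per-locus FASTA and delete the pieces.
    for locus in set(loci):
        with open("locusFASTAs/" + locus, 'a') as merged:
            for part in sorted(glob.glob("locusFASTAs/" + locus + ".part*")):
                with open(part) as chunk:
                    merged.write(chunk.read())
                os.remove(part)

After pool.map(writeLocusPart, range(len(ugFA))) returns and the pool has been closed and joined, a single call to mergeParts() in the parent process collects the same records into each per-locus file (sort the part numbers numerically if the original record order matters).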
