Python multiprocessing imap without a list comprehension?

Question

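I have parallelized my code using imap and the pyfastx library, but the problem is that the sequences are loaded with a list comprehension. When the input file is large this becomes a problem, because all the seq values are loaded into memory. Is there a way to do this without fully populating the list that is passed to imap?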

import pyfastx
import multiprocessing

def pSeq(seq):
  ...
  return (A1, A2, B)

pool = multiprocessing.Pool(5)
# The list comprehension materializes every sequence before imap starts.
seqs = [seq for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False)]
for (A1, A2, B) in pool.imap(pSeq, seqs, chunksize=100000):
  if A1 == A2 and A1 != B:
    matchedA[A1][B] += 1

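I also tried skipping the list comprehension and using the apply_async function, since pyfastx supports loading one sequence at a time. But because each individual iteration is very short and there is no chunksize argument, this ended up taking longer than not using multiprocessing at all.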

import pyfastx
import multiprocessing

def pSeq(seq):
  ...
  return (A1, A2, B)

pool = multiprocessing.Pool(5)
results = []
for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False):
  # args must be a tuple; a bare string would be unpacked character by character
  results.append(pool.apply_async(pSeq, (seq,)))
pool.close()  # close() has to be called before join()
pool.join()

for result in results:
  A1, A2, B = result.get()  # an AsyncResult is not subscriptable; get() returns the tuple
  if A1 == A2 and A1 != B:
    matchedA[A1][B] += 1

Any suggestions?
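One way to recover the effect of chunksize with apply_async is to group the reads into batches by hand, so that each task carries enough work to amortize the inter-process overhead. The following is a minimal sketch under stated assumptions: `batched` and `process_batch` are hypothetical helpers that do not appear in the original code, and `p_seq` is a placeholder because the real pSeq body is elided.

```python
import itertools
import multiprocessing


def batched(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def p_seq(seq):
    # Placeholder for the real pSeq, whose body is elided in the question.
    return seq[:2], seq[2:4], seq[4:6]


def process_batch(seqs):
    """Run p_seq over one whole batch inside a worker process."""
    return [p_seq(seq) for seq in seqs]


if __name__ == "__main__":
    # Stand-in for looping over pyfastx.Fastq(...) one read at a time.
    reads = ("ACACGT" for _ in range(10))
    with multiprocessing.Pool(2) as pool:
        # One task per batch instead of one task per read.
        pending = [pool.apply_async(process_batch, (batch,))
                   for batch in batched(reads, 4)]
        for async_result in pending:
            for a1, a2, b in async_result.get():
                print(a1, a2, b)
```

Note that this only addresses the per-task overhead. Every batch is still submitted up front, so the queued tasks and their results are held in memory until they are consumed.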

Answer 1

I know it's been a while since the original post, but I actually dealt with a similar problem, so I figured this might help someone at some point.

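First, the general solution is to pass imap an iterator or generator object instead of a list. In this case you would modify pSeq to accept a 3-tuple and simply drop the list comprehension. I've included some code below to show what I mean, but let me say up front: I tried this and it doesn't work (at least in my hands). My guess is that, for some reason, pyfastx.Fastq does not return an iterator or generator object (I did verify this tidbit: the returned object does not implement __next__)... I got around this by using fastq-and-furious, which is quite fast and does return a generator (and also has a more flexible Python API). If you want to skip the "solution that should work", the workaround code is at the bottom. Anyway, this is what I wanted to work: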

def pSeq(seq_tuple):
    _, seq, _ = seq_tuple
    ...
    return(A1,A2,B)

...
import multiprocessing as mp
with mp.Pool(5) as pool:
    # this fails (when I ran it on Mac, the program hung and I had to keyboard interrupt)
    # most likely due to pyfastx.Fastq not returning a generator or iterator
    parser = pyfastx.Fastq(temp2.name, build_index=False)
    result_iterator = pool.imap(pSeq, parser, chunksize=100000)
    for result in result_iterator:
        ...  # do something with result
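The diagnosis above is a guess. For reference, an object can be iterable (usable in a for loop) without being an iterator (having __next__), and a generator expression over an iterable is always a true iterator. Here is a small sketch of that distinction, using a hypothetical `Records` class as a stand-in for a parser object. Whether wrapping pyfastx.Fastq this way avoids the hang is untested.

```python
from collections.abc import Iterable, Iterator


class Records:
    """Hypothetical stand-in for a parser: iterable, but not an iterator."""

    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        # Each call returns a fresh iterator over (name, seq, quality) tuples.
        return iter(self._rows)


records = Records([("r1", "ACGT", "IIII"), ("r2", "TTGA", "IIII")])

# Iterable, yet next(records) would raise TypeError: there is no __next__.
is_iterable = isinstance(records, Iterable)
is_iterator = isinstance(records, Iterator)

# A generator expression is a true iterator that yields one sequence at a
# time, so it can stand in for the list comprehension from the question.
seqs = (seq for _, seq, _ in records)
seqs_is_iterator = isinstance(seqs, Iterator)
first = next(seqs)
```

With a generator like `seqs`, pSeq could keep taking a single sequence as in the question, for example `pool.imap(pSeq, seqs, chunksize=...)`.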

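To make this answer complete, I've also added my workaround code, which works for me. Unfortunately, I couldn't get it to run properly while still using pyfastx: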

import fastqandfurious.fastqandfurious as fqf
import fastqandfurious._fastqandfurious as _fqf

# if you don't supply an entry function, fqf returns (name, seq, quality) as byte-strings
def pfx_like_entry(buf, pos, offset=0):
    """
    Return a tuple with identical format to pyfastx, so reads can be 
    processed with the same function regardless of which parser we use
    """
    name = buf[pos[0]:pos[1]].decode('ascii')
    seq = buf[pos[2]:pos[3]].decode('ascii')
    quality = buf[pos[4]:pos[5]].decode('ascii')
    return name, seq, quality

# can be replaced with fqf.automagic_open(), gzip.open(), some other equivalent
with open(temp2.name, mode='rb') as handle, \
        mp.Pool(5) as pool:
    # this does work. You can also use biopython's fastq parsers
    # (or any other parser that returns an iterator/ generator)
    parser = fqf.readfastq_iter(fh=handle,
                                fbufsize=20000,
                                entryfunc=pfx_like_entry,
                                _entrypos=_fqf.entrypos
                                )
    result_iterator = pool.imap(pSeq, parser, chunksize=100000)
    for result in result_iterator:
        ...  # do something with result
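If adding another dependency is not an option, a small generator can play the same role for simple input. This is a minimal sketch, assuming strict four-line FASTQ records with no wrapped sequence lines and no blank lines. `read_fastq` is a hypothetical helper, not part of pyfastx or fastq-and-furious.

```python
import io


def read_fastq(handle):
    """Lazily yield (name, seq, quality) from a text-mode FASTQ handle.

    Assumes each record is exactly four lines: @name, sequence, +, quality.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Drop the leading '@' and the trailing newline from the header.
        yield header[1:].rstrip("\n"), seq, quality


# In-memory stand-in for an open FASTQ file.
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n")
records = list(read_fastq(sample))
```

Because `read_fastq(handle)` is a generator that yields tuples in the same (name, seq, quality) order, it could be handed to pool.imap in place of the parser above, with the handle opened in text mode.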
