Extracting a subset of sequences from FASTQ files using set() and FastqGeneralIterator()

Question

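I have two FASTQ files and need only the records that are shared between them. However, my script fails when writing the two output files that should contain only the matching records. I am using set() to optimize memory usage. Can someone help me fix this? Here is the code: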

from Bio.SeqIO.QualityIO import FastqGeneralIterator

infileR1= open('R1.fastq', 'r')
infileR2= open('R2.fastq', 'r')
output1= open('matchedR1.fastq', 'w')
output2= open('matchedR2.fastq', 'w')

all_names1 = set()
for line in infileR1 :
    if line[0:11] == '@GWZHISEQ01':
        read_name = line.split()[0]
        all_names1.add(read_name)

all_names2 = set()
for line in infileR2 :
    if line[0:11] == '@GWZHISEQ01':
        read_name = line.split()[0]
        all_names2.add(read_name)

shared_names = set()
for item in all_names1:
    if item in all_names2:
        shared_names.add(item)

#printing out the files:

for title, seq, qual in FastqGeneralIterator(infileR1):
    if title in new:
        output1.write("%s\n%s\n+\n%s\n" % (title, seq, qual))

for title, seq, qual in FastqGeneralIterator(infileR2):
    if title in shared_names:
        output2.write("%s\n%s\n+\n%s\n" % (title, seq, qual))

infileR1.close() 
infileR2.close()
output1.close()
output2.close()

Answer 1

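Without knowing the exact error (you should add a description of the error rather than just saying it "fails"), my guess is that you are reusing an exhausted file handle.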

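You open the handle with infileR1= open('R1.fastq', 'r').
You then read through the whole file with for line in infileR1: to collect the titles.
Finally, you pass the same handle to FastqGeneralIterator, but the file pointer is already at the end of the file, so the iterator starts at the end and yields nothing.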
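The effect can be reproduced without any FASTQ tooling. The following is a minimal sketch that uses an in-memory io.StringIO object as a stand-in for the opened file; the two records are made-up example data, not taken from the original files. It shows that a second loop over the same handle sees nothing until the handle is rewound with seek(0).

```python
import io

# In-memory stand-in for an opened FASTQ file (hypothetical two-record example).
handle = io.StringIO(
    "@GWZHISEQ01:1 1:N:0\nACGT\n+\nIIII\n"
    "@GWZHISEQ01:2 1:N:0\nTTGA\n+\nIIII\n"
)

first_pass = [line for line in handle]   # consumes the handle to the end
second_pass = [line for line in handle]  # nothing left: the handle is exhausted

handle.seek(0)                           # rewind to the start of the data
third_pass = [line for line in handle]   # the lines are available again

print(len(first_pass), len(second_pass), len(third_pass))  # prints: 8 0 8
```

A real file object returned by open() behaves the same way in this respect, which would explain why the output files end up empty.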

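You should either "rewind" the file with infileR1.seek(0) before the last loop, or change the code to use the SeqIO wrapper and pass it the file name, as the documentation suggests: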

from Bio import SeqIO

infileR1.close()

for record in SeqIO.parse("R1.fastq", "fastq"):
    # Do business here, e.g. inspect record.id
    pass
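Once the handle problem is fixed, the name comparison in the question may still need attention. FastqGeneralIterator yields each title without the leading "@" and with the full description, whereas the sets in the question store only the first token, including the "@". The sketch below is one possible way to normalize both sides and write valid FASTQ. The helper names (read_id, shared_ids, matching_records) and the sample tuples are hypothetical and only illustrate the idea.

```python
def read_id(title):
    """Return the first whitespace-delimited token of a title, without a leading '@'."""
    token = title.split()[0]
    return token[1:] if token.startswith("@") else token


def shared_ids(titles1, titles2):
    """Return the identifiers that occur in both collections of titles."""
    return {read_id(t) for t in titles1} & {read_id(t) for t in titles2}


def matching_records(records, wanted):
    """Yield FASTQ text for (title, seq, qual) tuples whose identifier is in wanted."""
    for title, seq, qual in records:
        if read_id(title) in wanted:
            # The '@' has to be written back, since the title does not include it.
            yield "@%s\n%s\n+\n%s\n" % (title, seq, qual)


# Hypothetical data in the shape FastqGeneralIterator yields (titles without '@').
r1 = [("GWZHISEQ01:1 1:N:0", "ACGT", "IIII"), ("GWZHISEQ01:2 1:N:0", "TTGA", "IIII")]
r2 = [("GWZHISEQ01:2 2:N:0", "CCAT", "IIII"), ("GWZHISEQ01:3 2:N:0", "GGGA", "IIII")]

wanted = shared_ids((t for t, _, _ in r1), (t for t, _, _ in r2))
print(wanted)  # prints: {'GWZHISEQ01:2'}
print("".join(matching_records(r1, wanted)), end="")
```

With real files, records would be FastqGeneralIterator(handle) on a freshly opened (or rewound) handle, and each yielded string would be passed to the output file's write() method. The set intersection operator & replaces the explicit loop that builds shared_names.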

Solution 1
0 2015-01-07 07:49:57