
In a Python loop, print lines from alternating files

I am trying to use Python to find four-line blocks of interest in two separate files and then print some of those lines in a controlled order. Below are the two input files and an example of the desired output file. Note that the DNA sequence in Input.fasta differs from the DNA sequence in Input.fastq because the .fasta file has been read-corrected.

Input.fasta

>read1
AAAGGCTGT
>read2
AGTCTTTAT
>read3
CGTGCCGCT

Input.fastq

@read1
AAATGCTGT
+
'(''%$'))
@read2
AGTCTCTAT
+
&---+2010
@read3
AGTGTCGCT
+
0-23;:677

DesiredOutput.fastq

@read1
AAAGGCTGT
+
'(''%$'))
@read2
AGTCTTTAT
+
&---+2010
@read3
CGTGCCGCT
+
0-23;:677

Basically, I need the sequence lines "AAAGGCTGT", "AGTCTTTAT", and "CGTGCCGCT" from Input.fasta and all other lines from Input.fastq. This restores the quality information to a read-corrected .fasta file.

Here is my closest failed attempt:

fastq = open("Input.fastq", "r")
fasta = open("Input.fasta", "r")

ReadIDs = []
IDs = []

with fastq as fq:
    for line in fq:
        if "read" in line:  
            ReadIDs.append(line)
            print(line.strip())
            for ID in ReadIDs:
                IDs.append(ID[1:6])
            with fasta as fa:
                for line in fa:
                    if any(string in line for string in IDs):
                        print(next(fa).strip())
            next(fq)
            print(next(fq).strip())
            print(next(fq).strip())

I think I am running into trouble by nesting "with" statements for two different files in the same loop. This prints the desired lines for read1 correctly, but it does not continue through the remaining lines and raises "ValueError: I/O operation on closed file".
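The error can be reproduced in isolation. The sketch below uses io.StringIO in place of a real file (an assumption made only to keep the example self-contained): once a "with" block exits, the file object is closed, so trying to enter it again on the next pass of the loop fails.

```python
import io

# Stand-in for open("Input.fasta"); a StringIO closes the same way a file does.
fasta = io.StringIO(">read1\nAAAGGCTGT\n>read2\nAGTCTTTAT\n")

errors = []
for attempt in range(2):
    try:
        with fasta as fa:            # leaving this block closes the file
            first_line = fa.readline()
    except ValueError as exc:        # raised on the second pass
        errors.append(str(exc))

print(fasta.closed)   # the file was closed by the first "with" block
print(errors)         # one "closed file" error, from the second pass
```

Opening both files once, outside the loop, avoids the problem.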

I suggest you use Biopython, which will save you a lot of trouble: it provides good parsers for these file formats that handle not only the standard cases but also, for example, multi-line fasta.

Here is an implementation that replaces the fastq sequence lines with the corresponding fasta sequence lines:

from Bio import SeqIO

fasta_dict = {record.id: record.seq for record in
              SeqIO.parse('Input.fasta', 'fasta')}

def yield_records():
    for record in SeqIO.parse('Input.fastq', 'fastq'):
        record.seq = fasta_dict[record.id]
        yield record

SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
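To see what the dictionary lookup amounts to without the parser, here is a rough plain-Python sketch of the same idea. The helper names read_fasta_sequences and replace_sequences are hypothetical, and the sketch assumes one sequence line per fasta record and strict four-line fastq records, which is exactly the restriction Biopython does not have.

```python
def read_fasta_sequences(lines):
    """Map read ID -> sequence, assuming one sequence line per record."""
    sequences = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current_id = line[1:].split()[0]     # ">read1" -> "read1"
        elif line and current_id is not None:
            sequences[current_id] = line
    return sequences


def replace_sequences(fastq_lines, sequences):
    """Yield fastq lines with each sequence line swapped for the fasta one."""
    current_id = None
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:                           # header line, e.g. "@read1"
            current_id = line[1:].split()[0]
            yield line
        elif i % 4 == 1:                         # sequence line: look it up by ID
            yield sequences[current_id]
        else:                                    # "+" separator and quality line
            yield line


# Example data taken from the question (first two reads only).
fasta = [">read1\n", "AAAGGCTGT\n", ">read2\n", "AGTCTTTAT\n"]
fastq = ["@read1\n", "AAATGCTGT\n", "+\n", "'(''%$'))\n",
         "@read2\n", "AGTCTCTAT\n", "+\n", "&---+2010\n"]

for out_line in replace_sequences(fastq, read_fasta_sequences(fasta)):
    print(out_line)
```

A read ID that is missing from the fasta raises KeyError here, just as fasta_dict[record.id] would.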

If you don't want to use the headers and instead rely on the order alone, the solution is even simpler and more memory-efficient (just make sure the order and number of records are the same). There is no need to build the dictionary first; just iterate over the records together:

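from Bio import SeqIO

fasta_records = SeqIO.parse('Input.fasta', 'fasta')
fastq_records = SeqIO.parse('Input.fastq', 'fastq')

def yield_records():
    # Walk both files in lockstep; zip stops at the shorter of the two.
    for fasta_record, fastq_record in zip(fasta_records, fastq_records):
        fastq_record.seq = fasta_record.seq
        yield fastq_record

SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')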

I like the Biopython solution by @Chris_Rands better for small files, but here is a solution that uses only the batteries included with Python and is memory-efficient. It assumes that the fasta and fastq files contain the same number of reads in the same order.

with open('Input.fasta') as fasta, open('Input.fastq') as fastq, open('DesiredOutput.fastq', 'w') as fo:
    for i, line in enumerate(fastq):
        if i % 4 == 1:
            for j in range(2):
                line = fasta.readline()
        print(line, end='', file=fo)
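Because that loop trusts the order of the two files, it may be worth checking the assumption while merging. The following is a minimal sketch, not part of the answer above: merge_in_order is a hypothetical helper that walks both inputs in lockstep and raises if a fasta header does not match the fastq header it is paired with. It still assumes single-line fasta sequences, and it does not detect extra fasta records left over after the fastq ends.

```python
def merge_in_order(fasta_lines, fastq_lines):
    """Yield fastq lines, taking each sequence line from the fasta instead."""
    fasta_iter = iter(fasta_lines)
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            # Header line: the next fasta line must be the matching header.
            header = next(fasta_iter, "")        # "" if the fasta ran out
            if header[1:].strip() != line[1:].strip():
                raise ValueError(f"record mismatch at fastq line {i + 1}")
        elif i % 4 == 1:
            # Sequence line: use the corrected sequence from the fasta.
            sequence = next(fasta_iter, "")
            if not sequence.strip():
                raise ValueError(f"missing fasta sequence at fastq line {i + 1}")
            line = sequence
        yield line

# Intended use with the real files (not executed here):
# with open('Input.fasta') as fasta, open('Input.fastq') as fastq, \
#         open('DesiredOutput.fastq', 'w') as fo:
#     fo.writelines(merge_in_order(fasta, fastq))
```

Since it is a generator over file iterators, only one line from each file is held in memory at a time.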
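
## Open the files (and close them after the 'with' block ends)
with open("Input.fastq", "r") as fq, open("Input.fasta", "r") as fa:

    ## Read in the Input.fastq file and save its content to a list
    fastq = fq.readlines()

    ## Do the same for the Input.fasta file
    fasta = fa.readlines()


## For every four-line record in the Input.fastq file
## (assumes both files list the same reads in the same order)
for i in range(len(fastq) // 4):
    print(fastq[4 * i], end="")        ## header line from the fastq
    print(fasta[2 * i + 1], end="")    ## corrected sequence from the fasta
    print(fastq[4 * i + 2], end="")    ## '+' separator line
    print(fastq[4 * i + 3], end="")    ## quality line from the fastq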
