Python print lines after context

How can I print the two lines that follow the context I am interested in, using Python?

Example.fastq

@read1
AAAGGCTGTACTTCGTTCCAGTTG
+
'(''%$'))%**)2+'.(&&'/5-
@read2
CTGAGTTGAGTTAGTGTTGACTC
+
)(+-0-2145=588..,(1-,12

I can find the context of interest using...

fastq = open("Example.fastq", "r")

IDs = ["read1"]

with fastq as fq:
    for line in fq:
        if any(string in line for string in IDs):
            ...  # match found; what goes here?

Now that I have found read1, I want to print the lines that follow it. In bash I might use something like grep -A to do this. The desired output looks like the following.

+
'(''%$'))%**)2+'.(&&'/5-

But in Python I can't seem to find an equivalent tool. Perhaps "islice" might work, but I don't see how to get islice to start at the position of the match.

with fastq as fq:
    for line in fq:
        if any(string in line for string in IDs):
            print(list(islice(fq,3,4)))

You can use next() to advance an iterator (including files):

print(next(fq))
print(next(fq))

This consumes those lines, so the for loop will not see them again; it resumes with whatever line comes next.

If you don't want the AAA... line, you can also just consume it with next(fq). In full:
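To make the consuming behaviour concrete, here is a small sketch that uses an in-memory io.StringIO as a stand-in for Example.fastq (an assumption made so the snippet runs without the file). It records which lines the for loop itself gets to see when next() is called three times after the match.

```python
import io

# In-memory stand-in for Example.fastq, using the records from the question.
fq = io.StringIO(
    "@read1\nAAAGGCTGTACTTCGTTCCAGTTG\n+\n'(''%$'))%**)2+'.(&&'/5-\n"
    "@read2\nCTGAGTTGAGTTAGTGTTGACTC\n+\n)(+-0-2145=588..,(1-,12\n"
)

seen_by_loop = []     # lines handed to the for loop body
taken_by_next = []    # lines pulled out with next()

for line in fq:
    seen_by_loop.append(line.strip())
    if "read1" in line:
        # next() draws from the same iterator the for loop is using,
        # so these three lines never reach the loop body.
        for _ in range(3):
            taken_by_next.append(next(fq).strip())

print(seen_by_loop)
print(taken_by_next)
```

After the three next() calls, the loop picks up again at @read2. Note that next(fq) raises StopIteration if the file ends early; next(fq, "") returns a default instead.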

fastq = open("Example.fastq", "r")

IDs = ["read1"]

with fastq as fq:
    for line in fq:
        if any(string in line for string in IDs):
            next(fq)  # skip AAA line
            print(next(fq).strip())  # strip off the extra newlines
            print(next(fq).strip())

which gives

+
'(''%$'))%**)2+'.(&&'/5-
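The islice idea from the question can also be made to work, because islice pulls from the file iterator's current position rather than from the start of the file. The following is a rough sketch along those lines; the helper name lines_after and its skip and count parameters are made up for illustration, and io.StringIO stands in for the real file.

```python
import io
from itertools import islice


def lines_after(handle, ids, skip=1, count=2):
    """Yield `count` lines after each matching line, ignoring the first `skip`.

    Hypothetical helper, roughly comparable to `grep -A` without the match itself.
    """
    for line in handle:
        if any(read_id in line for read_id in ids):
            # islice continues from where the for loop left off:
            # it discards `skip` lines, then yields the next `count`.
            for following in islice(handle, skip, skip + count):
                yield following.rstrip("\n")


# In-memory stand-in for Example.fastq.
sample = io.StringIO(
    "@read1\nAAAGGCTGTACTTCGTTCCAGTTG\n+\n'(''%$'))%**)2+'.(&&'/5-\n"
    "@read2\nCTGAGTTGAGTTAGTGTTGACTC\n+\n)(+-0-2145=588..,(1-,12\n"
)

result = list(lines_after(sample, ["read1"]))
for item in result:
    print(item)
```

Keep in mind that the substring test also matches IDs such as read10, and that a quality line may itself contain the text being searched for, so comparing against the full header line is safer on real data.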

If you're handling FASTQ files, you're better off using a bioinformatics library like BioPython instead of rolling your own parser.

To get the exact result you requested, you can do:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

IDs = ['read1']

with open('Example.fastq') as in_handle:
    for title, seq, qual in FastqGeneralIterator(in_handle):
        # The ID is the first word in the title line (after the @ sign):
        if title.split(None, 1)[0] in IDs:
            # Line 3 is always a '+', optionally followed by the same sequence identifier again.
            print('+') 
            print(qual)

But you can't do much with the line of quality values on its own. Your next step will almost certainly be to convert it to Phred quality scores. This is notoriously complicated, though, because there are at least three different and incompatible variants of the FASTQ file format. BioPython takes care of all the edge cases for you, so that you can just do:

from Bio.SeqIO import parse

IDs = ['read1']

with open('Example.fastq') as in_handle:
    for record in parse(in_handle, 'fastq'):
        if record.id in IDs:
            print(record.letter_annotations["phred_quality"])
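For orientation, here is roughly what that conversion amounts to in the simplest case. This sketch assumes the Sanger variant (ASCII offset 33, also used by Illumina 1.8+), which is what the 'fastq' format name in Biopython refers to; the function name phred_scores is made up for illustration, and other variants use a different offset or scale.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to Phred scores.

    Assumes Sanger-style encoding: score = ASCII code - 33.
    """
    return [ord(ch) - offset for ch in quality_line]


# Quality line of read1 from the question.
scores = phred_scores("'(''%$'))%**)2+'.(&&'/5-")
print(scores)
```

If the file came from an older Illumina pipeline, the offset would be 64 instead, which is exactly the kind of detail a library is better placed to handle.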
