
How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file, which is usually in this format:

>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA

I've created another program that reads the first line (a.k.a. the header) of this FASTA file, and now I want this second program to start reading and printing from the sequence.

How would I do that?

So far I have this:

FASTA = open("test.txt", "r")

def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print(line)


readSeq(FASTA)

Thanks guys

-Noob

def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    next(FASTA)  # skip the header record (calls FASTA.next() on Python 2)
    for line in FASTA:
        line = line.strip()
        print(line)

Read the docs on file.next() to see why you should be wary of mixing file.readline() with a "for line in file:" loop: in Python 2, iteration uses a hidden read-ahead buffer, so a readline() call made after iteration has started may not return the line you expect.
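
In Python 3 the file.next() method no longer exists and the built-in next() does the same job. The following is a minimal sketch of that approach, assuming a single-record FASTA file. The name read_seq_lines and the in-memory sample are illustrative and not part of the original answer.

```python
import io


def read_seq_lines(handle):
    """Return the sequence lines of a single-record FASTA file."""
    # Discard the header. The None default keeps next() from raising
    # StopIteration if the file is empty.
    next(handle, None)
    return [line.strip() for line in handle]


# In-memory stand-in for open("test.txt"); the contents are made up.
sample = io.StringIO(">example header\nGACGGCTT\nTTAGATTT\n")
seq_lines = read_seq_lines(sample)
for seq_line in seq_lines:
    print(seq_line)
```

Passing a default to next() is a design choice: without it, an empty file would raise StopIteration inside the function.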

You should show your script. To read from the second line, try something like this:

f = open("file")
f.readline()  # read and discard the first line
for line in f:
    print(line)
f.close()

You might be interested in checking Biopython's handling of FASTA files (source).

def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).

handle - input file
alphabet - optional alphabet
title2ids - A function that, when given the title of the FASTA
file (without the beginning >), will return the id, name and
description (in that order) for the record as a tuple of strings.

If this is not given, then the entire title line will be used
as the description, and the first word as the id and name.

Note that use of title2ids matches that of Bio.Fasta.SequenceParser
but the defaults are slightly different.
"""
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break

    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id

        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()

        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                         id = id, name = name, description = descr)

        if not line : return #StopIteration
    assert False, "Should not reach this line"
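
The core of that parser can be sketched without Biopython installed. This is a simplified illustration, not Biopython's actual code: iter_fasta is a hypothetical name, and it yields plain (title, sequence) tuples instead of SeqRecord objects.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA-formatted file object."""
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence line: drop any internal spaces before storing it.
            chunks.append(line.replace(" ", ""))
        # Lines before the first header are skipped.
    if title is not None:
        yield title, "".join(chunks)


# Made-up sample with a leading comment line and two records.
sample = io.StringIO("comment\n>seq1 first\nGA CG\nTTAG\n>seq2\nCCCC\n")
records = list(iter_fasta(sample))
```

Because it is a generator, only one record's lines are held in memory at a time.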

Good to see another bioinformatician :)

Just include an if clause within your for loop, above the line.strip() call:

def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
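
If the caller needs the sequence itself instead of printed lines, the same filter can build a string. This is a sketch under the assumption that the file holds one record; with several records it would concatenate all of their sequences. The name read_seq_string and the sample data are illustrative.

```python
import io


def read_seq_string(handle):
    """Return every non-header line of a FASTA file joined into one string."""
    return "".join(
        line.strip() for line in handle if not line.startswith(">")
    )


# In-memory stand-in for an opened FASTA file; the contents are made up.
sample = io.StringIO(">example header\nGACG\nTTAG\n")
sequence = read_seq_string(sample)
```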

A simple, Pythonic way to do this is slice notation.

>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']

That says "give me all elements of lines, from the second (index 1) to the end."

Other general uses of slice notation:

s[i:j]  slice of s from i to j
s[i:j:k]    slice of s from i to j with step k (k can be negative to go backward)

Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.

s[:-1]     All but the last element. 
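
The slice forms listed above can be tried on a small list. The list contents here are made up for illustration.

```python
lines = ["header", "GACG", "TTAG", "TCGC"]

body = lines[1:]          # everything from the second element on
no_last = lines[:-1]      # everything except the last element
every_other = lines[::2]  # every second element, starting at index 0
backwards = lines[::-1]   # a reversed copy (negative step)
```

Each slice returns a new list, so the original lines list is left unchanged.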

Edit in response to gnibbler's comment:

If the file is truly massive, you can use iterator slicing to get the same effect without ever loading the whole file into memory.

import itertools
f = open("filename")
# start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
f.close()

"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
