
How to skip reading the first line of an input file in Python

I'm trying to write a script that automatically processes files containing DNA sequences and converts them into protein sequences. My only hiccup so far is that, for my script to process the input file, I need to somehow omit the first line.

My code looks like this:

#!/usr/bin/python

### This script will translate in all three frames ###

import glob
from Bio.Seq import *
from Bio.Alphabet import generic_dna

print "Please drag in the directory to be processed: "
folder = raw_input().replace(" ","")
file = glob.glob(str(folder) + "/" + '*.seq')

for i in file:

        with open (i, "r") as myfile:

                ### Need to somehow remove / read over the first line of the input...###

                seq = myfile.read().replace(" ", "").replace("\n", "")

        x = 0

        output = open(i + ".tran", "w+")

        while x < 3:
                cd = Seq(seq[x:], generic_dna)
                qes = seq[::-1]
                cdr = Seq(qes[x:], generic_dna)
                error = cd.translate().count('*')
                reverror = cdr.translate().count('*')
                output.write(str(cd.translate()) + "\nStops: " + str(error) + "\n\n")
                output.write("Reverse\n" + str(cdr.translate()) + "\nStops: " + str(reverror) + "\n\n")
                x += 1

And my input files might look something like this:

>G2-pBAD-Forward_A11.ab1
NNNNNNNNNNCNNNCNGNNGCTTTTTATCGCAACTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCGA
GCTCGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGATTGTAATGAAACGAGTTATTACCCTGTTTGCTGT
ACTGCTGATGGGCTGGTCGGTAAATGCCTGGTCAAGCTTGGCTGTTTTGGCGGATGAGAGAAGATTTTCAGCCTGATACA
GATTAAATCAGAACGCAGAAGCGGTCTGATAAAACAGAATTTGCCTGGCGGCAGTAGCGCGGTGGTCCCACCTGACCCCA
TGCCGAACTCAGAAGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAGTAGGGAACTGCCAGGCA
TCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCCTGAGTA
GGACAAATCCGCCGGGAGCGGATTTGAACGTTGCGAAGCAACGGCCCGGAGGGTGGCGGGCAGGACGCCCGCCATAAACT
GCCAGGCATCAAATTAAGCAGAAGGCCATCCTGACGGATGGCCTTTTTGCGTTTCTACAAACTCTTTTGTTTATTTTTCT
AAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATG
AGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCT
GGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCAGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCA
ACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGGTTTATT
GCTGATAAATCTGGAGCCGGTGAGCGTGNTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTTCCCGTAT
CGNANTTNNCTACACGAN

What I need is an elegant way to remove that first line containing the '>'.

You can use SeqIO:

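from Bio import SeqIO

with open(i) as myfile:
    # SeqIO.parse handles the '>' header line and yields one record per sequence
    for record in SeqIO.parse(myfile, "fasta"):
        print(record.id)
        print(record.seq)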

record.seq gives you the sequence and record.id gives you the ID. If you only have, or only want, a single sequence per file, you can just call next:

with open(i) as myfile:
    print(next(SeqIO.parse(myfile, "fasta")).seq)

I don't see any spaces in your input, so I am not sure what the replace calls are meant to do. The code above outputs the sequence as a single string.

Output:

NNNNNNNNNNCNNNCNGNNGCTTTTTATCGCAACTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCGAGCTCGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGATTGTAATGAAACGAGTTATTACCCTGTTTGCTGTACTGCTGATGGGCTGGTCGGTAAATGCCTGGTCAAGCTTGGCTGTTTTGGCGGATGAGAGAAGATTTTCAGCCTGATACAGATTAAATCAGAACGCAGAAGCGGTCTGATAAAACAGAATTTGCCTGGCGGCAGTAGCGCGGTGGTCCCACCTGACCCCATGCCGAACTCAGAAGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAGTAGGGAACTGCCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCCTGAGTAGGACAAATCCGCCGGGAGCGGATTTGAACGTTGCGAAGCAACGGCCCGGAGGGTGGCGGGCAGGACGCCCGCCATAAACTGCCAGGCATCAAATTAAGCAGAAGGCCATCCTGACGGATGGCCTTTTTGCGTTTCTACAAACTCTTTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCAGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGNTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTTCCCGTATCGNANTTNNCTACACGAN

You can also use range instead of your while loop and pass the alphabet to SeqIO:

for record in SeqIO.parse(myfile, "fasta", generic_dna):
    ...
for x in range(3):
    ...

This should be closer to what you want:

import glob

from Bio import SeqIO
from Bio.Alphabet import generic_dna  # removed in Biopython 1.78; on newer versions drop the alphabet argument

folder = raw_input().replace(" ", "")
files = glob.glob(folder + "/" + "*.seq")

for i in files:
    with open(i) as myfile:
        # SeqIO consumes the '>' header, so seq holds only the bases
        seq = next(SeqIO.parse(myfile, "fasta", generic_dna)).seq
    qes = seq[::-1]  # reversed sequence (not the reverse complement)
    with open("{}.tran".format(i), "w+") as output:
        for x in range(3):
            cd = seq[x:]
            cdr = qes[x:]
            # count stops in the current frame, not in the full sequence
            error = cd.translate().count('*')
            reverror = cdr.translate().count('*')
            output.write("{}\nStops: {}\n\n".format(cd.translate(), error))
            output.write("Reverse: {}\nStops: {}\n\n".format(cdr.translate(), reverror))

Which outputs:

XXXXXXXFLSQLSTVSPYPFFWASEFELEIILFNFKKEIYI*L**NELLPCLLYC*WAGR*MPGQAWLFWRMREDFQPDTD*IRTQKRSDKTEFAWRQ*RGGPT*PHAELRSETP*RRW*CGVSPCESRELPGIK*NERLSRKTGPFVLSVVCR*TLS*VGQIRRERI*TLRSNGPEGGGQDARHKLPGIKLSRRPS*RMAFLRFYKLFCLFF*IHSNMYPLMRQ*P**MLQ*Y*KRKSMSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGAANY*LANYLL*LPGNN**TGWRRIKLQDHFCARPFRLAGFIADKSGAGERXLAVSLQHWGQMVSPSRIXXXYT
Stops: 25

Reverse: XHIXXXYALPEW*TGVTTLLWRSXASGRGLNSRYLGRSAFPARVFTRTLK*AEVGQIINNGPSISFIKRSIIKRRGLTRSRRK*KWSQRPTRFCPSVLRRFFPYSRCAFTTYEYEKEKVIITS*IVPITEYSPMYKLT*IFLFVFSNIFAFFR*AVLPEDELNYGPSNTARRTGGGRPGNEALQV*ARAA*TG*VLSQVAVCCLFCFPGQKADSESKINYGPSRDESVPLWGVMVAAMPQSEDSSRTPVHPGGAMTAVRLRQNSLAKTQD*IRHSPTFRRE*AVLSVRTGP*MAGRVVVMSFVPLLSKVMLVYI*RKNFNLF**SSSLSDRVFLPIPLCHLSTLFFXXXXXX
Stops: 15

XXXXXXAFYRNSLLFLHTRFFGLANSSSK*FCLTLRRRYTYDCNETSYYPVCCTADGLVGKCLVKLGCFGG*EKIFSLIQIKSERRSGLIKQNLPGGSSAVVPPDPMPNSEVKRRSADGSVGSPHARVGNCQASNKTKGSVERLGLSFYLLFVGERSPE*DKSAGSGFERCEATARRVAGRTPAINCQASN*AEGHPDGWPFCVSTNSFVYFSKYIQICIRS*DNNPDKCFNNIEKGRV*VFNISVSPLFPFLRHFAFLFLLTQKRW*K*KMLKISWVQQTINWRTTYSSFPATINRLDGGG*SCRTTSALGPSGWLGLLLINLEPVSVXSRYHCSTGARW*ALPVSXXXTR
Stops: 10

Reverse: STSXXAMPFPNGRPGSRRYYGALVRVAEV*IVVIWVGRPSRLASSPGR*NRRR*VR*LTTALRSHSSSGQLSNDVG*LEVVENESGRKDPLVFVLPFYGVFSLIPAVPLQLMSMRRKKL**LRK*SQ*QSTRLCINLHKSFYLFSQTSLRFSGRQSYRKTN*TTDRQIPPAGRAVGGPATKRCKFRRGPPKQDESSRKWLFVVYFAFRVRKLTRKAK*TTDRQGMRAYPSGV*W*PRCRKVKTQAVPQSTLVAR*RRSV*DKIVWRRRKTKLDIVRLLEESRRFCRFELVRKWLVG*SSCRLSHY*AK*C*YTYRGRISICFNKARA*AIGFFCPYLFVISQRYFSXXXXXX
Stops: 20

XXXXXXLFIATLYCFSIPVFLG*RIRARNNFV*L*EGDIHMIVMKRVITLFAVLLMGWSVNAWSSLAVLADERRFSA*YRLNQNAEAV**NRICLAAVARWSHLTPCRTQK*NAVAPMVVWGLPMRE*GTARHQIKRKAQSKDWAFRFICCLSVNALLSRTNPPGADLNVAKQRPGGWRAGRPP*TARHQIKQKAILTDGLFAFLQTLLFIFLNTFKYVSAHETITLINASIILKKEEYEYSTFPCRPYSLFCGILPSCFCSPRNAGESKRC*RSVGCSKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWVYC**IWSR*AXSRGIIAALGPDGKPFPYRXXLHX
Stops: 13

Reverse: AHXXXLCPSRMVDRGHDVTMALXCEWPRSK*SLFGSVGLPGSRLHQDVEIGGGRSDN*QRPFDLIHQAVNYQTTWVD*KS*KMKVVAKTHSFLSFRFTAFFPLFPLCLYNL*V*EGKSYNNFVNSPNNRVLAYV*TYINLFICFLKHLCVFPVGSPTGRRIKLRTVKYRPQDGRWEARQRSVASLGEGRLNRMSPLASGCLLSILLSGSES*LGKQNKLRTVKG*ERTPLGCDGSRDAAK*RLKPYPSPPWWRDDGGPFKTK*SGEDARLN*T*SDF*KRVGGFVGSNWSVNGWSGSRHVVCPIIEQSNVSIHIEEEFQFVLIKLELKRSGFFAHTSLSSLNAIFRXXXXXX
Stops: 14

Note that there is a warning, because the length of your sequences is not a multiple of three.
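One way to avoid that warning is to trim each reading frame to a whole number of codons before translating it. The following is only a sketch in plain Python with no Biopython dependency; the helper name `codon_frames` is made up for illustration and is not part of any library.

```python
def codon_frames(seq):
    """Return the three reading frames of seq, each trimmed to a multiple of 3."""
    frames = []
    for offset in range(3):
        frame = seq[offset:]
        # Drop the 1 or 2 trailing bases that do not form a complete codon.
        usable = len(frame) - len(frame) % 3
        frames.append(frame[:usable])
    return frames


if __name__ == "__main__":
    # Hypothetical 12-base input, used only to show the slicing.
    for frame in codon_frames("ATGAAACGAGTT"):
        print(len(frame), frame)
```

The same slicing should work on a Seq object as well as on a plain string, so each trimmed frame could be translated directly.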

starting = True
lines = []
with open (i, "r") as myfile:
    for line in myfile:
        if starting:
            starting = False
            continue
        lines.append(line)

# now you have all the lines except the first one in "lines"     
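A shorter variant of the flag-based loop above is to advance the file iterator once with the built-in next before reading the rest. This is a minimal sketch, assuming the header is always exactly the first line; the function name `read_without_first_line` and the sample file contents are hypothetical.

```python
import os
import tempfile


def read_without_first_line(path):
    """Return the file's contents minus its first line, joined into one string."""
    with open(path) as handle:
        # Consume the header line; the None default avoids StopIteration on an empty file.
        next(handle, None)
        # Strip line endings and surrounding whitespace from every remaining line.
        return "".join(line.strip() for line in handle)


if __name__ == "__main__":
    # Write a small made-up FASTA-like file, read it back, then clean up.
    with tempfile.NamedTemporaryFile("w", suffix=".seq", delete=False) as tmp:
        tmp.write(">example_header\nACGT\nTTGA\n")
    print(read_without_first_line(tmp.name))
    os.remove(tmp.name)
```

Because the file object is its own iterator, the loop after next simply continues from the second line, so no flag variable is needed.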

When you read the file, check whether the first character of each line is '>'. If it is, skip that line. As the comment above mentioned, you may want to make this check every time you read a line, so that you know whether you are still in the same sequence or starting a new one, since your file may contain multiple sequences.
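The per-line check described above could look roughly like this. It is a sketch rather than a full FASTA parser: the generator name `parse_fasta` is hypothetical, it assumes every record starts with a '>' line, and it ignores any sequence lines that appear before the first header. For real data, SeqIO.parse is likely the more robust choice.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence line belonging to the current record.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Made-up two-record input, used only for illustration.
    sample = [">first\n", "ACGT\n", "NNAC\n", ">second\n", "TTGA\n"]
    for name, seq in parse_fasta(sample):
        print(name, seq)
```

An open file object can be passed in directly, since iterating over it yields one line at a time.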

This link covers reading and writing files in Python: https://docs.python.org/2/tutorial/inputoutput.html
