[英]How to skip reading the first line of an input file in Python
I'm trying to write a script that will automatically process the files containing DNA sequences and convert them into a protein sequence. 我正在尝试编写一个脚本,该脚本将自动处理包含DNA序列的文件并将其转换为蛋白质序列。 My only hiccup thus far is that in order for my script to process the infile I need to somehow omit the first line.
到目前为止,我唯一遇到的麻烦是,为了让我的脚本处理文件内文件,我需要以某种方式省略第一行。
My code looks like this: 我的代码如下所示:
#!/usr/bin/python
### This script will translate in all three frames ###
import glob
from Bio.Seq import *
from Bio.Alphabet import generic_dna
print "Please drag in the directory to be processed: "
folder = raw_input().replace(" ","")
file = glob.glob(str(folder) + "/" + '*.seq')
for i in file:
with open (i, "r") as myfile:
### Need to somehow remove / read over the first line of the input...###
seq = myfile.read().replace(" ", "").replace("\n", "")
x = 0
output = open(i + ".tran", "w+")
while x < 3:
cd = Seq(seq[x:], generic_dna)
qes = seq[::-1]
cdr = Seq(qes[x:], generic_dna)
error = cd.translate().count('*')
reverror = cdr.translate().count('*')
output.write(str(cd.translate()) + "\nStops: " + str(error) + "\n\n")
output.write("Reverse\n" + str(cdr.translate()) + "\nStops: " + str(reverror) + "\n\n")
x += 1
And my input files might look something like this 我的输入文件可能看起来像这样
>G2-pBAD-Forward_A11.ab1
NNNNNNNNNNCNNNCNGNNGCTTTTTATCGCAACTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCGA
GCTCGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGATTGTAATGAAACGAGTTATTACCCTGTTTGCTGT
ACTGCTGATGGGCTGGTCGGTAAATGCCTGGTCAAGCTTGGCTGTTTTGGCGGATGAGAGAAGATTTTCAGCCTGATACA
GATTAAATCAGAACGCAGAAGCGGTCTGATAAAACAGAATTTGCCTGGCGGCAGTAGCGCGGTGGTCCCACCTGACCCCA
TGCCGAACTCAGAAGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAGTAGGGAACTGCCAGGCA
TCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCCTGAGTA
GGACAAATCCGCCGGGAGCGGATTTGAACGTTGCGAAGCAACGGCCCGGAGGGTGGCGGGCAGGACGCCCGCCATAAACT
GCCAGGCATCAAATTAAGCAGAAGGCCATCCTGACGGATGGCCTTTTTGCGTTTCTACAAACTCTTTTGTTTATTTTTCT
AAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATG
AGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCT
GGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCAGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCA
ACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGGTTTATT
GCTGATAAATCTGGAGCCGGTGAGCGTGNTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTTCCCGTAT
CGNANTTNNCTACACGAN
What I need is an elegant way to remove that first line containing the '>'. 我需要的是一种优雅的方法来删除包含“>”的第一行。
You can use SeqIO : 您可以使用SeqIO :
with open(i) as myfile:
for record in SeqIO.parse(myfile, "fasta"):
print record.id
print record.seq
record.seq
will give you the sequence, record.id
will give you the id, if you only have or want a single sequence in each you can just call next: record.seq
会给你的序列, record.id
会给你的ID,如果你只拥有或者希望在每一个序列你可以调用next:
with open(i) as myfile:
print(next(SeqIO.parse(myfile, "fasta"))).seq
I don't see any spaces in your input so I am not sure how replace would work, this will output the sequence as a single string. 我在您的输入中看不到任何空格,所以我不确定替换将如何工作,这会将序列输出为单个字符串。
Output: 输出:
NNNNNNNNNNCNNNCNGNNGCTTTTTATCGCAACTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCGAGCTCGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGATTGTAATGAAACGAGTTATTACCCTGTTTGCTGTACTGCTGATGGGCTGGTCGGTAAATGCCTGGTCAAGCTTGGCTGTTTTGGCGGATGAGAGAAGATTTTCAGCCTGATACAGATTAAATCAGAACGCAGAAGCGGTCTGATAAAACAGAATTTGCCTGGCGGCAGTAGCGCGGTGGTCCCACCTGACCCCATGCCGAACTCAGAAGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAGTAGGGAACTGCCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCCTGAGTAGGACAAATCCGCCGGGAGCGGATTTGAACGTTGCGAAGCAACGGCCCGGAGGGTGGCGGGCAGGACGCCCGCCATAAACTGCCAGGCATCAAATTAAGCAGAAGGCCATCCTGACGGATGGCCTTTTTGCGTTTCTACAAACTCTTTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCAGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGNTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTTCCCGTATCGNANTTNNCTACACGAN
You can also use range instead of your while loop and pass the alphabet to SeQIO: 您还可以使用range而不是while循环,并将字母传递给SeQIO:
for record in SeqIO.parse(myfile, "fasta",generic_dna)
...
for x in range(3):
....
This should be closer to what you want: 这应该更接近您想要的:
from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
folder = raw_input().replace(" ","")
files = glob.glob(folder + "/" + '*.seq')
for i in files:
with open(i) as myfile:
seq = next(SeqIO.parse(myfile, "fasta", generic_dna)).seq
qes = seq[::-1]
with open("{}.tran".format(i), "w+") as output:
for x in range(3):
cd = seq[x:]
cdr = qes[x:]
error = seq.translate().count('*')
reverror = cdr.translate().count('*')
output.write("{}\nstops: {}\n\n".format(cd.translate(), error))
output.write("Reverse: {}\nStops: {}\n\n ".format(cdr.translate(), reverror))
Which outputs: 哪个输出:
XXXXXXXFLSQLSTVSPYPFFWASEFELEIILFNFKKEIYI*L**NELLPCLLYC*WAGR*MPGQAWLFWRMREDFQPDTD*IRTQKRSDKTEFAWRQ*RGGPT*PHAELRSETP*RRW*CGVSPCESRELPGIK*NERLSRKTGPFVLSVVCR*TLS*VGQIRRERI*TLRSNGPEGGGQDARHKLPGIKLSRRPS*RMAFLRFYKLFCLFF*IHSNMYPLMRQ*P**MLQ*Y*KRKSMSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGAANY*LANYLL*LPGNN**TGWRRIKLQDHFCARPFRLAGFIADKSGAGERXLAVSLQHWGQMVSPSRIXXXYT
stops: 25
Reverse: XHIXXXYALPEW*TGVTTLLWRSXASGRGLNSRYLGRSAFPARVFTRTLK*AEVGQIINNGPSISFIKRSIIKRRGLTRSRRK*KWSQRPTRFCPSVLRRFFPYSRCAFTTYEYEKEKVIITS*IVPITEYSPMYKLT*IFLFVFSNIFAFFR*AVLPEDELNYGPSNTARRTGGGRPGNEALQV*ARAA*TG*VLSQVAVCCLFCFPGQKADSESKINYGPSRDESVPLWGVMVAAMPQSEDSSRTPVHPGGAMTAVRLRQNSLAKTQD*IRHSPTFRRE*AVLSVRTGP*MAGRVVVMSFVPLLSKVMLVYI*RKNFNLF**SSSLSDRVFLPIPLCHLSTLFFXXXXXX
Stops: 15
XXXXXXAFYRNSLLFLHTRFFGLANSSSK*FCLTLRRRYTYDCNETSYYPVCCTADGLVGKCLVKLGCFGG*EKIFSLIQIKSERRSGLIKQNLPGGSSAVVPPDPMPNSEVKRRSADGSVGSPHARVGNCQASNKTKGSVERLGLSFYLLFVGERSPE*DKSAGSGFERCEATARRVAGRTPAINCQASN*AEGHPDGWPFCVSTNSFVYFSKYIQICIRS*DNNPDKCFNNIEKGRV*VFNISVSPLFPFLRHFAFLFLLTQKRW*K*KMLKISWVQQTINWRTTYSSFPATINRLDGGG*SCRTTSALGPSGWLGLLLINLEPVSVXSRYHCSTGARW*ALPVSXXXTR
stops: 25
Reverse: STSXXAMPFPNGRPGSRRYYGALVRVAEV*IVVIWVGRPSRLASSPGR*NRRR*VR*LTTALRSHSSSGQLSNDVG*LEVVENESGRKDPLVFVLPFYGVFSLIPAVPLQLMSMRRKKL**LRK*SQ*QSTRLCINLHKSFYLFSQTSLRFSGRQSYRKTN*TTDRQIPPAGRAVGGPATKRCKFRRGPPKQDESSRKWLFVVYFAFRVRKLTRKAK*TTDRQGMRAYPSGV*W*PRCRKVKTQAVPQSTLVAR*RRSV*DKIVWRRRKTKLDIVRLLEESRRFCRFELVRKWLVG*SSCRLSHY*AK*C*YTYRGRISICFNKARA*AIGFFCPYLFVISQRYFSXXXXXX
Stops: 20
XXXXXXLFIATLYCFSIPVFLG*RIRARNNFV*L*EGDIHMIVMKRVITLFAVLLMGWSVNAWSSLAVLADERRFSA*YRLNQNAEAV**NRICLAAVARWSHLTPCRTQK*NAVAPMVVWGLPMRE*GTARHQIKRKAQSKDWAFRFICCLSVNALLSRTNPPGADLNVAKQRPGGWRAGRPP*TARHQIKQKAILTDGLFAFLQTLLFIFLNTFKYVSAHETITLINASIILKKEEYEYSTFPCRPYSLFCGILPSCFCSPRNAGESKRC*RSVGCSKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWVYC**IWSR*AXSRGIIAALGPDGKPFPYRXXLHX
stops: 25
Reverse: AHXXXLCPSRMVDRGHDVTMALXCEWPRSK*SLFGSVGLPGSRLHQDVEIGGGRSDN*QRPFDLIHQAVNYQTTWVD*KS*KMKVVAKTHSFLSFRFTAFFPLFPLCLYNL*V*EGKSYNNFVNSPNNRVLAYV*TYINLFICFLKHLCVFPVGSPTGRRIKLRTVKYRPQDGRWEARQRSVASLGEGRLNRMSPLASGCLLSILLSGSES*LGKQNKLRTVKG*ERTPLGCDGSRDAAK*RLKPYPSPPWWRDDGGPFKTK*SGEDARLN*T*SDF*KRVGGFVGSNWSVNGWSGSRHVVCPIIEQSNVSIHIEEEFQFVLIKLELKRSGFFAHTSLSSLNAIFRXXXXXX
Stops: 14
Although there is a warning because your sequences are not a multiple of three 尽管出现警告,因为您的序列不是三的倍数
starting = True
lines = []
with open (i, "r") as myfile:
for line in myfile:
if starting:
starting = False
continue
lines.append(line)
# now you have all the lines except the first one in "lines"
When you go to read the file check to see if the first character of the line is '>'. 当您阅读文件时,请检查该行的第一个字符是否为'>'。 If it is then just read the line and skip it.
如果是这样,则只需阅读该行并跳过它。 Like the comment above mentioned, you may want to check this each time you read a line so you know if you are dealing with a new sequence or the same sequence, as your file may have multiple sequences.
就像上面提到的注释一样,您可能希望在每次读一行时都进行检查,以便知道您是在处理新序列还是相同序列,因为您的文件可能具有多个序列。
This link will give you all of the documentation on reading a file in python. 该链接将为您提供有关读取python文件的所有文档。 https://docs.python.org/2/tutorial/inputoutput.html
https://docs.python.org/2/tutorial/inputoutput.html
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.