How to skip reading the first line of an input file in Python
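I am trying to write a script that automatically processes files containing DNA sequences and translates them into protein sequences. The only trouble I have so far is that, for my script to process the contents of a file, I need to somehow skip the first line.
My code looks like this: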
#!/usr/bin/python
### This script will translate in all three frames ###

import glob
from Bio.Seq import *
from Bio.Alphabet import generic_dna

print "Please drag in the directory to be processed: "
folder = raw_input().replace(" ","")
file = glob.glob(str(folder) + "/" + '*.seq')

for i in file:
    with open (i, "r") as myfile:
        ### Need to somehow remove / read over the first line of the input...###
        seq = myfile.read().replace(" ", "").replace("\n", "")
        x = 0
        output = open(i + ".tran", "w+")
        while x < 3:
            cd = Seq(seq[x:], generic_dna)
            qes = seq[::-1]
            cdr = Seq(qes[x:], generic_dna)
            error = cd.translate().count('*')
            reverror = cdr.translate().count('*')
            output.write(str(cd.translate()) + "\nStops: " + str(error) + "\n\n")
            output.write("Reverse\n" + str(cdr.translate()) + "\nStops: " + str(reverror) + "\n\n")
            x += 1
My input file might look like this:
>G2-pBAD-Forward_A11.ab1
NNNNNNNNNNCNNNCNGNNGCTTTTTATCGCAACTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCGA
GCTCGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGATTGTAATGAAACGAGTTATTACCCTGTTTGCTGT
ACTGCTGATGGGCTGGTCGGTAAATGCCTGGTCAAGCTTGGCTGTTTTGGCGGATGAGAGAAGATTTTCAGCCTGATACA
GATTAAATCAGAACGCAGAAGCGGTCTGATAAAACAGAATTTGCCTGGCGGCAGTAGCGCGGTGGTCCCACCTGACCCCA
TGCCGAACTCAGAAGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAGTAGGGAACTGCCAGGCA
TCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCCTGAGTA
GGACAAATCCGCCGGGAGCGGATTTGAACGTTGCGAAGCAACGGCCCGGAGGGTGGCGGGCAGGACGCCCGCCATAAACT
GCCAGGCATCAAATTAAGCAGAAGGCCATCCTGACGGATGGCCTTTTTGCGTTTCTACAAACTCTTTTGTTTATTTTTCT
AAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATG
AGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCT
GGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCAGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCA
ACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGGTTTATT
GCTGATAAATCTGGAGCCGGTGAGCGTGNTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTTCCCGTAT
CGNANTTNNCTACACGAN
What I need is an elegant way to drop the first line, the one containing ">".
You can use SeqIO:
from Bio import SeqIO

with open(i) as myfile:
    for record in SeqIO.parse(myfile, "fasta"):
        print record.id
        print record.seq
record.seq gives you the sequence and record.id gives you the ID. If each file holds only one sequence, or you only want the first one, you can call next:
with open(i) as myfile:
    print(next(SeqIO.parse(myfile, "fasta")).seq)
I don't see any spaces in your input, so I am not sure what the replace calls are for. This prints the sequence as a single string.
Output:
NNNNNNNNNNCNNNCNGNNGCTTTTTATCGCAACTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCGAGCTCGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGATTGTAATGAAACGAGTTATTACCCTGTTTGCTGTACTGCTGATGGGCTGGTCGGTAAATGCCTGGTCAAGCTTGGCTGTTTTGGCGGATGAGAGAAGATTTTCAGCCTGATACAGATTAAATCAGAACGCAGAAGCGGTCTGATAAAACAGAATTTGCCTGGCGGCAGTAGCGCGGTGGTCCCACCTGACCCCATGCCGAACTCAGAAGTGAAACGCCGTAGCGCCGATGGTAGTGTGGGGTCTCCCCATGCGAGAGTAGGGAACTGCCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCCTGAGTAGGACAAATCCGCCGGGAGCGGATTTGAACGTTGCGAAGCAACGGCCCGGAGGGTGGCGGGCAGGACGCCCGCCATAAACTGCCAGGCATCAAATTAAGCAGAAGGCCATCCTGACGGATGGCCTTTTTGCGTTTCTACAAACTCTTTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCAGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGNTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTTCCCGTATCGNANTTNNCTACACGAN
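If Biopython is not available, the header can also be skipped with the standard library alone. The following is a minimal sketch, assuming each file holds exactly one record whose first line is the ">" header. The helper name read_single_sequence and the demo file contents are made up for illustration.

```python
import os
import tempfile


def read_single_sequence(path):
    """Return the sequence from a one-record FASTA-style file as one string."""
    with open(path) as handle:
        # Consume the first line (the ">" header); the default value
        # avoids StopIteration on an empty file.
        next(handle, None)
        # Join the remaining lines, dropping newlines and surrounding whitespace.
        return "".join(line.strip() for line in handle)


# Demo on a throwaway file (contents are illustrative, not real data).
with tempfile.NamedTemporaryFile("w", suffix=".seq", delete=False) as tmp:
    tmp.write(">example header\nACGT\nTTGA\n")
    demo_path = tmp.name

demo_seq = read_single_sequence(demo_path)
os.remove(demo_path)
print(demo_seq)
```

Calling next() on the file object advances past one line, so the loop that follows only ever sees sequence lines. This does not handle files with several records.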
You can also use range instead of a while loop, and pass the alphabet to SeqIO:
for record in SeqIO.parse(myfile, "fasta", generic_dna):
    ...
    for x in range(3):
        ....
This should be closer to what you want:
import glob

from Bio import SeqIO
from Bio.Alphabet import generic_dna

folder = raw_input().replace(" ", "")
files = glob.glob(folder + "/" + '*.seq')

for i in files:
    with open(i) as myfile:
        # Take the first record; SeqIO drops the ">" header line for us.
        seq = next(SeqIO.parse(myfile, "fasta", generic_dna)).seq
    qes = seq[::-1]
    with open("{}.tran".format(i), "w+") as output:
        for x in range(3):
            cd = seq[x:]
            cdr = qes[x:]
            # Count stops in the current frame (cd), not in the unshifted seq.
            error = cd.translate().count('*')
            reverror = cdr.translate().count('*')
            output.write("{}\nstops: {}\n\n".format(cd.translate(), error))
            output.write("Reverse: {}\nStops: {}\n\n".format(cdr.translate(), reverror))
Which outputs:
XXXXXXXFLSQLSTVSPYPFFWASEFELEIILFNFKKEIYI*L**NELLPCLLYC*WAGR*MPGQAWLFWRMREDFQPDTD*IRTQKRSDKTEFAWRQ*RGGPT*PHAELRSETP*RRW*CGVSPCESRELPGIK*NERLSRKTGPFVLSVVCR*TLS*VGQIRRERI*TLRSNGPEGGGQDARHKLPGIKLSRRPS*RMAFLRFYKLFCLFF*IHSNMYPLMRQ*P**MLQ*Y*KRKSMSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGAANY*LANYLL*LPGNN**TGWRRIKLQDHFCARPFRLAGFIADKSGAGERXLAVSLQHWGQMVSPSRIXXXYT
stops: 25
Reverse: XHIXXXYALPEW*TGVTTLLWRSXASGRGLNSRYLGRSAFPARVFTRTLK*AEVGQIINNGPSISFIKRSIIKRRGLTRSRRK*KWSQRPTRFCPSVLRRFFPYSRCAFTTYEYEKEKVIITS*IVPITEYSPMYKLT*IFLFVFSNIFAFFR*AVLPEDELNYGPSNTARRTGGGRPGNEALQV*ARAA*TG*VLSQVAVCCLFCFPGQKADSESKINYGPSRDESVPLWGVMVAAMPQSEDSSRTPVHPGGAMTAVRLRQNSLAKTQD*IRHSPTFRRE*AVLSVRTGP*MAGRVVVMSFVPLLSKVMLVYI*RKNFNLF**SSSLSDRVFLPIPLCHLSTLFFXXXXXX
Stops: 15
XXXXXXAFYRNSLLFLHTRFFGLANSSSK*FCLTLRRRYTYDCNETSYYPVCCTADGLVGKCLVKLGCFGG*EKIFSLIQIKSERRSGLIKQNLPGGSSAVVPPDPMPNSEVKRRSADGSVGSPHARVGNCQASNKTKGSVERLGLSFYLLFVGERSPE*DKSAGSGFERCEATARRVAGRTPAINCQASN*AEGHPDGWPFCVSTNSFVYFSKYIQICIRS*DNNPDKCFNNIEKGRV*VFNISVSPLFPFLRHFAFLFLLTQKRW*K*KMLKISWVQQTINWRTTYSSFPATINRLDGGG*SCRTTSALGPSGWLGLLLINLEPVSVXSRYHCSTGARW*ALPVSXXXTR
stops: 10
Reverse: STSXXAMPFPNGRPGSRRYYGALVRVAEV*IVVIWVGRPSRLASSPGR*NRRR*VR*LTTALRSHSSSGQLSNDVG*LEVVENESGRKDPLVFVLPFYGVFSLIPAVPLQLMSMRRKKL**LRK*SQ*QSTRLCINLHKSFYLFSQTSLRFSGRQSYRKTN*TTDRQIPPAGRAVGGPATKRCKFRRGPPKQDESSRKWLFVVYFAFRVRKLTRKAK*TTDRQGMRAYPSGV*W*PRCRKVKTQAVPQSTLVAR*RRSV*DKIVWRRRKTKLDIVRLLEESRRFCRFELVRKWLVG*SSCRLSHY*AK*C*YTYRGRISICFNKARA*AIGFFCPYLFVISQRYFSXXXXXX
Stops: 20
XXXXXXLFIATLYCFSIPVFLG*RIRARNNFV*L*EGDIHMIVMKRVITLFAVLLMGWSVNAWSSLAVLADERRFSA*YRLNQNAEAV**NRICLAAVARWSHLTPCRTQK*NAVAPMVVWGLPMRE*GTARHQIKRKAQSKDWAFRFICCLSVNALLSRTNPPGADLNVAKQRPGGWRAGRPP*TARHQIKQKAILTDGLFAFLQTLLFIFLNTFKYVSAHETITLINASIILKKEEYEYSTFPCRPYSLFCGILPSCFCSPRNAGESKRC*RSVGCSKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWVYC**IWSR*AXSRGIIAALGPDGKPFPYRXXLHX
stops: 13
Reverse: AHXXXLCPSRMVDRGHDVTMALXCEWPRSK*SLFGSVGLPGSRLHQDVEIGGGRSDN*QRPFDLIHQAVNYQTTWVD*KS*KMKVVAKTHSFLSFRFTAFFPLFPLCLYNL*V*EGKSYNNFVNSPNNRVLAYV*TYINLFICFLKHLCVFPVGSPTGRRIKLRTVKYRPQDGRWEARQRSVASLGEGRLNRMSPLASGCLLSILLSGSES*LGKQNKLRTVKG*ERTPLGCDGSRDAAK*RLKPYPSPPWWRDDGGPFKTK*SGEDARLN*T*SDF*KRVGGFVGSNWSVNGWSGSRHVVCPIIEQSNVSIHIEEEFQFVLIKLELKRSGFFAHTSLSSLNAIFRXXXXXX
Stops: 14
You will get a warning, though, because the length of your sequence is not a multiple of three.
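The warning comes from translating a sequence that ends in a partial codon. One way to avoid it, sketched here on plain strings, is to cut each frame down to a whole number of codons before translating. The helper name trim_to_codons is hypothetical and not part of Biopython, and the short DNA string is illustrative only.

```python
def trim_to_codons(seq, frame):
    """Return seq starting at offset `frame`, cut to a multiple of three."""
    framed = seq[frame:]
    # Drop the 0, 1 or 2 trailing bases that do not form a full codon.
    usable = len(framed) - len(framed) % 3
    return framed[:usable]


dna = "ATGGCCTAAGT"  # 11 bases, illustrative only
frames = [trim_to_codons(dna, x) for x in range(3)]
for x, framed in enumerate(frames):
    print(x, framed, len(framed))
```

The helper relies only on len() and slicing, so the same idea should carry over to a Biopython Seq object, whose slices are themselves Seq objects.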
starting = True
lines = []
with open(i, "r") as myfile:
    for line in myfile:
        if starting:
            starting = False
            continue
        lines.append(line)
# now you have all the lines except the first one in "lines"
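As you read the file, check whether the first character of the line is '>'. If it is, read that line and skip it. As the comment above mentions, you probably want to make this check on every line you read, so that you know whether you are starting a new sequence or continuing the same one, since your file may contain more than one sequence.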
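The per-line check described above can be sketched as a small generator. This is only an illustration under the assumption that every record starts with a ">" line; the function name read_fasta and the example lines are made up, and for real work SeqIO.parse is the more robust choice.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record begins: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence line belonging to the current record.
            # Lines before the first header are silently ignored.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = [">seq1\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]
records = list(read_fasta(example))
print(records)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example read_fasta(myfile) inside the with block.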
This link has the full documentation on reading files in Python: https://docs.python.org/2/tutorial/inputoutput.html