Is it possible to read a file line-by-line in Python while also skipping a given number of lines?
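I am trying to write a Python program that parses lines of data from an input file that match certain criteria into a series of output files.
The program reads an input file containing the start and stop positions of genes on a chromosome. For each line of this input file, it opens, line by line, a second input file containing the positions of known SNPs on the chromosome of interest. If a SNP lies between the start and stop positions of the gene being iterated over, it is copied to a new file.
The problem with my program at present is that it is inefficient. For each gene analysed, the program reads the SNP input file from the first line, and it does not stop until it reaches a SNP whose chromosomal position is greater than the stop position of the gene being iterated over. Since all gene and SNP data are sorted by chromosomal position, the program would be much faster if, for each gene, I could somehow "tell" it to start reading the SNP position file from the last line read in the previous iteration, rather than from the first line of the file.
Is there any way to do this in Python? Or must every file be read from the first line?
My code so far is below. Any suggestions would be greatly appreciated.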
import sys
import fileinput
import shlex

geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')
i=0
for i in range(i,n):
    x=i
    L=shlex.shlex(geneCoordinates[x],posix=True)
    L.whitespace += ','
    L.whitespace_split = True
    L=list(L)
    output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
    geneStart=int(L[2])
    geneStop=int(L[3])
    for line in fileinput.input("SNPs.txt"):
        if not fileinput.isfirstline():
            nSNPs=0
            SNP=shlex.shlex(line,posix=True)
            SNP.whitespace += '\t'
            SNP.whitespace_split = True
            SNP=list(SNP)
            SNPlocation=int(SNP[0])
            if SNPlocation < geneStart:
                continue
            if SNPlocation >= geneStart:
                if SNPlocation <= geneStop:
                    nSNPs=nSNPs+1
                    output.write(str(SNP))
                    output.write("\n")
                else: break
    nSNPsPerGene.write(("%s\t%s")%(str(L[2]),nSNPs))
Just use an iterator (in a scope outside the loop) to keep track of your position in the second file. It should look something like this:
import shlex

geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')
i=0
#NEW CODE - 2 lines added. By opening a file iterator outside of the loop, we can remember our position in it
SNP_file = open("SNPs.txt")
SNP_file.readline() #chomp up the first line, so we don't have to constantly check we're not at the beginning
#end new code.
for i in range(i,n):
    x=i
    L=shlex.shlex(geneCoordinates[x],posix=True)
    L.whitespace += ','
    L.whitespace_split = True
    L=list(L)
    output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
    geneStart=int(L[2])
    geneStop=int(L[3])
    nSNPs=0 #reset once per gene, before reading, so the count is always defined
    #NEW CODE - deleted 2 lines, added 4
    #loop until break
    while True:
        line = SNP_file.readline()
        if not line: #exit loop if end of file reached
            break
        #end new code - the rest of your loop should behave normally
        SNP=shlex.shlex(line,posix=True)
        SNP.whitespace += '\t'
        SNP.whitespace_split = True
        SNP=list(SNP)
        SNPlocation=int(SNP[0])
        if SNPlocation < geneStart:
            continue
        #NEW CODE - 1 line changed
        else: #if SNPlocation >= geneStart:
            #logic dictates that if SNPlocation is not < geneStart, then it MUST be >= geneStart. so ELSE is sufficient
            if SNPlocation <= geneStop:
                nSNPs=nSNPs+1
                output.write(str(SNP))
                output.write("\n")
            #NEW CODE 1 line added - need to exit this loop once we have found a match.
            #NOTE - your old code would return the LAST match. new code returns the FIRST match.
            #assuming there is only 1 match this won't matter... but I'm not sure if that assumption is true.
            break
            #NEW CODE - 1 line deleted
            #else: break is no longer required. there are only two possible options.
    nSNPsPerGene.write(("%s\t%s")%(str(L[2]),nSNPs))
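One caveat with the loop above: it leaves the read loop at the first SNP that is not below geneStart, and that line has already been consumed from the file by then. If a gene can contain several SNPs, keeping a single look-ahead line between genes is one way around both issues. The following is only a minimal sketch under stated assumptions: genes are sorted and do not overlap, each SNP line is tab-separated with the position in the first field, and the names `snps_per_gene` and `pending` as well as the sample data are illustrative, not part of the original code.

```python
import io


def snps_per_gene(genes, snp_lines):
    """Yield (gene_start, gene_stop, matching_lines) in one pass over snp_lines.

    Assumptions (not taken from the original post): `genes` is a sorted,
    non-overlapping sequence of (start, stop) tuples, and `snp_lines` is an
    iterable of tab-separated lines sorted by the position in the first field.
    """
    snp_iter = iter(snp_lines)
    pending = next(snp_iter, None)  # look-ahead line not yet assigned to a gene
    for gene_start, gene_stop in genes:
        matches = []
        while pending is not None:
            position = int(pending.split("\t")[0])
            if position > gene_stop:
                break  # keep `pending`; it may belong to a later gene
            if position >= gene_start:
                matches.append(pending.rstrip("\n"))
            pending = next(snp_iter, None)  # advance only after handling the line
        yield gene_start, gene_stop, matches


# Hypothetical sample data standing in for "SNPs.txt" (one header line, then SNPs).
snp_file = io.StringIO("position\tid\n5\trs1\n12\trs2\n15\trs3\n40\trs4\n55\trs5\n")
next(snp_file)  # skip the header line once, outside the gene loop
genes = [(10, 20), (30, 50)]  # hypothetical (start, stop) pairs

results = list(snps_per_gene(genes, snp_file))
for start, stop, matches in results:
    print(start, stop, len(matches))
```

With a real file, `open("SNPs.txt")` can be passed in place of the `io.StringIO` object, since a file object is its own iterator and remembers its position between reads. If genes can overlap, a single forward pass like this is not sufficient, because a SNP consumed for one gene would be missed by the next.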