Is it possible to read a file line-by-line in Python while also skipping a given number of lines?
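I am trying to write a Python program that parses an input file and writes the lines of data that match certain conditions to a series of output files.
The program reads an input file containing the start and stop positions of genes on a chromosome. For each line of this input file, it opens a second input file, line by line, containing the positions of known SNPs on the chromosome of interest. If a SNP lies between the start and stop positions of the gene being iterated over, it is copied to a new file.
The problem with my program as it stands is that it is inefficient. For every gene analyzed, the program starts reading the SNP input file from the first line, and it does not stop until it reaches a SNP whose chromosomal position is greater than (i.e., has a higher position number than) the stop position of the gene being iterated over. Since all the gene and SNP data are ordered by chromosomal position, my program would be much faster if, for each gene, I could somehow "tell" it to start reading the SNP file from the last line read in the previous iteration, rather than from the first line of the file.
Is there any way to do this in Python? Or must every file be read from the first line?
My code so far is below. Any suggestions would be greatly appreciated.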
import sys
import fileinput
import shlex

geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')
i=0
for i in range(i,n):
    x=i
    L=shlex.shlex(geneCoordinates[x],posix=True)
    L.whitespace += ','
    L.whitespace_split = True
    L=list(L)
    output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
    geneStart=int(L[2])
    geneStop=int(L[3])
    for line in fileinput.input("SNPs.txt"):
        if not fileinput.isfirstline():
            nSNPs=0
            SNP=shlex.shlex(line,posix=True)
            SNP.whitespace += '\t'
            SNP.whitespace_split = True
            SNP=list(SNP)
            SNPlocation=int(SNP[0])
            if SNPlocation < geneStart:
                continue
            if SNPlocation >= geneStart:
                if SNPlocation <= geneStop:
                    nSNPs=nSNPs+1
                    output.write(str(SNP))
                    output.write("\n")
                else: break
    nSNPsPerGene.write(("%s\t%s") % (str(L[2]),nSNPs))
Just use an iterator (in a scope outside your loop) to keep track of your position in the second file. It should look something like this:
import shlex

geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')
i=0
#NEW CODE - 2 lines added. By opening a file iterator outside of the loop, we can remember our position in it
SNP_file = open("SNPs.txt")
SNP_file.readline() #chomp up the first line, so we don't have to constantly check we're not at the beginning
#end new code.
for i in range(i,n):
    x=i
    L=shlex.shlex(geneCoordinates[x],posix=True)
    L.whitespace += ','
    L.whitespace_split = True
    L=list(L)
    output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
    geneStart=int(L[2])
    geneStop=int(L[3])
    nSNPs=0 #set once per gene, before the loop, so it is always defined when written out below
    #NEW CODE - deleted 2 lines, added 4
    #loop until break
    while True:
        line = SNP_file.readline() #continues from wherever the previous gene stopped
        if not line: #exit loop if end of file reached
            break
        #end new code - the rest of your loop should behave normally
        SNP=shlex.shlex(line,posix=True)
        SNP.whitespace += '\t'
        SNP.whitespace_split = True
        SNP=list(SNP)
        SNPlocation=int(SNP[0])
        if SNPlocation < geneStart:
            continue
        #NEW CODE - 1 line changed
        else: #if SNPlocation >= geneStart:
            #logic dictates that if SNPlocation is not < geneStart, then it MUST be >= geneStart. so ELSE is sufficient
            if SNPlocation <= geneStop:
                nSNPs=nSNPs+1
                output.write(str(SNP))
                output.write("\n")
                #NEW CODE 1 line added- need to exit this loop once we have found a match.
                #NOTE - your old code would return the LAST match. new code returns the FIRST match.
                #assuming there is only 1 match this won't matter... but I'm not sure if that assumption is true.
                break
            else:
                #past the end of this gene: stop here, otherwise the loop would read on to the end of the file.
                #note that this SNP line has already been consumed and will not be seen again by the next gene.
                break
    nSNPsPerGene.write(("%s\t%s") % (str(L[2]),nSNPs))
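If a gene can contain more than one SNP, the same idea (one iterator over the SNP file that lives outside the gene loop) can be combined with a one-line "lookahead", so that a SNP read past the end of one gene is still available to the next gene. The following is only a minimal sketch under stated assumptions, not a drop-in replacement: the function name `snps_per_gene` is made up for illustration, gene lines are assumed to be comma-separated with start and stop in columns 2 and 3 (as in `L[2]` and `L[3]` above), SNP lines are assumed to be tab-separated with the position in column 0 after one header line, and both inputs are assumed to be sorted by position with non-overlapping genes.

```python
import io


def snps_per_gene(gene_lines, snp_lines):
    """Sweep both sorted inputs once; return {gene_start: [SNP positions]}."""
    snps = iter(snp_lines)   # one iterator, shared across all genes
    next(snps, None)         # skip the header line of the SNP input
    pending = None           # SNP position read but not yet assigned to a gene
    hits = {}
    for gene in gene_lines:
        fields = gene.strip().split(',')
        start, stop = int(fields[2]), int(fields[3])  # assumed column layout
        found = []
        while True:
            if pending is None:
                line = next(snps, None)
                if line is None:      # end of SNP input
                    break
                pending = int(line.split('\t')[0])
            if pending > stop:
                break                 # keep it: it may belong to a later gene
            if pending >= start:
                found.append(pending)
            pending = None            # consumed (matched, or before this gene)
        hits[start] = found
    return hits


# File objects work the same way as these in-memory stand-ins,
# because both are iterated line by line.
genes = io.StringIO("chr1,geneA,100,200\nchr1,geneB,300,400\nchr1,geneC,500,600\n")
snps = io.StringIO("pos\tid\n50\trs0\n150\trs1\n180\trs2\n250\trs3\n300\trs4\n450\trs5\n700\trs6\n")
print(snps_per_gene(genes, snps))
# {100: [150, 180], 300: [300], 500: []}
```

Each SNP line is read exactly once, so the total work grows with the number of genes plus the number of SNPs instead of their product. If genes can overlap, a SNP in the overlap would only be credited to the first gene here, so that case would need a different approach.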