Is it possible to read a file line by line in Python while also skipping a given number of lines?

I am trying to write a program in Python that parses lines of data matching certain criteria from an input file into a series of output files.

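The program reads an input file containing the start and stop positions of genes on a chromosome. For each line of this input file, it opens a second input file and reads it line by line; that file contains the positions of known SNPs on the chromosome of interest. If a SNP lies between the start and stop positions of the gene being iterated over, it is copied to a new file.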

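The problem with my program at present is that it is inefficient. For each gene being analysed, the program starts reading the SNP input file from the first line, and it does not stop until it reaches a SNP at a chromosomal position greater than (i.e. with a higher position number than) the stop position of the gene being iterated over. Since all gene and SNP data are ordered by chromosomal position, the speed and efficiency of my program would improve greatly if, for each gene, I could somehow "tell" the program to start reading the SNP file from the last line read in the previous iteration, rather than from the first line of the file.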

Is there any way to do this in Python? Or must every file be read from the first line?

My code so far is below. Any suggestions would be greatly appreciated.

import sys
import fileinput
import shlex
geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

i=0
for i in range(i,n):
    x=i
    L=shlex.shlex(geneCoordinates[x],posix=True)
    L.whitespace += ','
    L.whitespace_split = True
    L=list(L)
    output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
    geneStart=int(L[2])
    geneStop=int(L[3])
    for line in fileinput.input("SNPs.txt"):
        if not fileinput.isfirstline():
            nSNPs=0
            SNP=shlex.shlex(line,posix=True)
            SNP.whitespace += '\t'
            SNP.whitespace_split = True
            SNP=list(SNP)
            SNPlocation=int(SNP[0])
            if SNPlocation < geneStart:
                continue
            if SNPlocation >= geneStart:
                if SNPlocation <= geneStop:
                    nSNPs=nSNPs+1
                    output.write(str(SNP))
                    output.write("\n")
            else: break
    nSNPsPerGene.write(("%s\t%s")%(str(L[2]),nSNPs))

Just use an iterator (in a scope outside the loop) to keep track of your position in the second file. It should look something like this:

import shlex

geneCoordinates = open("Gene Coordinates.txt", 'r')
geneCoordinates = list(geneCoordinates)
n = len(geneCoordinates)
nSNPsPerGene = open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

# NEW CODE - open the SNP file once, outside the gene loop, so that the
# file object remembers our position in it from one gene to the next
SNP_file = open("SNPs.txt")
SNP_file.readline()         # consume the header line once, so we never have to check for it again
line = SNP_file.readline()  # the current SNP line, read but not yet handled
# end new code

for i in range(n):
    L = shlex.shlex(geneCoordinates[i], posix=True)
    L.whitespace += ','
    L.whitespace_split = True
    L = list(L)
    output = open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt" % L[2], 'a')
    geneStart = int(L[2])
    geneStop = int(L[3])
    nSNPs = 0  # reset once per gene, not once per SNP line

    # NEW CODE - resume from the current SNP line instead of the top of the file.
    # This assumes genes are sorted and do not overlap; a SNP shared by two
    # overlapping genes would only be written for the first one.
    while line:  # readline() returns an empty string at end of file
        SNP = shlex.shlex(line, posix=True)
        SNP.whitespace += '\t'
        SNP.whitespace_split = True
        SNP = list(SNP)
        SNPlocation = int(SNP[0])
        if SNPlocation > geneStop:
            # Past the end of this gene: stop WITHOUT reading further,
            # so the next gene starts from this same line
            break
        if SNPlocation >= geneStart:
            nSNPs = nSNPs + 1
            output.write(str(SNP))
            output.write("\n")
        # SNPs before geneStart are simply skipped; either way, move on one line
        line = SNP_file.readline()
    # end new code

    output.close()
    nSNPsPerGene.write("%s\t%s\n" % (L[2], nSNPs))  # one line per gene

SNP_file.close()
nSNPsPerGene.close()
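
The same idea can be tried out without touching any real files. The following is only a minimal sketch, not the asker's actual pipeline: `snps_per_gene` is a hypothetical helper, the gene list and SNP data are made up, and it assumes sorted, non-overlapping genes and tab-separated SNP lines whose first field is the position. `itertools.islice` handles the "skip a given number of lines" part, and a one-item lookahead keeps the SNP that ended one gene available for the next.

```python
import io
from itertools import islice


def snps_per_gene(genes, snp_lines, skip=1):
    """Yield (gene_start, [SNP positions inside the gene]) for each gene.

    Hypothetical helper: `genes` is a list of (start, stop) pairs, sorted
    and non-overlapping; each SNP line starts with an integer position
    followed by a tab; `skip` is the number of header lines to drop.
    """
    # islice drops the first `skip` lines, then passes the rest through lazily.
    positions = (int(line.split("\t")[0])
                 for line in islice(snp_lines, skip, None))
    current = next(positions, None)  # one-item lookahead shared by all genes
    for start, stop in genes:
        hits = []
        # Only advance while the SNP is at or before the end of this gene;
        # a SNP beyond `stop` stays in `current` for the next gene.
        while current is not None and current <= stop:
            if current >= start:
                hits.append(current)
            current = next(positions, None)
        yield start, hits


# In-memory stand-in for SNPs.txt (made-up data, for illustration only).
snp_file = io.StringIO("position\tid\n5\ta\n12\tb\n18\tc\n40\td\n55\te\n")
genes = [(10, 20), (30, 45), (50, 60)]
result = list(snps_per_gene(genes, snp_file))
print(result)  # [(10, [12, 18]), (30, [40]), (50, [55])]
```

Because the SNP iterator is created once and shared across genes, each SNP line is read exactly once, so the total work grows with the number of genes plus the number of SNPs rather than with their product.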
