
Reading sections from a large text file in Python efficiently

I have a large text file containing several million lines of data. The first column contains position coordinates. I need to create another file from this original data that contains only the rows falling within specified non-contiguous intervals of the position coordinate. I have another file containing the coordinates for each interval. For instance, my original file is in a format similar to this:

Position   Data1   Data2   Data3  Data4  
55         a       b       c      d
63         a       b       c      d
68         a       b       c      d  
73         a       b       c      d 
75         a       b       c      d
82         a       b       c      d
86         a       b       c      d

Then let's say the file containing my intervals looks something like this:

name1   50   72
name2   78   93

Then I want my new file to look something like this:

Position   Data1   Data2   Data3  Data4  
55         a       b       c      d
63         a       b       c      d
68         a       b       c      d 
82         a       b       c      d
86         a       b       c      d

So far I have created a function that writes the rows of the original file that fall within a specific interval to my new file. My code is as follows:

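def get_block(beg, end):
    # Append the rows whose position lies strictly between beg and end.
    # Relies on input_table being sorted by position so the scan can stop early.
    with open(output_table, 'a') as output, open(input_table, 'r') as f:
        next(f, None)  # skip the header row, which has no integer position
        for line in f:
            fields = line.strip("\r\n").split("\t")
            position = int(fields[0])
            if position <= beg:
                continue
            elif position >= end:
                break
            else:
                output.write("\t".join(fields) + "\n")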

I then create a list containing my interval pairs and loop over it, calling the above function once per interval:

#coords=[[start1,stop1],[start2,stop2],[start3,stop3]..etc]
for i in coords:
    start_p = int(i[0])
    stop_p = int(i[1])
    get_block(start_p, stop_p)

This does what I want, but it gets progressively slower as it moves along my coordinate list, because each call has to read the file from the beginning until it reaches the specified start coordinate. Is there a more efficient way of accomplishing this? Is there a way to skip to a specific line each time instead of reading over every line?
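On the narrower question of skipping to a specific line: one possible approach, sketched below with only the standard library, is to scan the file once, remember the byte offset of every row, and then use bisect plus seek() to jump straight to the first row of each interval. The names build_index and extract_intervals are made up for this illustration. The sketch assumes a tab-separated file with one header row, positions in ascending order, and inclusive interval bounds (as in the expected output above).

```python
import bisect


def build_index(path):
    """Scan the file once; return the positions and the byte offset of each data row."""
    positions, offsets = [], []
    with open(path, "rb") as f:
        offset = len(f.readline())  # skip the header row, but count its bytes
        for line in f:
            if line.strip():
                positions.append(int(line.split(b"\t", 1)[0]))
                offsets.append(offset)
            offset += len(line)
    return positions, offsets


def extract_intervals(input_path, output_path, coords):
    """Copy the header plus every row with beg <= position <= end for each (beg, end)."""
    positions, offsets = build_index(input_path)
    with open(input_path, "rb") as src, open(output_path, "wb") as out:
        out.write(src.readline())  # copy the header row
        for beg, end in coords:
            first = bisect.bisect_left(positions, beg)  # first row with position >= beg
            if first == len(positions):
                continue  # interval starts after the last row
            src.seek(offsets[first])  # jump directly to that row
            for line in src:
                if not line.strip():
                    continue
                if int(line.split(b"\t", 1)[0]) > end:
                    break
                out.write(line)
```

The index costs one full pass and two lists of integers, after which each interval only reads the rows it actually needs. Whether this beats a single filtered pass depends on how much of the file the intervals cover.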

Thanks for the suggestions to use pandas. Previously, my original code had been running for about 18 hours and was only halfway finished. Using pandas, it created my desired file in under 5 minutes. For future reference, and in case anyone else has a similar task, here is the code that I used.

import pandas as pd

data = pd.read_csv(input_table, sep="\t")
for i in coords:
    start_p = int(i[0])
    stop_p = int(i[1])
    # POSITION must match the name of the position column in the file's header
    df = data[(data.POSITION >= start_p) & (data.POSITION <= stop_p)]
    df.to_csv(output_table, index=False, sep="\t", header=False, mode='a')

I'd just use the built-in csv module to simplify reading the input. To further speed things up, all the coordinate ranges could be read in at once, which would allow the selection to be done in a single pass through the data file.

import csv

# read all coord ranges into memory
with open('ranges', newline='') as ranges:
    range_reader = csv.reader(ranges, delimiter='\t')
    coords = [(int(start), int(stop)) for name, start, stop in range_reader]

# make one pass through input file and extract positions specified
with open('output_table', 'w') as outf, open('input_table', newline='') as inf:
    input_reader = csv.reader(inf, delimiter='\t')
    outf.write('\t'.join(next(input_reader))+'\n')  # copy header row
    for row in input_reader:
        for coord in coords:
            if coord[0] <= int(row[0]) <= coord[1]:
                outf.write('\t'.join(row)+'\n')
                break
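The loop above tests each row against every range, which may get slow with a long list of intervals. If the positions in the data file are in ascending order (the question's early `break` suggests they are, but that is an assumption), the ranges can be sorted and walked in step with the rows instead. A rough sketch, where filter_sorted is a name invented for this example and the three paths are passed in as parameters:

```python
import csv


def filter_sorted(input_path, ranges_path, output_path):
    """Single pass over sorted rows, advancing through the sorted ranges alongside."""
    with open(ranges_path, newline="") as rf:
        coords = sorted(
            (int(start), int(stop))
            for _name, start, stop in csv.reader(rf, delimiter="\t")
        )
    with open(input_path, newline="") as inf, \
         open(output_path, "w", newline="") as outf:
        reader = csv.reader(inf, delimiter="\t")
        writer = csv.writer(outf, delimiter="\t", lineterminator="\n")
        writer.writerow(next(reader))  # copy header row
        i = 0  # index of the first range that could still match
        for row in reader:
            if not row:
                continue  # ignore blank lines
            pos = int(row[0])
            # ranges that end before this position can never match again
            while i < len(coords) and coords[i][1] < pos:
                i += 1
            if i == len(coords):
                break  # no ranges left, so the rest of the file can be skipped
            if coords[i][0] <= pos:
                writer.writerow(row)
```

Each row is compared against at most a couple of ranges, so the work grows with the number of rows plus the number of ranges rather than with their product.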
