
Reading sections from a large text file in Python efficiently

I have a large text file containing several million lines of data. The first column contains position coordinates. I need to create a new file from this original data that contains only specified non-contiguous intervals, selected by position coordinate. I have a second file listing the coordinates of each interval. For instance, my original file is in a format similar to this:

Position   Data1   Data2   Data3  Data4  
55         a       b       c      d
63         a       b       c      d
68         a       b       c      d  
73         a       b       c      d 
75         a       b       c      d
82         a       b       c      d
86         a       b       c      d

Then let's say my file containing the intervals looks something like this...

name1   50   72
name2   78   93

Then I want my new file to look something like this...

Position   Data1   Data2   Data3  Data4  
55         a       b       c      d
63         a       b       c      d
68         a       b       c      d 
82         a       b       c      d
86         a       b       c      d

So far I have written a function that copies the data from the original file within a given interval to my new file. My code is as follows:

def get_block(beg, end):
    # open both files with context managers so they are always closed
    with open(output_table, 'a') as output, open(input_table, 'r') as f:
        for line in f:
            fields = line.strip("\r\n").split("\t")
            position = int(fields[0])
            if position < beg:        # inclusive bounds, matching the interval file
                continue
            elif position > end:
                break                 # positions are sorted; past the interval, stop reading
            else:
                output.write("\t".join(fields) + "\n")

I then create a list containing the pairs of my intervals and then loop through my original file using the above function like this:

# coords = [[start1, stop1], [start2, stop2], [start3, stop3], ...]
for start_p, stop_p in coords:
    get_block(int(start_p), int(stop_p))

This does what I want, but it gets progressively slower as it moves along my coordinate list, because each call re-reads the entire file from the beginning until it reaches the specified start coordinate. Is there a more efficient way of accomplishing this? Is there a way to skip to a specific line instead of reading over every line each time?

Thanks for the suggestions to use pandas. Previously, my original code had been running for about 18 hours and was only halfway finished. Using pandas, it created my desired file in under 5 minutes. For future reference, if anyone else has a similar task, here is the code that I used.

import pandas as pd

data = pd.read_csv(input_table, delimiter="\t")
for start_p, stop_p in coords:
    # keep only rows whose POSITION falls inside the interval (inclusive)
    df = data[(data.POSITION >= int(start_p)) & (data.POSITION <= int(stop_p))]
    df.to_csv(output_table, index=False, sep="\t", header=False, mode='a')
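If the coordinate list itself is long, the repeated filter-and-append can be folded into a single write by OR-ing together one boolean mask per interval and filtering once. A minimal sketch of that idea, using a small in-memory frame with hypothetical data in place of the parsed file:

```python
import pandas as pd

# small frame standing in for the parsed input file (hypothetical data)
data = pd.DataFrame({
    "Position": [55, 63, 68, 73, 75, 82, 86],
    "Data1": list("aaaaaaa"),
})

coords = [[50, 72], [78, 93]]  # [[start, stop], ...], bounds inclusive

# build one combined mask: True wherever the position falls in any interval
mask = pd.Series(False, index=data.index)
for start_p, stop_p in coords:
    mask |= (data["Position"] >= start_p) & (data["Position"] <= stop_p)

selected = data[mask]
# selected.to_csv(output_table, sep="\t", index=False)  # one write instead of many
```

This keeps a single pass over the frame per interval but only touches the output file once, which also avoids the header/append bookkeeping of repeated `to_csv` calls.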

I'd just use the built-in csv module to simplify reading the input. To further speed things up, all the coordinate ranges could be read in at once, which allows the selection to happen in a single pass through the data file.

import csv

# read all coord ranges into memory
with open('ranges', newline='') as ranges:
    range_reader = csv.reader(ranges, delimiter='\t')
    coords = [(int(start), int(stop)) for name, start, stop in range_reader]

# make one pass through the input file and extract the positions specified
with open('output_table', 'w') as outf, open('input_table', newline='') as inf:
    input_reader = csv.reader(inf, delimiter='\t')
    outf.write('\t'.join(next(input_reader)) + '\n')  # copy header row
    for row in input_reader:
        position = int(row[0])
        for start, stop in coords:
            if start <= position <= stop:
                outf.write('\t'.join(row) + '\n')
                break
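Since the positions in the data file are already sorted, the inner loop over every range can also be avoided: sort the intervals and advance a single index through them, so each row is checked against at most one or two ranges rather than the whole list. A sketch of that idea (the `rows` list here is hypothetical stand-in data, not the answer's actual file):

```python
def select_rows(rows, coords):
    """Yield rows whose first field falls in any [start, stop] interval.

    Assumes rows are sorted by position and intervals are inclusive.
    """
    coords = sorted(coords)
    i = 0
    for row in rows:
        pos = int(row[0])
        # discard intervals that end before this position
        while i < len(coords) and coords[i][1] < pos:
            i += 1
        if i == len(coords):
            break  # no intervals left, so stop reading early
        if coords[i][0] <= pos:
            yield row

# hypothetical stand-in for the parsed data file
rows = [[55, 'a'], [63, 'a'], [68, 'a'], [73, 'a'],
        [75, 'a'], [82, 'a'], [86, 'a']]
result = list(select_rows(rows, [(50, 72), (78, 93)]))
```

Because the interval index only moves forward, the whole selection is a single ordered merge of the rows and the ranges, which stays fast even with thousands of intervals.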
