Efficiently reading parts of a large text file in Python

Question

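I have a large text file containing several million lines of data. The first column contains position coordinates. I need to create another file from this raw data that contains only specified, non-contiguous intervals based on the position coordinates. I have a second file that contains the coordinates of each interval. For example, my original file has a format similar to this: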

Position   Data1   Data2   Data3  Data4  
55         a       b       c      d
63         a       b       c      d
68         a       b       c      d  
73         a       b       c      d 
75         a       b       c      d
82         a       b       c      d
86         a       b       c      d

Then say my intervals file looks like this...

name1   50   72
name2   78   93

Then I want my new file to look like this...

Position   Data1   Data2   Data3  Data4  
55         a       b       c      d
63         a       b       c      d
68         a       b       c      d 
82         a       b       c      d
86         a       b       c      d

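So far I have written a function that writes the data from the original file that falls within a specific interval to the new file. My code is as follows: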

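def get_block(beg,end):
   # Append every row whose position lies strictly between beg and end.
   # input_table and output_table are file paths defined elsewhere.
   with open(output_table,'a') as output, open(input_table,'r') as f:
      next(f)  # skip the header row (assumes one, as in the example file)
      for line in f:
         fields=line.strip("\r\n").split("\t")
         position=int(fields[0])
         if position<=beg:
            continue
         elif position>=end:
            break  # positions are sorted, so nothing further can match
         output.write("\t".join(fields)+"\n")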

Then I create a list containing my interval pairs and loop over the original file with the function above, like this:

#coords=[[start1,stop1],[start2,stop2],[start3,stop3]..etc]
for i in coords:
   start_p=int(i[0]) ; stop_p=int(i[1])
   get_block(start_p,stop_p)

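This does what I want, but it gets progressively slower as it moves through my coordinate list, because every pass through the loop has to read the file from the beginning until it reaches the specified start coordinate. Is there a more efficient way to do this? Is there a way to jump to a specific line each time instead of reading line by line?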

Answer 1

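Thanks for the suggestion to use pandas. Previously, my original code had been running for about 18 hours and was only half done. Using pandas, it created the file I wanted in 5 minutes. For future reference, in case anyone else has a similar task, this is the code I used.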

import pandas as pd

# Load the whole tab-separated table into memory once.
data=pd.read_csv(input_table,delimiter="\t")
for i in coords:
   start_p=int(i[0]);stop_p=int(i[1])
   # Keep rows inside [start_p, stop_p]; the column name must match the file's header.
   df=data[(data["Position"]>=start_p)&(data["Position"]<=stop_p)]
   df.to_csv(output_table,index=False,sep="\t",header=False,mode='a')
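
As a possible variation on the loop above: if the position column is sorted in ascending order (as it is in the example data, but this is an assumption about the real file), each interval can be located by binary search instead of comparing every row. The sketch below uses numpy.searchsorted for that; the helper name select_intervals and the inline sample data are made up for illustration.

```python
import io

import numpy as np
import pandas as pd


def select_intervals(data, coords, column="Position"):
    """Return the rows whose position lies in any inclusive [start, stop] interval.

    Assumes data[column] is sorted in ascending order.
    """
    positions = data[column].to_numpy()
    pieces = []
    for start, stop in coords:
        # First row with position >= start.
        lo = np.searchsorted(positions, start, side="left")
        # One past the last row with position <= stop.
        hi = np.searchsorted(positions, stop, side="right")
        pieces.append(data.iloc[lo:hi])
    # pd.concat raises on an empty list, so fall back to an empty slice.
    return pd.concat(pieces) if pieces else data.iloc[0:0]


# Small in-memory stand-in for the real tab-separated input file.
sample = io.StringIO(
    "Position\tData1\n55\ta\n63\ta\n68\ta\n73\ta\n75\ta\n82\ta\n86\ta\n"
)
data = pd.read_csv(sample, sep="\t")
result = select_intervals(data, [(50, 72), (78, 93)])
print(result["Position"].tolist())  # [55, 63, 68, 82, 86]
```

Collecting the slices and writing them with a single to_csv call would also avoid reopening the output file once per interval.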

Answer 2

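I would just use the built-in csv module to simplify reading the input. To speed things up further, all the coordinate ranges can be read in at once, which allows the selection to be done in a single pass through the data file.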

import csv

# read all coord ranges into memory
with open('ranges', newline='') as ranges:
    range_reader = csv.reader(ranges, delimiter='\t')
    coords = [(int(start), int(stop)) for name, start, stop in range_reader]

# make one pass through input file and extract positions specified
with open('output_table', 'w') as outf, open('input_table', newline='') as inf:
    input_reader = csv.reader(inf, delimiter='\t')
    outf.write('\t'.join(next(input_reader))+'\n')  # copy header row
    for row in input_reader:
        for start, stop in coords:
            if start <= int(row[0]) <= stop:
                outf.write('\t'.join(row)+'\n')
                break  # row is written once; skip the remaining ranges
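
The inner loop above checks every range for every row. If the rows are sorted by position and the ranges do not overlap (both are assumptions, though they hold for the example in the question), a single index into the sorted ranges can be advanced alongside the rows instead. A minimal sketch of that idea follows; the function name filter_sorted and the inline sample data are hypothetical.

```python
import csv
import io


def filter_sorted(rows, coords):
    """Yield rows whose first field lies in any inclusive [start, stop] range.

    Assumes rows are sorted by position and the ranges do not overlap.
    """
    intervals = sorted(coords)
    idx = 0
    for row in rows:
        position = int(row[0])
        # Move past ranges that end before the current position.
        while idx < len(intervals) and intervals[idx][1] < position:
            idx += 1
        if idx == len(intervals):
            break  # past the last range, so nothing more can match
        if intervals[idx][0] <= position:
            yield row


# Small in-memory stand-in for the real tab-separated input file.
sample = "Position\tData1\n55\ta\n63\ta\n68\ta\n73\ta\n75\ta\n82\ta\n86\ta\n"
reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)  # keep the header row separate from the data rows
kept = [row[0] for row in filter_sorted(reader, [(50, 72), (78, 93)])]
print(kept)  # ['55', '63', '68', '82', '86']
```

Because the function is a generator, it can be fed the csv reader for the real file directly and its rows written out as they are produced, so the whole table never has to sit in memory.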

Answer 1: 0, 2013-05-22 17:02:17
Answer 2: 0, 2013-05-22 17:11:33
