
Using Python to search an extremely large text file

I have a large text file of 40 million lines and 3 gigabytes (which probably won't fit in memory), in the following format:

399.4540176 {Some other data}
404.498759292 {Some other data}
408.362737492 {Some other data}
412.832976111 {Some other data}
415.70665675 {Some other data}
419.586515381 {Some other data}
427.316825959 {Some other data}
.......

Each line starts with a number and is followed by some other data. The numbers are in sorted order. I need to be able to:

  1. Given a number x and a range y , find all the lines whose number is within y of x . For example, if x=20 and y=5 , I need to find all lines whose number is between 15 and 25 .
  2. Store these lines in another, separate file.

What would be an efficient way to do this without having to trawl through the entire file?

If you don't want to generate a database of line lengths ahead of time, you can try this:

import os
import sys

# Configuration, change these to suit your needs
maxRowOffset = 100  # increase this if some lines are being missed
fileName = 'longFile.txt'
x = 2000
y = 25

# seek backwards to the first occurrence of character c before the current position
def seekTo(f, c):
    while f.read(1) != c:
        f.seek(-2, 1)

def parseRow(row):
    # the leading numbers are floats, so parse with float(), not int()
    return (float(row.split(None, 1)[0]), row)

minRow = x - y
maxRow = x + y
step = os.path.getsize(fileName) / 2
with open(fileName, 'rb') as f:  # binary mode, so relative seeks work in Python 3
    while True:
        f.seek(int(step), 1)
        seekTo(f, b'\n')
        row = parseRow(f.readline().decode())
        if row[0] < minRow:
            if minRow - row[0] < maxRowOffset:
                with open('outputFile.txt', 'w') as fo:
                    for line in f:
                        row = parseRow(line.decode())
                        if row[0] > maxRow:
                            sys.exit()
                        if row[0] >= minRow:
                            fo.write(row[1])
            else:
                step = abs(step) / 2   # halve the step and keep moving forward
        else:
            step = -abs(step) / 2      # halve the step and move backward

It starts by performing a binary search on the file until it is near (within maxRowOffset of) the row to find. Then it reads every line until it finds one whose number is greater than x-y . That line, and every line after it, is written to an output file until a line is found whose number is greater than x+y , at which point the program exits.

I tested this on a 1,000,000-line file and it runs in 0.05 seconds. Compare this to reading every line, which took 3.8 seconds.

You need random access to the lines, which you won't get with a text file unless the lines are all padded to the same length.
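For completeness, a minimal sketch of the fixed-width idea; the width value here is hypothetical, and it must count the line ending:

```python
WIDTH = 10  # hypothetical fixed line width in bytes, including the newline

def read_line(f, n, width=WIDTH):
    """Return line n (0-based) of a binary file whose lines are all `width` bytes."""
    f.seek(n * width)
    return f.read(width).decode().rstrip()
```

With fixed-width lines, line n lives at byte n * width, so any line is a single seek away.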

One solution is to dump the table into a database (such as SQLite) with two columns: one for the number and one for all the other data (assuming the data is guaranteed to fit within whatever maximum column size your database allows). Then index the number column and you're good to go.
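A minimal sketch of that approach using Python's built-in sqlite3 module; the table, column, and function names are placeholders, and it assumes every line has a number followed by some trailing data:

```python
import sqlite3

def build_db(txt_path, db_path):
    """Load the text file into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE lines (num REAL, rest TEXT)")
    with open(txt_path) as f:
        rows = (line.split(None, 1) for line in f)
        conn.executemany("INSERT INTO lines VALUES (?, ?)",
                         ((float(n), rest) for n, rest in rows))
    conn.execute("CREATE INDEX idx_num ON lines (num)")
    conn.commit()
    return conn

def query_range(conn, x, y, out_path):
    """Write every line whose number is within y of x to out_path."""
    with open(out_path, 'w') as out:
        for num, rest in conn.execute(
                "SELECT num, rest FROM lines WHERE num BETWEEN ? AND ?",
                (x - y, x + y)):
            out.write(f"{num} {rest}")
```

The BETWEEN query uses the index on num, so each lookup is logarithmic rather than a full scan.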

Without a database, you could read through the file once and create an in-memory data structure of (number, line-offset) pairs. You calculate each line-offset by accumulating the length of every row (including its line ending). Now you can binary-search these pairs on the number and randomly access the lines in the file using the offset. If you need to repeat the search later, pickle the in-memory structure and reload it for later reuse.

This reads the entire file (which you said you don't want to do), but does so only once, to build the index. After that you can run as many queries against the file as you want, and they will be very fast.

Note that this second solution essentially creates a database index on your text file.

Rough code to create the index in the second solution:

import pickle

offset = 0
index = []  # a sorted list of [number, byte-offset] pairs

with open(filename, 'rb') as f:  # binary mode, so len(row) equals the bytes consumed
    for row in f:
        nbr = float(row.split(b' ')[0])
        index.append([nbr, offset])
        offset += len(row)  # row already includes its line ending

pickle.dump(index, open('filename.idx', 'wb'))  # saves it for future use

Now you can perform a binary search on the list. There may well be a better data structure for accruing the index values than a list, but I'd have to read up on the various collection types.
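The lookup over such an index can use the standard bisect module; a minimal sketch, assuming the index is a sorted list of [number, offset] pairs as built above:

```python
import bisect

def find_offsets(index, x, y):
    """Return the byte offsets of all lines whose number is within y of x.

    index is a sorted list of [number, byte-offset] pairs.
    """
    lo = bisect.bisect_left(index, [x - y, float('-inf')])  # first number >= x - y
    hi = bisect.bisect_right(index, [x + y, float('inf')])  # first number > x + y
    return [off for _, off in index[lo:hi]]
```

Each returned offset can then be passed to f.seek() followed by f.readline() to fetch the matching line without scanning the file.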

Since you want to match on the first field, you can use gawk :

$ gawk '{if ($1 >= 15 && $1 <= 25) { print }; if ($1 > 25) { exit }}' your_file

Edit: on a file with 261,775,557 lines that is 2.5 GiB in size, searching for lines 50,010,015 to 50,010,025 takes 27 seconds on my Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz . Sounds good enough for me.

To find the line that starts with the number just above your lower limit, you have to go through the file line by line until you find it. There is no other way, i.e. all the data in the file has to be read and parsed for newline characters.

We have to run this search up to the first line that exceeds your upper limit, then stop. Hence, it helps that the file is already sorted. This code will hopefully help:

with open(outpath, 'w') as outfile:  # the output file must be opened for writing
    with open(inpath) as infile:
        for line in infile:
            t = float(line.split()[0])
            if lower_limit <= t <= upper_limit:
                outfile.write(line)
            elif t > upper_limit:
                break

I think that, theoretically, there is no other option.
