
How to jump to a particular line in a huge text file?

Are there any alternatives to the code below:

startFromLine = 141978 # or whatever line I need to jump to

urlsfile = open(filename, "rb", 0)

linesCounter = 1

for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)

    linesCounter += 1

What if I'm processing a huge text file (~15 MB) with lines of unknown, varying length, and need to jump to a particular line whose number I know in advance? I feel bad processing the lines one by one when I know I could skip at least the first half of the file. I'm looking for a more elegant solution, if there is one.

You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:

# Read in the file once and build a list of line offsets
# (`file` is an already-open file object; open it in binary mode so len(line) counts bytes)
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])

linecache:

The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...
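
For example, a minimal sketch of using it here (filename and the line number are just the question's values; getline is 1-based and the whole file gets cached in memory):

import linecache

line = linecache.getline(filename, 141978)  # 1-based line number; returns '' if out of range
DoSomethingWithThisLine(line)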

You don't really have that many options if the lines are of different length... you sadly need to process the line ending characters to know when you've progressed to the next line.

You can, however, dramatically speed this up and reduce memory usage by changing the last parameter to open() (the buffering argument) to something other than 0.

0 means the file reading operation is unbuffered, which is very slow and disk intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8k, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only reads a bit at a time, discarding each buffered chunk after it's processed.
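
For example, a sketch of the question's loop with a buffer (the 8192-byte size is just one reasonable choice):

# Buffered reads: Python fetches ~8 KiB from disk at a time instead of doing unbuffered I/O.
startFromLine = 141978
linesCounter = 1
with open(filename, "rb", 8192) as urlsfile:
    for line in urlsfile:
        if linesCounter > startFromLine:
            DoSomethingWithThisLine(line)
        linesCounter += 1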

I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.
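
A minimal sketch of that, using the question's filename (indexing into the list is 0-based):

with open(filename) as f:
    lines = f.readlines()               # the whole ~15 MB file, held in memory
DoSomethingWithThisLine(lines[141978])  # 0-based index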

I am surprised no one mentioned islice.

import itertools
line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line

or if you want the whole rest of the file

rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
for line in rest_of_file:
    print(line)

or if you want every other line of the rest of the file

rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
for odd_line in rest_of_file:
    print(odd_line)

Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge, then you might want to use a generator-based approach:

from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r"), 141978):
    DoSomethingWithThisLine(line)

Note: the index is zero-based in this approach.

If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.

Of course, it all depends on what you're trying to do and how often you will jump around in the file.

For instance, if you're going to be jumping to lines many times in the same file, and you know that the file does not change while working with it, you can do this:
First, pass through the whole file and record the "seek location" of some key line numbers (say, every 1000 lines).
Then, if you want line 12005, jump to the position of 12000 (which you've recorded), read 5 lines, and you'll know you're at line 12005, and so on.
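
A rough sketch of that checkpoint idea (the 1000-line interval and the names checkpoints and get_line are illustrative, not from the original answer; line numbers are 0-based here):

CHECKPOINT_INTERVAL = 1000
checkpoints = {0: 0}                   # line number -> byte offset of that line's start
f = open(filename, "rb")
lineno = 0
while f.readline():
    lineno += 1
    if lineno % CHECKPOINT_INTERVAL == 0:
        checkpoints[lineno] = f.tell()

def get_line(f, target):
    base = (target // CHECKPOINT_INTERVAL) * CHECKPOINT_INTERVAL
    f.seek(checkpoints[base])          # jump to the nearest recorded checkpoint
    for _ in range(target - base):     # then read forward the remaining few lines
        f.readline()
    return f.readline()

line = get_line(f, 12005)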

I have had the same problem (needing to retrieve a specific line from a huge file).

Of course, I could run through all the records in the file every time and stop when the counter equals the target line, but that does not work effectively when you want to obtain several specific rows. That leaves the main issue to be resolved: how to get directly to the necessary place in the file.

I came up with the following solution: first, I fill a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).

t = open(file, 'r')   # `file` holds the path to the text file
dict_pos = {}

kolvo = 0
length = 0
for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1

Finally, the lookup function:

def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line

t.seek(dict_pos.get(line_number)) moves the file pointer to the byte at which the line begins, so if you then call readline() you obtain your target line.

Using this approach I have saved a significant amount of time.

What generates the file you want to process? If it is something under your control, you could generate an index (which line is at which position) at the time the file is appended to. The index file can have a fixed line size (space-padded or zero-padded numbers) and will definitely be smaller, and thus can be read and processed quickly.

  • Which line do you want?
  • Calculate the byte offset of the corresponding line number in the index file (possible because the line size of the index file is constant).
  • Use seek or whatever to jump directly to that line in the index file.
  • Parse it to get the byte offset of the corresponding line in the actual file (a rough sketch of this scheme follows below).
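
A hypothetical sketch of such an index file, assuming each record is a zero-padded 10-digit byte offset plus a newline (build_index and read_line are illustrative names, not from the original answer):

RECORD_SIZE = 11  # 10 digits + newline

def build_index(data_path, index_path):
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        offset = 0
        for line in data:
            index.write(b"%010d\n" % offset)  # byte offset where this line starts
            offset += len(line)

def read_line(data_path, index_path, lineno):  # lineno is 0-based
    with open(index_path, "rb") as index:
        index.seek(lineno * RECORD_SIZE)       # fixed-size records make this a direct jump
        offset = int(index.read(RECORD_SIZE))
    with open(data_path, "rb") as data:
        data.seek(offset)
        return data.readline()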

You may use mmap to find the offset of the lines. mmap seems to be the fastest way to process a file.

Example:

import mmap

with open('input_file', "r+b") as f:
    # prot=mmap.PROT_READ is Unix-only; on Windows use access=mmap.ACCESS_READ instead
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    i = 1                                    # 1-based line counter
    for line in iter(mapped.readline, b""):  # mmap.readline returns bytes
        if i == Line_I_want_to_jump:
            offsets = mapped.tell()
        i += 1

Then use f.seek(offsets) to move to the line you need.

None of the answers are particularly satisfactory, so here's a small snippet to help.

class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list() # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.  
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()

Example usage:

In: !cat /tmp/test.txt

Out:
Line zero.
Line one!

Line three.
End of file, line four.

In:
with open("/tmp/test.txt", 'rt') as fin:
    seeker = LineSeekableFile(fin)    
    print(seeker[1])
Out:
Line one!

This involves doing a lot of file seeks, but is useful for the cases where you can't fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn't keep it all in memory), and then each access does a file seek after the fact.

I offer the snippet above under the MIT or Apache license at the discretion of the user.

If you know in advance the position in the file (rather than the line number), you can use file.seek() to go to that position.

Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as Python itself might want to do to print a traceback), but not good for a 15 MB file.

If you're dealing with a text file on a Linux system, you could use Linux commands.
For me, this worked well!

import commands  # Python 2 only; removed in Python 3

def read_line(path, line=1):
    return commands.getoutput('head -%s %s | tail -1' % (line, path))

line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)

Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.
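
A rough sketch of that binary search (find_line is a hypothetical helper; it assumes the file is opened in binary mode and each line starts with a strictly increasing decimal index followed by a colon):

def find_line(f, target):
    f.seek(0, 2)                     # seek to the end to learn the file size
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()                 # discard the (probably partial) line we landed in
        pos = f.tell()
        line = f.readline()
        if not line:                 # ran past the last line: search the lower half
            hi = mid
            continue
        index = int(line.split(b":", 1)[0])
        if index < target:
            lo = pos                 # the target line starts after this one
        elif index > target:
            hi = mid
        else:
            return line
    # lo is now a line start at or before the target; finish with a short forward scan
    f.seek(lo)
    for line in f:
        if int(line.split(b":", 1)[0]) == target:
            return line
    return None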

Otherwise, the best you can do is just readlines(). If you don't want to read all 15 MB, you can use the sizehint argument to at least replace a lot of readline() calls with a smaller number of calls to readlines().

Here's an example using 'readlines(sizehint)' to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.

def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while True:
        lines = f.readlines(100000)   # read roughly 100 kB worth of lines at a time
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno-lines_read-1]
        lines_read += len(lines)

print(getlineno("nci_09425001_09450000.smi", 12000))

@george brilliantly suggested mmap, which presumably uses the syscall mmap. Here's another rendition.

import mmap

LINE = 2  # your desired line

with open('data.txt','rb') as i_file, mmap.mmap(i_file.fileno(), length=0, prot=mmap.PROT_READ) as data:
  for i,line in enumerate(iter(data.readline, '')):
    if i!=LINE: continue
    pos = data.tell() - len(line)
    break

  # optionally copy data to `chunk`
  i_file.seek(pos)
  chunk = i_file.read(len(line))

print(f'line {i}')
print(f'byte {pos}')
print(f'data {line}')
print(f'data {chunk}')

You can use this function to return line n:

def skipton(infile, n):
    with open(infile, 'r') as fi:
        for i in range(n-1):
            next(fi)      # skip the first n-1 lines
        return next(fi)   # return line n (1-based)
