
Most efficient way to modify the last line of a large text file in Python

I need to update the last line of a few files, each a bit over 2 GB and made up of lines of text, that cannot be read with readlines(). Currently, it works fine by looping through the file line by line. However, I am wondering whether there is any compiled library that can achieve this more efficiently? Thanks!

Current approach

    myfile = open("large.XML")
    for line in myfile:
        do_something()

If this really is something line-based (where a true XML parser isn't necessarily the best solution), mmap can help here.

mmap the file, then call .rfind('\n') on the resulting object (possibly with adjustments to handle the file ending with a newline, when you really want the non-empty line before it, not the empty "line" following it). You can then slice out the final line alone. If you need to modify the file in place, you can resize the file to shave off (or add) the number of bytes corresponding to the difference between the line you sliced out and the new line, then write back the new line. This avoids reading or writing any more of the file than you need.

Example code (please comment if I made a mistake):

import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
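
For a quick test, a placeholder do_something along these lines (purely illustrative and not part of the approach itself; the real transformation depends entirely on what your last line needs to become) makes the snippets in this answer runnable end to end:

def do_something(line):
    # Hypothetical stand-in: tag the final line with a trailing comment.
    # Replace this with whatever actually computes your new last line.
    return line + "<!-- updated -->"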

Apparently on some systems without mremap (e.g. OSX), mm.resize won't work, so to support those systems you'd probably split the with (so the mmap closes before the file object) and use file-object-based seeks, writes and truncates to fix up the file. The following example includes my previously mentioned Python 3.1-and-earlier adjustment to use contextlib.closing, for completeness:

import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess

The advantages of mmap over any other approach are:

  1. No need to read any more of the file beyond the line itself (meaning 1-2 pages of the file; the rest never gets read or written)
  2. Using rfind means you can let Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads of a file object could match the "only read a page or so" behavior, but you'd have to hand-implement the search for the newline

Caveat: This approach will not work (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped) if you're on a 32 bit system and the file is too large to map into memory. On most 32 bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases, you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mapping, etc.). mmap doesn't require much physical RAM, but it needs contiguous address space; one of the huge benefits of a 64 bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle without added complexity on a 32 bit OS. Most modern computers are 64 bit at this point, but it's definitely something to keep in mind if you're targeting 32 bit systems (and on Windows, even if the OS is 64 bit, a 32 bit version of Python may have been installed by mistake, so the same problems apply). Here's yet one more example that works (assuming the last line isn't 100+ MB long) on 32 bit Python even for huge files (omitting closing and imports for brevity):

with open("large.XML", 'r+b') as myfile:
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
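
And as a quick sanity check for any of the variants above (a hypothetical helper, not part of the answer itself), you can reread just the tail of the file afterwards and confirm the final line is what you expected:

def read_last_line(path, blocksize=1 << 16):
    # Read only the last blocksize bytes and return the final non-empty line
    with open(path, 'rb') as f:
        filesize = f.seek(0, 2)
        f.seek(max(0, filesize - blocksize))
        tail = f.read()
    lines = tail.rstrip(b'\r\n').splitlines()
    return lines[-1] if lines else b''

print(read_last_line("large.XML").decode('utf-8'))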

Update: Use ShadowRanger's answer. It's much shorter and more robust.

For posterity:

Read the last N bytes of the file and search backwards for the newline.

#!/usr/bin/env python

with open("test.txt", "wb") as testfile:
    testfile.write(b'\n'.join([b"one", b"two", b"three"]) + b'\n')

with open("test.txt", "r+b") as myfile:
    # Read the last 1 KiB of the file
    # we could make this dynamic, but chances are there's
    # a number like 1 KiB that'll work 100% of the time for you
    myfile.seek(0, 2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # search backwards for a newline (excluding the very last byte
    # in case the file ends with a newline)
    index = myfile.read().rindex(b'\n', 0, blocksize - 1)
    # seek to the character just after the newline
    myfile.seek(index + 1 - blocksize, 2)
    # read in the last line of the file
    lastline = myfile.read()
    # modify lastline
    lastline = b"Brand New Line!\n"
    # seek back to the start of the last line
    myfile.seek(index + 1 - blocksize, 2)
    # write out the new version of the last line
    myfile.write(lastline)
    myfile.truncate()
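
To confirm the rewrite did what you expect (a small check, not part of the original answer), print the file back out afterwards:

with open("test.txt", "rb") as myfile:
    print(myfile.read().decode("ascii"))

# prints:
# one
# two
# Brand New Line!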
