Most efficient way to modify the last line of a large text file in Python

Question

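I need to update the last line of files larger than 2 GB. The files consist of lines of text and are too big to read with readlines(). Currently it works by looping through the file line by line. However, I am wondering whether there is a compiled library that can achieve this more efficiently? Thanks!

Current approach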

    myfile = open("large.XML")
    for line in myfile:
        do_something()

Answer 1

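If this is really something that is line based (in which case a true XML parser isn't necessarily the best solution), mmap can help here.

mmap the file, then call .rfind(b'\n') on the resulting object (possibly with adjustments to handle a file that ends with a newline, when you really want the non-empty line before it rather than the empty "line" after it). You can then slice out the last line alone. If you need to modify the file in place, you can resize the file to shave off (or add) the number of bytes corresponding to the difference between the line you sliced and the new line, then write back the new line. This avoids reading or writing any more of the file than you need.

Example code (please comment if I made a mistake):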

import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
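
The index arithmetic in `startofline` is easy to get wrong, so here is a small sketch that applies the same `rfind` call to plain `bytes` objects (which accept the same start/end arguments as an mmap). The helper name `last_line_start` and the sample inputs are made up for illustration; they are not part of the code above.

```python
def last_line_start(data: bytes) -> int:
    """Offset of the first byte of the last line in data (hypothetical helper)."""
    # Searching only data[0:len(data) - 1] skips a newline in the final byte.
    # rfind returns -1 when nothing is found, so the + 1 yields offset 0 for
    # a one-line input.
    return data.rfind(b'\n', 0, len(data) - 1) + 1


samples = [
    b"one\ntwo\nthree\n",  # ends with a newline
    b"one\ntwo\nthree",    # no trailing newline
    b"only\n",             # single line
]
for data in samples:
    start = last_line_start(data)
    # Prints 8 b'three', 8 b'three', 0 b'only'
    print(start, data[start:].rstrip(b'\r\n'))
```

In all three cases the slice from the computed offset to the end is exactly the last non-empty line, which is why the main example needs no special casing for a trailing newline or a one-line file.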

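Apparently, on some systems without mremap (e.g. OSX), mm.resize won't work. To support those systems, you'd probably split the with statement (so the mmap closes before the file object does) and use file-object based seeks, writes and truncation to fix up the file. The following example includes the Python 3.1-and-earlier adjustment I mentioned above, using contextlib.closing, for completeness: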

import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
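
For reuse, the portable variant could be wrapped in a function along these lines. This is only a sketch under a few assumptions: `replace_last_line` and `transform` are hypothetical names, the mmap is opened with `ACCESS_READ` because the mapping is only read here, the original line ending is written back after the new line, and the file must be non-empty (mapping an empty file raises `ValueError`). It uses mmap as a context manager, so it needs Python 3.2 or later.

```python
import mmap
import os
import tempfile


def replace_last_line(path, transform, encoding="utf-8"):
    """Rewrite the last line of path in place with transform(old_line)."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = mm.rfind(b"\n", 0, len(mm) - 1) + 1
            tail = mm[start:]  # copy of the last line, including its ending
        line = tail.rstrip(b"\r\n")
        ending = tail[len(line):]  # original terminator, possibly empty
        new_line = transform(line.decode(encoding)).encode(encoding)
        f.seek(start)                # move to where the old line began
        f.write(new_line + ending)   # overwrite it, keeping the terminator
        f.truncate()                 # drop leftovers if the old line was longer


# Try it on a small temporary file
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"one\ntwo\nthree\n")
replace_last_line(path, lambda s: s.upper())
with open(path, "rb") as f:
    result = f.read()
os.remove(path)
print(result)  # b'one\ntwo\nTHREE\n'
```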

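The advantages of mmap over any other approach are:

1. No need to read any more of the file than the line itself (meaning 1-2 pages of the file; the rest is never read or written).
2. Using rfind lets Python do the work of finding the newline quickly at the C layer (in CPython). Explicit seeks and reads on a file object could match the "only read a page or so" behavior, but you'd have to hand-implement the search for the newline.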

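Caveat: this approach will not work if you're on a 32-bit system and the file is too large to map into memory (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped). On most 32-bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mappings, etc.). mmap doesn't require much physical RAM, but it does need contiguous address space. One of the huge benefits of a 64-bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle on a 32-bit OS without added complexity. Most modern computers are 64-bit at this point, but it's definitely something to keep in mind if you're targeting 32-bit systems (and on Windows, even if the OS is 64-bit, someone may have installed a 32-bit version of Python by mistake, so the same problems apply).

Here's one more example that works on 32-bit Python even for huge files, assuming the last line isn't more than 100 MB long (omitting closing and the imports for brevity):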

with open("large.XML", 'r+b') as myfile:
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
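
The masking expression used for `offset` rounds down to a multiple of the allocation granularity, which mmap requires for its `offset` argument. The sketch below isolates that arithmetic so it can be checked with small numbers. `aligned_offset` and `window` are hypothetical names, the trick assumes the granularity is a power of two, and the explicit granularities in the calls are illustrative values rather than what any particular platform reports.

```python
import mmap


def aligned_offset(filesize, window, granularity=mmap.ALLOCATIONGRANULARITY):
    """Largest multiple of granularity that is <= max(0, filesize - window)."""
    # With a power-of-two granularity, clearing the low bits rounds down.
    return max(0, filesize - window) & ~(granularity - 1)


# 10000 - 1000 = 9000, rounded down to a multiple of 4096
print(aligned_offset(10000, 1000, granularity=4096))  # 8192

# A file smaller than the window is simply mapped from the start
print(aligned_offset(500, 1000, granularity=4096))  # 0

# 3 GiB file, 100 MiB window, 64 KiB granularity: already aligned
print(aligned_offset(3 * 1024 ** 3, 100 * 1024 ** 2, granularity=65536))  # 3116367872
```

Because rounding down can only move the offset earlier, the mapped region is at least as large as the requested window (when the file is that big), and an index inside the mapping converts back to a file position by adding the offset, as the `myfile.seek(startofline + offset)` call does.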

Answer 2

Update: Use ShadowRanger's answer. It's much shorter and more robust.

For posterity:

Read the last N bytes of the file and search backwards for the newline.

#!/usr/bin/env python

with open("test.txt", "wb") as testfile:
    # The file is opened in binary mode, so write bytes (works on Python 2 and 3)
    testfile.write(b'\n'.join([b"one", b"two", b"three"]) + b'\n')

with open("test.txt", "r+b") as myfile:
    # Read the last 1kiB of the file
    # we could make this be dynamic, but chances are there's
    # a number like 1kiB that'll work 100% of the time for you
    myfile.seek(0, 2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # search backwards for a newline (excluding very last byte
    # in case the file ends with a newline); rindex raises ValueError
    # if the block contains no earlier newline
    index = myfile.read().rindex(b'\n', 0, blocksize - 1)
    # seek to the character just after the newline
    myfile.seek(index + 1 - blocksize, 2)
    # read in the last line of the file
    lastline = myfile.read()
    # modify lastline
    lastline = b"Brand New Line!\n"
    # seek back to the start of the last line
    myfile.seek(index + 1 - blocksize, 2)
    # write out new version of the last line
    myfile.write(lastline)
    myfile.truncate()
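
If a fixed block size can't be guaranteed to contain the start of the last line, the backward search can be made dynamic by stepping through the file one block at a time from the end. The following is a rough sketch of that idea, not a tested replacement for the code above. `find_last_line_start` is a hypothetical helper, and `io.BytesIO` stands in for a real file opened in binary mode.

```python
import io
import os


def find_last_line_start(f, blocksize=1024):
    """Return the offset at which the last line of binary file f begins."""
    f.seek(0, os.SEEK_END)
    filesize = f.tell()
    # Exclude the very last byte in case the file ends with a newline
    end = filesize - 1
    while end > 0:
        start = max(0, end - blocksize)
        f.seek(start)
        block = f.read(end - start)
        index = block.rfind(b"\n")
        if index != -1:
            # The last line begins just after the newline we found
            return start + index + 1
        end = start  # no newline in this block, step one block back
    return 0  # no newline found: the whole file is a single line


# A tiny block size forces the loop to step backwards more than once
demo = io.BytesIO(b"one\ntwo\nthree\n")
print(find_last_line_start(demo, blocksize=4))  # 8
```

The returned offset can then be passed to `seek` before the `write` and `truncate` calls, in place of the `index + 1 - blocksize` arithmetic.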

Solution 1
6 2015-11-19 18:37:11

Solution 2
2 Accepted 2015-11-19 19:00:43
