Most efficient way to modify the last line of a large text file in Python
I need to update the last line of several files, each larger than 2 GB, that are made up of lines of text and are too large to read with readlines(). Currently, it works fine by looping through the file line by line. However, I am wondering whether there is a compiled library that can achieve this more efficiently? Thanks!
myfile = open("large.XML")
for line in myfile:
    do_something()
If this is really something line based (where a true XML parser isn't necessarily the best solution), mmap can help here.
mmap the file, then call .rfind(b'\n') on the resulting object (possibly with adjustments to handle a file that ends with a newline, when you really want the non-empty line before it, not the empty "line" following it). You can then slice out the final line alone. If you need to modify the file in place, you can resize the file to shave off (or add) a number of bytes corresponding to the difference between the line you sliced and the new line, then write back the new line. This avoids reading or writing any more of the file than you need.
Example code (please comment if I made a mistake):
import mmap
# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')
    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')
    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
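The index arithmetic in the example above can be checked on plain bytes objects, since bytes.rfind takes the same arguments as mmap's rfind. This is only a small sketch; start_of_last_line is a hypothetical helper name, not part of the answer.

```python
def start_of_last_line(data: bytes) -> int:
    """Return the offset where the last non-empty line of data begins."""
    # Searching only up to len(data) - 1 skips a single trailing newline.
    # rfind returns -1 when no newline is found, so + 1 gives 0 for a
    # one-line file.
    return data.rfind(b'\n', 0, len(data) - 1) + 1

samples = (b"one\ntwo\nthree\n", b"one\ntwo\nthree", b"only line\n", b"only line")
for sample in samples:
    start = start_of_last_line(sample)
    # Show the start offset and the last line with its line ending stripped
    print(sample, "->", start, sample[start:].rstrip(b'\r\n'))
```

Note that a file ending in two newlines would still yield an empty last line here; handling that case is left out of the sketch.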
Apparently, on some systems without mremap (e.g. OSX), mm.resize won't work, so to support those systems you'd probably split the with (so the mmap closes before the file object) and use file object based seeks, writes and truncates to fix up the file. For completeness, the following example also includes the previously mentioned adjustment for Python 3.1 and earlier, which uses contextlib.closing:
import mmap
from contextlib import closing
with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')
    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
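To try the seek/write/truncate variant end to end, it can be wrapped in a function and run against a small temporary file. The sketch below assumes Python 3.2+ (so mmap works as a context manager), a non-empty UTF-8 file, and that the rewritten line should end with a newline; replace_last_line and its transform argument are hypothetical names used only for illustration.

```python
import mmap
import os
import tempfile

def replace_last_line(path, transform, encoding="utf-8"):
    """Replace the last line of path with transform(old_line). Hypothetical helper."""
    with open(path, "r+b") as f:
        # The mapping is only used to locate and read the last line,
        # so read-only access is enough here.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = mm.rfind(b"\n", 0, len(mm) - 1) + 1
            old = mm[start:].rstrip(b"\r\n")
        # The mapping is closed before the file is written and truncated.
        new = transform(old.decode(encoding)).encode(encoding)
        f.seek(start)
        f.write(new + b"\n")  # assumption: always end with a newline
        f.truncate()

# Demo on a small temporary file
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"<a>\n<b/>\n</a>\n")
replace_last_line(demo_path, lambda line: line.upper())
with open(demo_path, "rb") as f:
    result = f.read()
print(result)
os.remove(demo_path)
```

mmap raises ValueError for an empty file, so a real caller would probably want to check the file size first.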
The advantages of mmap over any other approach are:
Using rfind means you can let Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads of a file object could match the "only read a page or so" behavior, but you'd have to hand-implement the search for the newline.
Caveat: This approach will not work (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped) if you're on a 32 bit system and the file is too large to map into memory. On most 32 bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases, you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mapping, etc.). mmap doesn't require much physical RAM, but it needs contiguous address space; one of the huge benefits of a 64 bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle without added complexity on a 32 bit OS. Most modern computers are 64 bit at this point, but it's definitely something to keep in mind if you're targeting 32 bit systems (and on Windows, even if the OS is 64 bit, a 32 bit version of Python may have been installed by mistake, so the same problems apply).
Here's yet one more example that works on 32 bit Python even for huge files (assuming the last line isn't 100+ MB long; closing and imports are omitted for brevity):
with open("large.XML", 'r+b') as myfile:
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')
    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
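The offset calculation in the example above can be examined on its own, along with a quick check of whether the running interpreter is 32 or 64 bit. This is a sketch under the assumption that mmap.ALLOCATIONGRANULARITY is a power of two (which the bit mask relies on); aligned_tail_offset is a hypothetical helper name.

```python
import mmap
import struct

def aligned_tail_offset(filesize, tail_bytes=100 * 1024 ** 2,
                        granularity=mmap.ALLOCATIONGRANULARITY):
    """Largest multiple of granularity that is <= filesize - tail_bytes (at least 0)."""
    # With a power-of-two granularity, clearing the low bits rounds down
    # to the nearest multiple, which is what mmap's offset argument requires.
    return max(0, filesize - tail_bytes) & ~(granularity - 1)

# Pointer size of the running interpreter, in bits (32 or 64)
pointer_bits = struct.calcsize("P") * 8
print("interpreter is", pointer_bits, "bit")

filesize = 3 * 1024 ** 3 + 12345  # pretend file size of a bit over 3 GB
offset = aligned_tail_offset(filesize)
print(offset, offset % mmap.ALLOCATIONGRANULARITY)
```

Because the offset is rounded down, the mapping covers slightly more than the requested tail, never less.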
Update: Use ShadowRanger's answer. It's much shorter and more robust.
For posterity:
Read the last N bytes of the file and search backwards for the newline.
#!/usr/bin/env python
# Files are opened in binary mode, so bytes literals are used throughout
# (required on Python 3)
with open("test.txt", "wb") as testfile:
    testfile.write(b'\n'.join([b"one", b"two", b"three"]) + b'\n')
with open("test.txt", "r+b") as myfile:
    # Read the last 1kiB of the file
    # we could make this be dynamic, but chances are there's
    # a number like 1kiB that'll work 100% of the time for you
    myfile.seek(0, 2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # search backwards for a newline (excluding very last byte
    # in case the file ends with a newline); rindex raises ValueError
    # if the block contains no other newline
    index = myfile.read().rindex(b'\n', 0, blocksize - 1)
    # seek to the character just after the newline
    myfile.seek(index + 1 - blocksize, 2)
    # read in the last line of the file
    lastline = myfile.read()
    # modify lastline
    lastline = b"Brand New Line!\n"
    # seek back to the start of the last line
    myfile.seek(index + 1 - blocksize, 2)
    # write out new version of the last line
    myfile.write(lastline)
    myfile.truncate()
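The comment in the code above notes that the block size could be made dynamic. One possible way to do that, sketched here and not taken from the answer, is to step backwards through the file one block at a time until a newline turns up. find_last_line_start is a hypothetical helper, and the tiny blocksize in the demo is chosen only to exercise the loop.

```python
import os
import tempfile

def find_last_line_start(f, blocksize=1024):
    """Return the byte offset where the last line of binary file f starts."""
    f.seek(0, os.SEEK_END)
    end = f.tell()
    # Ignore a single trailing newline so the line before it is found
    f.seek(max(0, end - 1))
    if end and f.read(1) == b"\n":
        end -= 1
    pos = end
    while pos > 0:
        # Step back one block (or whatever is left) and search it
        step = min(blocksize, pos)
        pos -= step
        f.seek(pos)
        index = f.read(step).rfind(b"\n")
        if index != -1:
            return pos + index + 1
    return 0  # no newline found: the whole file is a single line

# Demo on a small temporary file
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"one\ntwo\nthree\n")
with open(demo_path, "r+b") as f:
    start = find_last_line_start(f, blocksize=4)
    f.seek(start)
    f.write(b"Brand New Line!\n")
    f.truncate()
with open(demo_path, "rb") as f:
    contents = f.read()
print(contents)
os.remove(demo_path)
```

Unlike the fixed 1kiB read, this does not raise when the last line is longer than one block, at the cost of a few extra seeks.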