Most efficient way to modify the last line of a large text file in Python
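I need to update the last line of files larger than 2 GB. They consist of lines of text and are too big to read with readlines(). At the moment it works by looping over the file line by line, but I would like to know whether there is a compiled library that can do this more efficiently. Thanks!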
myfile = open("large.XML")
for line in myfile:
    do_something()
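If this really is line-based content (in which case a true XML parser isn't necessarily the best solution), mmap can help here.
mmap the file, then call .rfind(b'\n') on the resulting object (possibly with adjustments to handle a file that ends with a newline, when what you really want is the non-empty line before it rather than the empty "line" after it). You can then slice out the last line on its own. If you need to modify the file in place, you can resize the file to shave off (or add) the number of bytes corresponding to the difference between the line you sliced and the new line, then write the new line back. This avoids reading or writing any more of the file than you need.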
Example code (please comment if I've made a mistake):
import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
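To see why the search stops at len(mm) - 1 and why 1 is added to the result, the same expression can be tried on plain bytes objects, whose rfind takes the same arguments as the mmap method. This is only a small sketch; the helper name last_line_start is made up for illustration and is not part of the answer above.

```python
def last_line_start(data: bytes) -> int:
    """Return the offset at which the last non-empty line of data begins."""
    # Stopping one byte before the end skips a trailing newline, if any.
    # rfind returns -1 when no newline is found, so + 1 gives 0 for a one-line file.
    return data.rfind(b'\n', 0, len(data) - 1) + 1


print(last_line_start(b"one\ntwo\nthree\n"))  # 8: file ends with a newline
print(last_line_start(b"one\ntwo\nthree"))    # 8: no trailing newline
print(last_line_start(b"only line\n"))        # 0: single-line file
```

In all three cases, slicing from the returned offset to the end yields the last line, with at most a trailing newline left to strip.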
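Apparently, on some systems without mremap (e.g. OSX), mm.resize doesn't work. To support those systems, you'd probably split up the with statement (so the mmap closes before the file object) and use file-object-based seeks, writes and truncates to fix up the file. The following example includes the Python 3.1-and-earlier adjustment I mentioned previously, using contextlib.closing, for completeness: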
import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
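The advantage of mmap over any other approach is this: using rfind means you can let Python do the work of finding the newline quickly at the C layer (in CPython). Explicit seeks and reads on a file object could match the "only read a page or so" behavior, but you'd have to implement the newline search by hand.
Caveat: this approach will not work if you're on a 32-bit system and the file is too large to map into memory (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped). On most 32-bit systems, even in a newly spawned process, only 1-2 GB of contiguous address space is available; in certain special cases you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, the stack, executable mappings and so on). mmap doesn't require much physical RAM, but it does need contiguous address space. One of the huge benefits of a 64-bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can handle the general case there, which on a 32-bit OS it could not do without added complexity. Most modern computers are 64-bit at this point, but this is definitely something to keep in mind if you're targeting 32-bit systems (and on Windows, even if the OS is 64-bit, someone may have installed a 32-bit build of Python by mistake, so the same problem applies).
Here's one more example that works on 32-bit Python even for huge files, assuming the last line is no more than 100 MB long (closing and the imports are omitted for brevity):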
with open("large.XML", 'r+b') as myfile:
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
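The offset expression in that example rounds down to a multiple of the allocation granularity, which mmap requires for its offset argument. The arithmetic can be checked in isolation with a small sketch. The function name aligned_offset, the granularity of 65536 and the file size are all assumptions chosen for illustration; real code would read mmap.ALLOCATIONGRANULARITY, and the bit mask relies on that value being a power of two.

```python
def aligned_offset(filesize: int, window: int, granularity: int) -> int:
    """Largest multiple of granularity that is <= max(0, filesize - window)."""
    # Clearing the low bits rounds down; this assumes granularity is a power of two.
    return max(0, filesize - window) & ~(granularity - 1)


granularity = 65536                # assumed value; use mmap.ALLOCATIONGRANULARITY in real code
window = 100 * 1024 ** 2           # map roughly the last 100 MB
filesize = 3 * 1024 ** 3 + 12345   # hypothetical file a little over 3 GiB

offset = aligned_offset(filesize, window, granularity)
print(offset)                # 3116367872
print(offset % granularity)  # 0, so the offset is valid for mmap
print(filesize - offset)     # 104869945 bytes would be mapped
```

Because the offset is rounded down, the mapped region is always at least as large as the requested window (when the file is that big) and exceeds it by less than one granule.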
Update: use ShadowRanger's answer. It's shorter and more robust.
For posterity:
Read the last N bytes of the file, then search backwards for the newline.
#!/usr/bin/env python

# The files are opened in binary mode, so bytes literals are used throughout
with open("test.txt", "wb") as testfile:
    testfile.write(b'\n'.join([b"one", b"two", b"three"]) + b'\n')

with open("test.txt", "r+b") as myfile:
    # Read the last 1kiB of the file
    # we could make this be dynamic, but chances are there's
    # a number like 1kiB that'll work 100% of the time for you
    myfile.seek(0, 2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # search backwards for a newline (excluding very last byte
    # in case the file ends with a newline); rindex raises ValueError
    # if the block contains no other newline
    index = myfile.read().rindex(b'\n', 0, blocksize - 1)
    # seek to the character just after the newline
    myfile.seek(index + 1 - blocksize, 2)
    # read in the last line of the file
    lastline = myfile.read()
    # modify lastline
    lastline = b"Brand New Line!\n"
    # seek back to the start of the last line
    myfile.seek(index + 1 - blocksize, 2)
    # write out new version of the last line
    myfile.write(lastline)
    myfile.truncate()
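The comment in the code above mentions that the block size could be made dynamic. One possible way to do that, sketched under my own assumptions rather than taken from either answer, is to double the block until a newline turns up or the whole file has been read. The function name replace_last_line and its parameters are hypothetical.

```python
import os
import tempfile


def replace_last_line(path, new_line, blocksize=1024):
    """Overwrite the last line of the file at path with new_line (bytes)."""
    with open(path, "r+b") as f:
        filesize = f.seek(0, os.SEEK_END)
        while True:
            blocksize = min(blocksize, filesize)
            f.seek(filesize - blocksize)
            block = f.read(blocksize)
            # Ignore the very last byte so a trailing newline is skipped
            index = block.rfind(b"\n", 0, len(block) - 1)
            if index != -1 or blocksize == filesize:
                break
            blocksize *= 2  # no newline in this block: look further back
        # index == -1 here means the whole file is a single line, so start is 0
        start = filesize - blocksize + index + 1
        f.seek(start)
        f.write(new_line)
        f.truncate()


# Demo on a throwaway file; the tiny blocksize forces the loop to grow the block
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "wb") as f:
        f.write(b"one\ntwo\nthree\n")
    replace_last_line(path, b"Brand New Line!\n", blocksize=4)
    with open(path, "rb") as f:
        print(f.read())  # b'one\ntwo\nBrand New Line!\n'
```

Unlike the fixed 1 KiB version, this does not raise when the last line is longer than the initial block, at the cost of re-reading the tail of the file on each doubling.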