Efficiently prepending text to a very large text file in Python

I have to prepend some arbitrary text to an existing, but very large (2-10 GB range), text file. With the file being so large, I'm trying to avoid reading the entire file into memory. But am I being too conservative with a line-by-line iteration? Would moving to a readlines(sizehint) approach give me much of a performance advantage over my current approach?

The delete-and-move at the end is less than ideal but, as far as I know, there's no way to do this sort of manipulation on linear data in place. But I'm not so well versed in Python -- maybe there's something unique to Python I can exploit to do this better?

import os
import shutil

def prependToFile(f, text):
    # Copy line by line into a temp file, writing the new text first.
    f_temp = generateTempFileName(f)
    inFile  = open(f, 'r')
    outFile = open(f_temp, 'w')
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    for line in inFile:
        outFile.write(line)
    inFile.close()
    outFile.close()
    # Replace the original file with the temp copy.
    os.remove(f)
    shutil.move(f_temp, f)
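
For comparison, a minimal sketch of the readlines(sizehint) variant the question asks about (generateTempFileName is the question's own helper, assumed to exist; the 1 MB hint is an arbitrary choice):

import os

def prependToFileReadlines(f, text, hint=1 << 20):
    f_temp = generateTempFileName(f)  # assumed helper from the question
    with open(f, 'r') as inFile, open(f_temp, 'w') as outFile:
        outFile.write('# START\n%s\n# END\n\n' % str(text))
        while True:
            # Pull roughly `hint` bytes' worth of complete lines per call.
            lines = inFile.readlines(hint)
            if not lines:
                break
            outFile.writelines(lines)
    os.replace(f_temp, f)  # atomic rename, Python 3.3+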

If this is on Windows NTFS, you can insert into the middle of a file. (Or so I'm told; I'm not a Windows developer.)

If this is on a POSIX (Linux or Unix) system, you should use "cat" as someone else said. cat is wickedly efficient, using every trick in the book to get optimal performance (i.e. it avoids copying buffers, etc.).

However, if you must do it in Python, the code you presented could be improved by using shutil.copyfileobj() (which takes two file handles) and tempfile.NamedTemporaryFile (which creates a temporary file that can be deleted automatically on close):

import os
import shutil
import tempfile

def prependToFile(f, text):
    # mode='w' gives a text-mode handle (the default is binary);
    # delete=False keeps the file after close so it can be moved into place.
    outFile = tempfile.NamedTemporaryFile(mode='w', dir='.', delete=False)
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    # copyfileobj streams the old file in large blocks, not line by line.
    with open(f, 'r') as inFile:
        shutil.copyfileobj(inFile, outFile)
    outFile.close()
    os.remove(f)
    shutil.move(outFile.name, f)

I think the os.remove(f) isn't needed, as shutil.move() will overwrite f. However, you should test that. Also, the delete=False may look unnecessary, but it's safest to leave it: since the close happens before the move, removing it would delete the temp file too early.

You could use a tool better suited to the job: os.system("cat file1 file2 > file3").
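
A minimal sketch of that idea using subprocess instead of os.system (the three paths are placeholders, and a POSIX cat is assumed):

import subprocess

def concat_with_cat(header_path, big_path, out_path):
    # cat streams both inputs into the output using large, optimized buffers.
    # shell=True means the paths must not contain spaces or shell metacharacters.
    subprocess.run('cat %s %s > %s' % (header_path, big_path, out_path),
                   shell=True, check=True)

You would still need to move out_path over the original file afterwards, e.g. with shutil.move.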

What you want to do is read the file in large blocks (anywhere from 64 KB to several MB) and write the blocks out. In other words, instead of individual lines, use huge blocks. That way you do the fewest I/Os possible, and hopefully your process is I/O-bound instead of CPU-bound.
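
A minimal sketch of that block-based approach (again, generateTempFileName is the question's helper, assumed to exist, and the 1 MB block size is an arbitrary choice):

import os

def prependToFileBlocks(f, text, blocksize=1 << 20):
    f_temp = generateTempFileName(f)  # assumed helper from the question
    # Binary mode skips newline handling and per-line overhead entirely.
    with open(f, 'rb') as inFile, open(f_temp, 'wb') as outFile:
        outFile.write(('# START\n%s\n# END\n\n' % str(text)).encode())
        while True:
            block = inFile.read(blocksize)
            if not block:
                break
            outFile.write(block)
    os.replace(f_temp, f)  # atomic rename, Python 3.3+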

To be honest, if you're worried about execution time, I would recommend you just write this in C instead. Doing system calls from Python can be quite slow, and since you'll have to do a lot of them whether you take the line-by-line or the raw-block-read approach, that will really drag things down.
