Efficiently prepending text to a very large text file in Python

I have to prepend some arbitrary text to an existing, but very large (2 - 10 GB range) text file. With the file being so large, I'm trying to avoid reading the entire file into memory. But am I being too conservative with a line-by-line iteration? Would moving to a readlines(sizehint) approach give me much of a performance advantage over my current approach?

The delete-and-move at the end is less than ideal but, as far as I know, there's no way to do this sort of manipulation with linear data, in place. But I'm not so well versed in Python -- maybe there's something unique to Python I can exploit to do this better?

import os
import shutil
def prependToFile(f, text):
    f_temp = generateTempFileName(f)
    inFile  = open(f, 'r')
    outFile = open(f_temp, 'w')    
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    for line in inFile:
        outFile.write(line)
    inFile.close()
    outFile.close()
    os.remove(f)
    shutil.move(f_temp, f)

If this is on Windows NTFS, you can insert into the middle of a file. (Or so I'm told, I'm not a Windows developer).

If this is on a POSIX (Linux or Unix) system, you should use "cat" as someone else said. cat is wickedly efficient, using every trick in the book to get optimal performance (i.e. it avoids copying buffers, etc.)
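If you want to drive cat from Python rather than the shell, a minimal sketch looks like this (assuming a POSIX system with cat on PATH; prepend_with_cat and the file-name parameters are my own, not from the original post):

```python
import subprocess

def prepend_with_cat(header_path, big_path, out_path):
    # cat streams both files to stdout; redirecting stdout to a file
    # object makes the kernel do the copying, not Python.
    with open(out_path, 'wb') as out:
        subprocess.run(['cat', header_path, big_path], stdout=out, check=True)
```

check=True raises if cat exits non-zero, so a failed copy doesn't silently leave a truncated output file.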

However, if you must do it in Python, the code you presented could be improved by using shutil.copyfileobj() (which takes two file handles) and tempfile.NamedTemporaryFile (which creates a file that is automatically deleted on close, unless delete=False is passed):

import os
import shutil
import tempfile

def prependToFile(f, text):
    # mode='w' so the str header writes cleanly (NamedTemporaryFile
    # defaults to binary mode); delete=False because we rename the
    # temp file into place instead of letting it be deleted on close.
    outFile = tempfile.NamedTemporaryFile(mode='w', dir='.', delete=False)
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    with open(f, 'r') as inFile:
        shutil.copyfileobj(inFile, outFile)
    outFile.close()  # close before moving; Windows can't move an open file
    os.remove(f)
    shutil.move(outFile.name, f)

I think the os.remove(f) isn't needed, as shutil.move() will overwrite f. However, you should test that. Also, delete=False may not be needed, but it is safe to leave in.
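The claim above is easy to test directly. A quick sketch (file names are arbitrary; behavior on Windows may differ from POSIX):

```python
import os
import shutil
import tempfile

# Does shutil.move() replace an existing destination file?
d = tempfile.mkdtemp()
src = os.path.join(d, 'src.txt')
dst = os.path.join(d, 'dst.txt')
with open(src, 'w') as fh:
    fh.write('new')
with open(dst, 'w') as fh:
    fh.write('old')
shutil.move(src, dst)  # on POSIX this overwrites dst via os.rename()
with open(dst) as fh:
    print(fh.read())   # prints: new (on POSIX)
```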

You can use a tool better suited to the job: os.system("cat file1 file2 > file3")

What you want to do is read the file in large blocks (anywhere from 64 KB to several MB) and write those blocks out. In other words, instead of individual lines, use huge blocks. That way you do the fewest I/O operations possible, and hopefully your process is I/O-bound instead of CPU-bound.
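A minimal sketch of that block-copy approach (prepend_blocks and the 1 MiB block size are my own choices, not from the original post):

```python
import os

def prepend_blocks(path, text, block_size=1024 * 1024):
    tmp = path + '.tmp'
    with open(path, 'rb') as src, open(tmp, 'wb') as out:
        out.write(text.encode('utf-8'))   # header goes first
        while True:
            block = src.read(block_size)  # large reads keep us I/O-bound
            if not block:
                break
            out.write(block)
    os.replace(tmp, path)  # atomic rename on POSIX
```

os.replace() atomically swaps the finished temp file into place, so a crash mid-copy leaves the original file untouched.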

To be honest, I would recommend you just write this in C instead if you're worried about execution time. Doing system calls from Python can be quite slow, and since you'll have to do a lot of them whether you use the line-by-line or the raw-block-read approach, that will really drag things down.
