Python - can I add UTF8 BOM to a file without opening it?
How can I add a UTF-8 BOM to a text file without open()-ing it?
Theoretically, we just need to add the UTF-8 BOM to the beginning of the file; surely we don't need to read in *all* of the content?
You need to read in the data, because you need to move all of it to make room for the BOM; files can't simply have arbitrary data prepended to them. Doing it in place is harder than just writing a new file consisting of the BOM followed by the original data and then replacing the original file, so the easiest solution is usually something like:
import os
import shutil
from os.path import dirname, realpath
from tempfile import NamedTemporaryFile

infile = ...

# Open a tempfile in the same directory as the original (so os.replace
# stays atomic) with the utf-8-sig codec, which writes the BOM for us
indir = dirname(realpath(infile))
with NamedTemporaryFile(dir=indir, mode='w', encoding='utf-8-sig') as tf:
    # Open original file as UTF-8 and copy it to the tempfile by blocks
    # (avoids memory use of slurping whole file at once)
    with open(infile, encoding='utf-8') as f:
        shutil.copyfileobj(f, tf)
    # Optional: Replicate metadata of original file
    tf.flush()
    shutil.copystat(f.name, tf.name)  # Replicate permissions of original file
    # Atomically replace original file with BOM marked file
    os.replace(tf.name, f.name)
    # Don't try to delete temp file if everything worked
    tf.delete = False
As a side effect, this also verifies that the input file was in fact valid UTF-8, and the original file never exists in an inconsistent state; it's either the old data or the new data, never an intermediate working copy.
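If you want to skip files that already carry a BOM before doing any copying, a small up-front check reads just the first three bytes (the helper name here is my own, not from the answer):

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file already starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, 'rb') as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Running the conversion only when `has_utf8_bom(infile)` is false avoids stacking a second BOM onto an already-converted file.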
If your files are large and your disk space is limited (so you can't have two copies on disk at once), then in-place mutation might be acceptable. The easiest way to do this is with the mmap module, which simplifies the process of moving the data around considerably compared to using in-place file object operations:
import codecs
import mmap

# Open file for read and write, then immediately map the whole file for write
with open(infile, 'r+b') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    origsize = mm.size()
    bomlen = len(codecs.BOM_UTF8)
    # Allocate additional space for BOM
    mm.resize(origsize + bomlen)
    # Shift file contents up to make room for BOM
    # (this reads and writes the whole file, and is unavoidable)
    mm.move(bomlen, 0, origsize)
    # Insert the BOM before the shifted data
    mm[:bomlen] = codecs.BOM_UTF8
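As a quick sanity check of the mmap approach, the snippet below exercises it on a throwaway file (the sample text is arbitrary) and confirms that the raw bytes gain the BOM while the utf-8-sig codec strips it again on read:

```python
import codecs
import mmap
import os
import tempfile

# Write a small BOM-less UTF-8 file to a throwaway path
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write('héllo wörld'.encode('utf-8'))

# Prepend the BOM in place, exactly as above
with open(path, 'r+b') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    origsize = mm.size()
    bomlen = len(codecs.BOM_UTF8)
    mm.resize(origsize + bomlen)
    mm.move(bomlen, 0, origsize)
    mm[:bomlen] = codecs.BOM_UTF8

# The raw bytes now start with the BOM...
with open(path, 'rb') as f:
    raw = f.read()
# ...and the utf-8-sig codec strips it again on read
with open(path, encoding='utf-8-sig') as f:
    text = f.read()
os.remove(path)
```

Note that mmap.resize on a file-backed map is supported on Linux and Windows but may fail on some other platforms.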
If you need an in-place update, something like
import codecs
import resource


def add_bom(fname, bom=None, buf_size=None):
    bom = bom or codecs.BOM_UTF8
    buf_size = buf_size or max(resource.getpagesize(), len(bom))
    with open(fname, 'rb', 0) as in_fd, open(fname, 'rb+', 0) as out_fd:
        # We cannot just read until EOF, because we will be writing to
        # that very same file, extending it; cap reads at the original size
        nbytes = os.fstat(in_fd.fileno()).st_size
        chunk = in_fd.read(min(buf_size, nbytes))
        nbytes -= len(chunk)
        if chunk.startswith(bom):
            # Don't write the BOM if it's already there
            return
        out_fd.write(bom)
        while chunk:
            # Stay one chunk ahead of the writer: each write lands
            # len(bom) bytes past the read position, so reading first
            # keeps the writer from clobbering bytes not yet read
            next_chunk = in_fd.read(min(buf_size, nbytes))
            nbytes -= len(next_chunk)
            out_fd.write(chunk)
            chunk = next_chunk
should do the trick.
As you can see, in-place updates are a little finicky, because you are reading from and writing to the same file while extending it.
In addition, this may leave the file in an inconsistent state if it is interrupted partway through. The copy-to-temporary, then move-temporary-over-original method is much preferred when possible.