Python - can I add UTF8 BOM to a file without opening it?

How can I add a UTF-8 BOM to a text file without calling open() on it?

In theory, we only need to add the UTF-8 BOM at the beginning of the file, so we shouldn't need to read in all of the content?

You need to read in the data, because all of it has to move to make room for the BOM. Files can't have arbitrary data prepended to them. Doing this in place is harder than writing a new file with the BOM followed by the original data and then replacing the original file, so the easiest solution is usually something like:

import os
import shutil

from os.path import dirname, realpath
from tempfile import NamedTemporaryFile

infile = ...

# Create the temp file in the same directory as the original, so that
# os.replace() stays on one filesystem and can be atomic
indir = dirname(realpath(infile))

# delete=False: the temp file is renamed on success and removed by hand on
# failure (setting tf.delete afterwards does not reliably stop the cleanup)
# newline='' on both files keeps the original line endings untouched
tf = NamedTemporaryFile(dir=indir, mode='w', encoding='utf-8-sig',
                        newline='', delete=False)
try:
    with tf, open(infile, encoding='utf-8', newline='') as f:
        # Copy from one file to the other by blocks
        # (avoids memory use of slurping whole file at once)
        shutil.copyfileobj(f, tf)

    # Optional: Replicate metadata (permissions, timestamps) of original file
    shutil.copystat(infile, tf.name)

    # Atomically replace original file with BOM marked file
    os.replace(tf.name, infile)
except BaseException:
    # Something failed: remove the temp file, leave the original untouched
    os.remove(tf.name)
    raise

As a side effect, this also verifies that the input file is in fact UTF-8, and the original file never exists in an inconsistent state: it holds either the old data or the new data, never the intermediate working copy.
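To make the role of the two codecs concrete, here is a small sketch (the sample text is made up for illustration). It shows that `utf-8-sig` prepends the BOM when encoding and strips it when decoding, and that a strict `utf-8` decode rejects bytes that are not valid UTF-8, which is the validation side effect mentioned above.

```python
import codecs

text = "naïve café"  # arbitrary sample text, chosen for illustration

# Encoding with utf-8-sig prepends the 3-byte BOM (EF BB BF)
with_bom = text.encode("utf-8-sig")
assert with_bom == codecs.BOM_UTF8 + text.encode("utf-8")

# Decoding with utf-8-sig strips the BOM again
assert with_bom.decode("utf-8-sig") == text

# Decoding with plain utf-8 keeps the BOM as a leading U+FEFF character
assert with_bom.decode("utf-8") == "\ufeff" + text

# Bytes in another encoding are rejected by a strict utf-8 decode
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
```

Because plain `utf-8` keeps an existing BOM as a character, running the copy on a file that already starts with a BOM would presumably produce a second one, so it may be worth checking for that case first.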

If your files are large and your disk space is limited (so you can't have two copies on disk at once), then in-place mutation might be acceptable. The easiest way to do this is the mmap module, which makes moving the data around considerably simpler than in-place file object operations:

import codecs
import mmap

# Open file for read and write and then immediately map the whole file for write
# (an empty file cannot be mapped this way; handle that case separately)
with open(infile, 'r+b') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    origsize = mm.size()
    bomlen = len(codecs.BOM_UTF8)
    # Allocate additional space for BOM; this also grows the underlying file
    # (resize() is not available on every platform)
    mm.resize(origsize+bomlen)

    # Copy file contents down to make room for BOM
    # This reads and writes the whole file, and is unavoidable
    mm.move(bomlen, 0, origsize)

    # Insert the BOM before the shifted data
    mm[:bomlen] = codecs.BOM_UTF8
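
Before shifting a large file in place, it may be worth checking whether it already starts with a BOM, since that check only needs the first three bytes rather than the whole content. A minimal sketch, where `has_utf8_bom` is a hypothetical helper name and not part of the answer above:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM."""
    with open(path, "rb") as f:
        # Only the first len(BOM) == 3 bytes are read, whatever the file size
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Something like `if not has_utf8_bom(infile): ...` around either approach would then skip files that are already marked.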

If you need an in-place update, something like

import codecs
import mmap

def add_bom(fname, bom=codecs.BOM_UTF8, buf_size=None):
    # The block size must be at least len(bom), otherwise the write
    # position would catch up with the read position
    buf_size = max(buf_size or mmap.PAGESIZE, len(bom))
    with open(fname, 'rb') as in_fd, open(fname, 'rb+') as out_fd:
        # we cannot just read until eof, because we
        # will be writing to that very same file, extending it.
        nbytes = out_fd.seek(0, 2)  # original size = bytes left to read
        out_fd.seek(0)
        cur = in_fd.read(min(buf_size, nbytes))
        if cur.startswith(bom):
            # don't write the BOM if it's already there
            return
        nbytes -= len(cur)
        out_fd.write(bom)
        while cur:
            # read the next block *before* writing the current one: the
            # write lands len(bom) bytes further on and would otherwise
            # overwrite data that has not been read yet
            nxt = in_fd.read(min(buf_size, nbytes))
            nbytes -= len(nxt)
            out_fd.write(cur)
            cur = nxt

should do the trick.

As you can see, in-place updates are a little finicky, because you are

  1. Writing data to the place you've just read from. The read must always stay ahead of the write, otherwise you overwrite not-yet-processed data.
  2. Extending the file you are reading, so reading until EOF doesn't work.
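
The first point can be illustrated without touching a real file. The sketch below simulates the file with a `bytearray`; the function names `shift_naive` and `shift_read_ahead` and the sample data are made up for illustration, and the block size is assumed to be at least as large as the gap being opened.

```python
def shift_naive(data, gap, block):
    """Forward copy that writes each block right after reading it (wrong)."""
    buf = bytearray(data) + bytearray(gap)  # the "file", extended by `gap`
    n = len(data)
    for start in range(0, n, block):
        chunk = bytes(buf[start:min(start + block, n)])    # read
        buf[start + gap:start + gap + len(chunk)] = chunk  # write
    return bytes(buf[gap:])


def shift_read_ahead(data, gap, block):
    """Forward copy that reads the next block before writing the current one."""
    buf = bytearray(data) + bytearray(gap)
    n = len(data)
    cur = bytes(buf[0:min(block, n)])
    read_pos, write_pos = len(cur), gap
    while cur:
        nxt = bytes(buf[read_pos:min(read_pos + block, n)])  # read ahead
        read_pos += len(nxt)
        buf[write_pos:write_pos + len(cur)] = cur            # then write
        write_pos += len(cur)
        cur = nxt
    return bytes(buf[gap:])


sample = b"abcdefgh"
naive_result = shift_naive(sample, gap=3, block=4)
safe_result = shift_read_ahead(sample, gap=3, block=4)
```

With these inputs the naive version overwrites the start of the second block before reading it, so `naive_result` no longer matches `sample`, while `safe_result` does.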

In addition, this may leave the file in an inconsistent state if it is interrupted partway through. The copy to temporary -> move temporary to original method is much preferred if possible.
