python utf-8-sig BOM in the middle of the file when appending to the end
I've noticed recently that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:
>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
The following text ends up in the file:
<BOM>123
<BOM>123
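A quick way to confirm what actually lands on disk is to read the file back in binary mode (a Python 3 sketch; it uses a temporary directory rather than the bare name '123'):

```python
import codecs
import os
import tempfile

# Reproduce the question: two separate appends with utf-8-sig.
path = os.path.join(tempfile.mkdtemp(), '123')
with codecs.open(path, 'a', encoding='utf-8-sig') as f:
    f.write('123\n')
with codecs.open(path, 'a', encoding='utf-8-sig') as f:
    f.write('123\n')

# Read the raw bytes: each append opened a fresh writer, so each wrote a BOM.
with open(path, 'rb') as f:
    data = f.read()
print(data)  # b'\xef\xbb\xbf123\n\xef\xbb\xbf123\n'
```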
Isn't that a bug? This is not logical at all. Could anyone explain to me why it was done this way? Why doesn't it prepend the BOM only when the file doesn't exist and needs to be created?
No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file, for example. The file would not be new, but it would not contain a BOM either.
Then there are other use-cases where the codec is used on a stream or bytestring (e.g. not with codecs.open()) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.
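For example, the same codec can wrap an in-memory byte stream via codecs.getwriter(), where there is no file to inspect at all (a sketch, not from the original answer):

```python
import codecs
import io

# Wrap an in-memory buffer with the utf-8-sig stream writer.
buf = io.BytesIO()
writer = codecs.getwriter('utf-8-sig')(buf)
writer.write(u'123\n')

# The BOM is enforced at the start of the stream, file or not.
print(buf.getvalue())  # b'\xef\xbb\xbf123\n'
```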
Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:
import io
with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')
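Appending twice through this pattern produces a single BOM at the start (a quick check; the helper name and throwaway path are mine, not from the original answer):

```python
import io
import os
import tempfile

# Hypothetical helper wrapping the pattern above.
def append_with_bom(filename, text):
    with io.open(filename, 'a', encoding='utf8') as outfh:
        if outfh.tell() == 0:
            # start of file: write the BOM exactly once
            outfh.write(u'\ufeff')
        outfh.write(text)

path = os.path.join(tempfile.mkdtemp(), 'out.txt')
append_with_bom(path, u'123\n')
append_with_bom(path, u'123\n')

with open(path, 'rb') as f:
    result = f.read()
print(result)  # b'\xef\xbb\xbf123\n123\n'
```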
I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and it is more robust than codecs for handling encoded files, in my experience.
Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
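The byte-order point is easy to see by encoding a single character in each form (a sketch):

```python
# UTF-16 can be serialized with either byte order; the BOM disambiguates.
print(u'A'.encode('utf-16-le'))  # b'A\x00'
print(u'A'.encode('utf-16-be'))  # b'\x00A'

# Plain 'utf-16' prepends a BOM, then uses the machine's native order.
bom = u'A'.encode('utf-16')[:2]
print(bom in (b'\xff\xfe', b'\xfe\xff'))  # True

# UTF-8 bytes are identical on every machine; its "BOM" is just a marker.
print(u'A'.encode('utf-8-sig'))  # b'\xef\xbb\xbfA'
```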
The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (e.g. that it is not one of the legacy code pages).