
python utf-8-sig BOM in the middle of the file when appending to the end

I've recently noticed that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:

>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')

The following text ends up in the file:

<BOM>123
<BOM>123
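
For reference, reading the file back in binary mode makes the duplicated marker easy to see (a quick sanity check, not shown in the original session; '123' is the file created above):

with open('123', 'rb') as f:
    print(f.read())
    # b'\xef\xbb\xbf123\n\xef\xbb\xbf123\n' on Python 3 -- each append call wrote its own BOM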

Isn't that a bug? It seems completely illogical. Could anyone explain to me why it was done this way? Why isn't the BOM prepended only when the file doesn't exist and needs to be created?

No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file, for example. The file would not be new, but it would not contain a BOM either.

Then there are other use-cases where the codec is used on a stream or bytestring (e.g. not with codecs.open()), where there is no file at all to test, or where the developer wants to always enforce a BOM at the start of the output.
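
For example, encoding a plain string with the codec involves no file at all, yet the signature is still prepended every time (a small illustration, not from the original answer):

import codecs

# utf-8-sig always emits the 3-byte signature first, regardless of context
encoded = u'123'.encode('utf-8-sig')
print(encoded)                              # b'\xef\xbb\xbf123' on Python 3
print(encoded.startswith(codecs.BOM_UTF8))  # True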

Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
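
In other words, reserve the signature codec for the case where you create the file yourself (a minimal sketch; filename stands for the path you want to write):

import io

# 'w' truncates or creates the file, so the BOM is written exactly once, up front
with io.open(filename, 'w', encoding='utf-8-sig') as outfh:
    outfh.write(u'123\n')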

If you are working directly with files, you can test for the start of the file yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:

import io

with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')
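
As a side note, when reading such a file back you can open it with utf-8-sig, which transparently skips a leading BOM if one is present (a small follow-up sketch reusing the same filename):

with io.open(filename, 'r', encoding='utf-8-sig') as infh:
    text = infh.read()  # a leading U+FEFF, if any, is consumed by the codec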

I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and in my experience it is more robust than codecs for handling encoded files.

Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one possible Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed there.
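
To see why UTF-16 genuinely needs the mark, compare the two byte orders directly (a quick illustration, not part of the original answer):

# The same character encodes to two different byte sequences; the BOM at the
# front of a plain 'utf-16' encoding records which order was used.
print(u'A'.encode('utf-16-le'))  # b'A\x00'
print(u'A'.encode('utf-16-be'))  # b'\x00A'
print(u'A'.encode('utf-16'))     # b'\xff\xfeA\x00' on a little-endian machine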

The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (i.e. that it is UTF-8 and not one of the legacy code pages).
