python utf-8-sig BOM in the middle of the file when appending to the end
I've noticed recently that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:
>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
The following text ends up in the file:
<BOM>123
<BOM>123
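A quick way to confirm what actually lands on disk is to read the file back in binary mode (a Python 3 sketch; it uses a temporary directory rather than the bare name '123'):

```python
import codecs
import os
import tempfile

# Reproduce the question: two separate appends with utf-8-sig.
path = os.path.join(tempfile.mkdtemp(), '123')
with codecs.open(path, 'a', encoding='utf-8-sig') as f:
    f.write('123\n')
with codecs.open(path, 'a', encoding='utf-8-sig') as f:
    f.write('123\n')

# Read the raw bytes: each append opened a fresh writer, so each wrote a BOM.
with open(path, 'rb') as f:
    data = f.read()
print(data)  # b'\xef\xbb\xbf123\n\xef\xbb\xbf123\n'
```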
Isn't that a bug? This is not logical at all. Could anyone explain to me why it was done this way? Why doesn't it prepend the BOM only when the file doesn't exist and needs to be created?
No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file, for example. The file would not be new, but it would not contain a BOM either.
Then there are other use-cases where the codec is used on a stream or bytestring (e.g. not with codecs.open()) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.
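For example, the same codec can wrap an in-memory byte stream via codecs.getwriter(), where there is no file to inspect at all (a sketch, not from the original answer):

```python
import codecs
import io

# Wrap an in-memory buffer with the utf-8-sig stream writer.
buf = io.BytesIO()
writer = codecs.getwriter('utf-8-sig')(buf)
writer.write(u'123\n')

# The BOM is enforced at the start of the stream, file or not.
print(buf.getvalue())  # b'\xef\xbb\xbf123\n'
```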
Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:
import io
with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')
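Appending twice through this pattern produces a single BOM at the start (a quick check; the helper name and throwaway path are mine, not from the original answer):

```python
import io
import os
import tempfile

# Hypothetical helper wrapping the pattern above.
def append_with_bom(filename, text):
    with io.open(filename, 'a', encoding='utf8') as outfh:
        if outfh.tell() == 0:
            # start of file: write the BOM exactly once
            outfh.write(u'\ufeff')
        outfh.write(text)

path = os.path.join(tempfile.mkdtemp(), 'out.txt')
append_with_bom(path, u'123\n')
append_with_bom(path, u'123\n')

with open(path, 'rb') as f:
    result = f.read()
print(result)  # b'\xef\xbb\xbf123\n123\n'
```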
I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and it is more robust than codecs for handling encoded files, in my experience.
Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
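The byte-order point is easy to see by encoding a single character in each form (a sketch):

```python
# UTF-16 can be serialized with either byte order; the BOM disambiguates.
print(u'A'.encode('utf-16-le'))  # b'A\x00'
print(u'A'.encode('utf-16-be'))  # b'\x00A'

# Plain 'utf-16' prepends a BOM, then uses the machine's native order.
bom = u'A'.encode('utf-16')[:2]
print(bom in (b'\xff\xfe', b'\xfe\xff'))  # True

# UTF-8 bytes are identical on every machine; its "BOM" is just a marker.
print(u'A'.encode('utf-8-sig'))  # b'\xef\xbb\xbfA'
```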
The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (e.g. that it is not one of the legacy code pages).