Why does Python's open() function mangle my UTF-8 files?
This is a strange one, and it might be due to a Python update, because it worked fine yesterday and I changed nothing. Here we go:
I have a program that opens UTF-8 files (they use accented characters and so on, not just ASCII characters). When I open the files with open(file, encoding="utf-8-sig").read(), the non-ASCII characters get mangled, as shown here in my terminal:
[Screenshot: mangled characters when the encoding of open() is set to "utf-8-sig"]
However, when I set the encoding to "ansi", the characters are perfectly normal!
[Screenshot: normal characters with encoding="ansi"]
This is a complete mystery to me. As I said, this worked fine yesterday. I've checked multiple times that the files are indeed UTF-8. I don't know whether the problem is with the open() function or with the print() function when the characters are displayed. In any case, it's strange. The "ansi" version would be a solution, but it causes problems with Lark, which uses the contents of the opened files.
In the screenshots above, the code is basic:
with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    print(grammar.read())
What could be causing this problem?
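One way to narrow down whether the decoding or the terminal display is at fault is to inspect the decoded text in an escaped form that does not depend on the terminal. The following is a minimal sketch, not the original program: the sample text and the temporary file are made up for illustration and stand in for the grammar file.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical sample text ("cafe" and "naive" with accents), written with
# escapes so the script itself contains only ASCII.
sample = "caf\u00e9 na\u00efve"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    # "utf-8-sig" writes a BOM first, as some Windows editors do.
    path.write_text(sample, encoding="utf-8-sig")

    # Same call shape as in the question; "utf-8-sig" strips the BOM on read.
    with open(path, "r", encoding="utf-8-sig") as f:
        text = f.read()

# ascii() escapes every non-ASCII character, so this line looks identical
# in any terminal, whatever encoding the terminal uses.
escaped = ascii(text)
print(escaped)

# The encoding print() uses when writing to the terminal (may be None
# when output is redirected to an in-memory stream).
print(getattr(sys.stdout, "encoding", None))
```

If the escaped output shows the expected code points (here \xe9 and \xef), open() decoded the file correctly and the mangling happens when the text is displayed, not when it is read.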
I just noticed something: ansi is not an encoding. The correct name for the encoding would be ascii. This means that when I typed encoding="ansi", Python ignored the encoding I asked for and read the file with its default encoding, which is normally UTF-8. This does not explain why it doesn't work with utf-8-sig or why Lark is screaming at me, but that is specific to my case. So for future readers of this question, check 2 things:
If you want to use ASCII, type ascii, not ansi.
On Windows machines, Python recognises the name "ansi" as an alias for the "mbcs" codec, defined as:
Windows only: Encode the operand according to the ANSI codepage (CP_ACP).
So ansi is a valid encoding, but it isn't the same as ASCII or UTF-8, hence the apparent mangling.
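The mangling can be reproduced on any platform by decoding UTF-8 bytes with a single-byte Windows code page. This sketch assumes the ANSI code page is cp1252 (Windows-1252), which is common on Western European installs but not guaranteed, and the sample string is made up. It also shows that a name Python does not recognise raises LookupError rather than silently falling back to a default.

```python
import codecs

text = "caf\u00e9"  # "cafe" with an acute accent on the e

# In UTF-8 the accented letter takes two bytes: 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")

# Assumption: CP_ACP is Windows-1252. A single-byte code page turns each
# of those two bytes into its own character, which is the mangling.
mangled = utf8_bytes.decode("cp1252")
print(ascii(mangled))  # 'caf\xc3\xa9', i.e. two characters instead of one

# The damage is reversible as long as every byte mapped to a character.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == text)  # True

# An unknown encoding name is an error, not a silent fallback.
# "no-such-encoding" is a deliberately invalid, made-up name.
try:
    codecs.lookup("no-such-encoding")
    lookup_failed = False
except LookupError:
    lookup_failed = True
print(lookup_failed)  # True
```

On a Windows machine, codecs.lookup("ansi") succeeds and resolves to the mbcs codec; on other platforms that name is not available, which is why the sketch names cp1252 explicitly.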