Why does Python's open() function mangle my UTF-8 file?

Question

This is a strange one, and it might be due to a Python update, because it worked fine yesterday and I have changed nothing. Here we go:

I have a program that opens UTF-8 files (which use accented characters and so on, not just ANSI characters). When I open the files with open(file, encoding="utf-8-sig").read(), the non-ANSI characters get mangled, as shown here in my terminal:

[Screenshot: mangled characters when the encoding of open() is set to "utf-8-sig"]

However, when I set the encoding to "ansi", the characters are perfectly normal!

[Screenshot: normal characters with encoding="ansi"]

This is a complete mystery to me. As I said, this worked fine yesterday. I've checked multiple times that the files are indeed UTF-8. I don't know whether the problem is with the open() function or with the print() function when the characters are displayed. In any case, it's strange. The "ansi" version would be a solution, but it causes problems with Lark, which uses the contents of the opened files.
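One way to tell whether open() or print() is at fault is to inspect the decoded string without relying on the terminal. The following is a minimal sketch, not a diagnosis of this specific case: the file name sample.txt and its contents are hypothetical stand-ins for the real grammar file.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical sample file standing in for the real grammar file:
# a UTF-8 BOM followed by the UTF-8 bytes of "café".
sample = Path(tempfile.mkdtemp()) / "sample.txt"
sample.write_bytes("caf\u00e9".encode("utf-8-sig"))

with open(sample, "r", encoding="utf-8-sig") as f:
    text = f.read()

# ascii() escapes every non-ASCII character, so this output does not
# depend on how the terminal renders accented characters.
print(ascii(text))

# The BOM is stripped by utf-8-sig, so four characters remain.
print(len(text))

# The encoding print() uses when writing to the terminal.
print(sys.stdout.encoding)
```

If ascii(text) shows the expected code points (here \xe9 for é), the file was decoded correctly and any mangling happens later, when the text is written to the terminal.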

In the screenshots above, the code is basic:

with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    print(grammar.read())

What could be causing this problem?

Answer 1

I just noticed something: ansi is not an encoding. The correct name for the encoding would be ascii. This means that when I typed encoding="ansi", Python ignored the encoding I asked for and read the file with its default encoding, which is normally UTF-8. This does not explain why it doesn't work with utf-8-sig or why Lark is screaming at me, but that is specific to my case. So, for future readers of this question, check two things:

If you want to use ascii, type ascii, not ansi.
Stick with the defaults.
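A quick way to check whether Python recognises an encoding name is to ask the codec registry. This is a small sketch; the helper name is_known_encoding is made up for illustration. Note that an unrecognised name raises LookupError rather than silently falling back to a default, and that the result for a platform-specific name such as "ansi" depends on the operating system.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python's codec registry recognises this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


print(is_known_encoding("ascii"))
print(is_known_encoding("utf-8-sig"))
print(is_known_encoding("no-such-encoding"))
```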

Answer 2

On Windows machines, Python recognises the name "ansi" as an alias for the "mbcs" codec, defined as:

Windows only: Encode the operand according to the ANSI codepage (CP_ACP).

So ansi is a valid encoding, but it isn't the same as ASCII or UTF-8, hence the apparent mangling.
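To illustrate how the same bytes read differently under different codecs, here is a minimal sketch. It assumes cp1252 as a stand-in for the ANSI code page (the actual code page depends on the Windows locale), so that it also runs on non-Windows machines where "mbcs" is unavailable.

```python
# "café" written with an escape so the source itself is pure ASCII.
original = "caf\u00e9"
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# Decoded with the codec it was written in, the text round-trips.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoded as cp1252 (assumed ANSI code page), each of the two UTF-8
# bytes of "é" becomes a separate character.
as_cp1252 = utf8_bytes.decode("cp1252")

# ASCII cannot represent bytes above 0x7F at all.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(ascii(as_utf8))
print(ascii(as_cp1252))
print(ascii_failed)
```

The cp1252 result is the familiar two-character garbage (Ã followed by ©) that appears when UTF-8 data is read through an ANSI code page.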

Answer 1
0 2021-10-20 12:29:59

Answer 2
0 2021-10-20 15:31:54
