Why does Python's open() function mangle my UTF-8 files?

This is a strange one, and it might be due to a Python update, because it worked fine yesterday with no changes. Here we go:

I have a program that opens UTF-8 files (which use accented characters etc., not just ANSI characters). When I open the files with open(file, encoding="utf-8-sig").read(), the non-ANSI characters get mangled, as shown here in my terminal:

[Screenshot: mangled characters when the encoding of open() is set to "utf-8-sig"]

However, when I set the encoding to "ansi", the characters are perfectly normal!

[Screenshot: normal characters with encoding="ansi"]

This is a complete mystery to me. As I said, this worked fine yesterday. I've checked multiple times that the files are indeed UTF-8. I don't know whether the problem lies with the open() function or with the print() function when the characters are displayed. In any case, it's strange. The "ansi" version would be a solution, but it causes problems with Lark, which uses the contents of the opened files.

In the screenshots I gave here, the code is basic:

with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    print(grammar.read())

What could be causing this problem?
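One way to tell whether open() or print() is at fault is to look at the decoded text in a form that does not depend on the terminal. The sketch below is only an illustration: the helper inspect_text_file and the sample file are hypothetical stand-ins for the grammar file, which is assumed to be UTF-8.

```python
import sys
import tempfile
from pathlib import Path


def inspect_text_file(path):
    """Return (has_bom, text) for a file assumed to be UTF-8."""
    raw = Path(path).read_bytes()
    has_bom = raw.startswith(b"\xef\xbb\xbf")
    # utf-8-sig strips a leading BOM if present; otherwise it decodes like utf-8.
    text = raw.decode("utf-8-sig")
    return has_bom, text


# Hypothetical sample file standing in for the real grammar file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("café naïve", encoding="utf-8")
    has_bom, text = inspect_text_file(sample)

# ascii() prints non-ASCII characters as escapes, so the terminal cannot mangle it.
print("BOM present:", has_bom)
print("decoded text:", ascii(text))
print("stdout encoding:", sys.stdout.encoding)
```

If the escaped form shows the expected code points but a plain print(text) looks wrong, the file was probably decoded correctly and the terminal's encoding is the more likely culprit.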

I just noticed something: ansi is not an encoding. The correct name for the encoding would be ascii. This means that when I typed encoding="ansi", Python ignored the encoding I asked for and read the file with its default encoding, which is normally UTF-8. This does not explain why it doesn't work with utf-8-sig or why Lark is screaming at me, but that is specific to my case. So for future readers of this question, check two things:

  1. If you want to use ASCII, type ascii, not ansi.
  2. Stick with the defaults.
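It may be worth checking how Python actually treats codec names before relying on this. As far as I know, an unrecognised name raises LookupError rather than silently falling back to a default, and the strict ascii codec rejects accented characters outright. A minimal sketch, where "not-a-real-codec" is a deliberately invalid, made-up name:

```python
import codecs

utf8_bytes = "é".encode("utf-8")  # two bytes: 0xC3 0xA9

# An unrecognised codec name raises LookupError instead of being ignored.
try:
    codecs.lookup("not-a-real-codec")  # hypothetical, deliberately invalid name
    unknown_name_error = None
except LookupError as exc:
    unknown_name_error = exc

# The ascii codec is strict: any byte above 0x7F fails to decode.
try:
    utf8_bytes.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(type(unknown_name_error).__name__)
print(type(ascii_error).__name__)
```

So if encoding="ansi" is accepted without an error, Python most likely resolved the name to a real codec on that platform.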

On Windows machines, Python recognises the name "ansi" as an alias for the "mbcs" codec, defined as

Windows only: Encode the operand according to the ANSI codepage (CP_ACP).

So ansi is a valid encoding, but it isn't the same as ASCII or UTF-8, hence the apparent mangling.
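To illustrate the kind of mangling a code-page mismatch produces, here is a small sketch. It assumes cp1252 as the ANSI code page, which is common on Western European Windows installs; the actual code page depends on the system locale, so treat the exact characters as an example only.

```python
import codecs
import sys

original = "café"
utf8_bytes = original.encode("utf-8")

# Assumption: cp1252 stands in for the ANSI code page (CP_ACP) of the machine.
# Each byte of the two-byte UTF-8 sequence for "é" becomes its own character.
mangled = utf8_bytes.decode("cp1252")
print(ascii(mangled))

# Undoing the wrong decode recovers the original text.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == original)

# On Windows only, show which codec the "ansi" alias resolves to.
if sys.platform == "win32":
    print(codecs.lookup("ansi").name)
```

The same mismatch can happen in the other direction, for example when correctly decoded text is written to a terminal that expects a different encoding.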
