
Why does Python's open() function mangle my UTF-8 files?

This is a strange one, and it might be due to a Python update, because it worked fine yesterday and I have changed nothing since. Here we go:

I have a program that opens UTF-8 files (which contain accented characters and so on, not just ASCII). When I open the files with open(file, encoding="utf-8-sig").read(), the non-ASCII characters get mangled, as shown here in my terminal:

[Screenshot: mangled characters when the encoding of open() is set to "utf-8-sig"]

However, when I set the encoding to "ansi", the characters are perfectly normal!

[Screenshot: normal characters with encoding="ansi"]

This is a complete mystery to me. As I said, this worked fine yesterday. I have checked several times that the files really are UTF-8. I don't know whether the problem lies in the open() function or in the print() function when the characters are displayed. In any case, it's strange. The "ansi" version would be a workaround, but it causes problems with Lark, which uses the contents of the opened files.

In the screenshots I gave here, the code is basic:

with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    print(grammar.read())

What could this problem be caused by?
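One way to tell a read problem from a display problem is to look at the raw bytes and at an escaped form of the decoded text, neither of which depends on how the terminal renders characters. The following is only a diagnostic sketch: the file name grammar.lark and its one-character content are hypothetical stand-ins for the real grammar file.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the grammar file from the question:
# a tiny UTF-8 file with a BOM, containing one accented character.
path = Path(tempfile.mkdtemp()) / "grammar.lark"
path.write_bytes("é".encode("utf-8-sig"))

# Step 1: inspect the bytes on disk, before any decoding happens.
raw = path.read_bytes()

# Step 2: decode the same way the question does.
with open(path, "r", encoding="utf-8-sig") as f:
    text = f.read()

print(raw[:16])             # raw bytes, shown as a bytes literal
print(ascii(text))          # decoded text with non-ASCII escaped
print(sys.stdout.encoding)  # encoding print() uses when writing to the terminal
```

If the escaped text shows the expected code points, the file was read correctly and the mangling is introduced when the text is printed.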

I just noticed something: ansi is not an encoding. The correct name for the encoding would be ascii. This means that when I typed encoding="ansi", Python ignored the encoding I asked it to set and read the file with its default encoding, which is normally UTF-8. This does not explain why it doesn't work with utf-8-sig or why Lark is screaming at me, but that is specific to my case. So, for future readers of this question, check two things:

  1. If you want to use ascii, type ascii, not ansi.
  2. Stick with the defaults.

On Windows machines, Python recognises the name "ansi" as an alias for the "mbcs" codec, defined as

Windows only: Encode the operand according to the ANSI codepage (CP_ACP).
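Whether a given name resolves to a codec on the current machine can be checked with codecs.lookup, which raises LookupError for names Python does not know rather than silently falling back to a default. A small sketch (the helper describe_codec is a name made up for this example):

```python
import codecs

def describe_codec(name):
    """Return the canonical codec name, or None if Python cannot find it."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None

# "ansi" is expected to resolve to "mbcs" on Windows and to be
# unavailable elsewhere; "no-such-codec" is a deliberately bogus name.
for name in ("utf-8-sig", "ascii", "ansi", "no-such-codec"):
    print(name, "->", describe_codec(name))
```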

So ansi is a valid encoding, but it isn't the same as ASCII, or UTF-8, hence the apparent mangling.
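The difference can be reproduced without Windows by decoding the same bytes with two codecs. In this sketch cp1252 is an assumption, used as a stand-in for the ANSI code page; the actual value of CP_ACP depends on the system locale.

```python
# Bytes of a small UTF-8 file that starts with a BOM, as "utf-8-sig" expects.
text = "café à la crème"
raw = text.encode("utf-8-sig")

# Decoding with the matching codec strips the BOM and restores the text.
as_utf8 = raw.decode("utf-8-sig")

# cp1252 is assumed here as a typical Windows ANSI code page. Each byte
# becomes one character, so every multi-byte UTF-8 sequence (and the BOM)
# turns into several unrelated characters.
as_ansi = raw.decode("cp1252")

print(repr(as_utf8))
print(repr(as_ansi))

# The mis-decoding is reversible: re-encoding with the wrong codec
# recovers the original bytes.
recovered = as_ansi.encode("cp1252").decode("utf-8-sig")
```

Both decodes succeed without raising an error, which is why a mismatch like this tends to show up as odd characters rather than as an exception.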
