Decoding an ANSI-encoded file as UTF-8 raises an error

Question

Here's something I'm trying to understand. I was under the impression that UTF-8 was backwards compatible, so that I could always decode a text file as UTF-8, even if it's an ANSI file. But that doesn't seem to be the case:

In [1]: ansi_str = 'éµaØc'

In [2]: with open('test.txt', 'w', encoding='ansi') as f:
   ...:     f.write(ansi_str)
   ...:

In [3]: with open('test.txt', 'r', encoding='utf-8') as f:
   ...:     print(f.read())
   ...:
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-b0711b7b947e> in <module>
      1 with open('test.txt', 'r', encoding='utf-8') as f:
----> 2     print(f.read())
      3

c:\program files\python37\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

So it looks like if my code expects UTF-8 and is likely to encounter an ANSI-encoded file, I need to handle the UnicodeDecodeError. That's fine, but I would appreciate it if anyone could shed some light on my initial misunderstanding.

Thanks!

Answer 1

UTF-8 is backwards compatible with ASCII. Not ANSI. "ANSI" doesn't even describe any one particular encoding. And those characters you're testing with are well outside the ASCII range, so unless you actually encode them with UTF-8, you can't read them as UTF-8.
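
To make the difference concrete, here is a minimal sketch. It assumes that "ANSI" on the asker's machine means Windows code page 1252 (`cp1252`), which is only a guess based on the characters used; `decode_with_fallback` is a hypothetical helper name, not something from the standard library.

```python
# Assumption: 'cp1252' stands in for "ANSI" (typical on Western European Windows).
text = 'éµaØc'

# Pure ASCII bytes are also valid UTF-8, so this round trip always works.
ascii_bytes = 'abc'.encode('ascii')
assert ascii_bytes.decode('utf-8') == 'abc'

# Outside ASCII the two encodings produce different bytes.
ansi_bytes = text.encode('cp1252')  # one byte per character
utf8_bytes = text.encode('utf-8')   # two bytes for each non-ASCII character

try:
    ansi_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    print('not valid UTF-8:', exc.reason)


def decode_with_fallback(data, fallback='cp1252'):
    """Hypothetical helper: try UTF-8 first, then a legacy code page."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode(fallback)


print(decode_with_fallback(ansi_bytes))
print(decode_with_fallback(utf8_bytes))
```

Trying UTF-8 first is the safer order: text in a legacy code page that contains non-ASCII characters is rarely valid UTF-8 by accident, whereas a single-byte code page accepts nearly any byte sequence, so the fallback cannot tell you whether its guess was right.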

Answer 1, accepted 2019-06-04 14:00:59
