Is "utf-8-sig" suitable for decoding both UTF-8 and UTF-8 BOM?

Question

I am using the Python csv library to read two CSV files.

One is encoded as UTF-8 with a BOM, the other as plain UTF-8. In practice, I found that both files can be read with "utf-8-sig" as the encoding:

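from csv import reader

# file_path is the path to either of the two CSV files
with open(file_path, encoding='utf-8-sig', newline='') as csv_file:
    c_reader = reader(csv_file, delimiter=',')
    headers = next(c_reader)  # the first row holds the column names
    for row in c_reader:
        print(row)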

I want to confirm: is "utf-8-sig" suitable for decoding both UTF-8 and UTF-8 with a BOM? I am using Python 3.6 and 3.7. Thanks for your answers!

Answer 1

The utf-8-sig codec will decode both utf-8-sig-encoded text and text encoded with the standard utf-8 encoding:

>>> s = 'Straße'
>>> utf8_sig = s.encode('utf-8-sig')
>>> utf8 = s.encode('utf-8')
>>> print(utf8_sig.decode('utf-8-sig'))
Straße
>>> print(utf8.decode('utf-8-sig'))
Straße
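To tie this back to the CSV case in the question, here is a minimal sketch that writes the same rows once as plain UTF-8 and once with a BOM, then reads both back. The helper `read_rows`, the sample rows, and the temporary file names are assumptions made up for illustration; they are not from the question.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Read every row of a CSV file using the given text encoding."""
    with open(path, encoding=encoding, newline='') as f:
        return list(csv.reader(f))


# Hypothetical sample data, not taken from the question.
rows = [['name', 'city'], ['Anna', 'Straße']]

with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for enc in ('utf-8', 'utf-8-sig'):
        # Writing with 'utf-8-sig' puts a BOM at the start of the file.
        paths[enc] = os.path.join(tmp, enc + '.csv')
        with open(paths[enc], 'w', encoding=enc, newline='') as f:
            csv.writer(f).writerows(rows)

    # Both files decode to the same rows with 'utf-8-sig'.
    plain_via_sig = read_rows(paths['utf-8'], 'utf-8-sig')
    bom_via_sig = read_rows(paths['utf-8-sig'], 'utf-8-sig')

    # Reading the BOM file as plain 'utf-8' leaves U+FEFF in the first header.
    bom_via_plain = read_rows(paths['utf-8-sig'], 'utf-8')

print(plain_via_sig == bom_via_sig)
print(repr(bom_via_plain[0][0]))
```

The last line shows the practical risk of using plain utf-8 on a BOM file: the first column name picks up an invisible leading character, so a lookup by header name can fail.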

From the codecs docs:

Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written... On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
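The quoted behaviour can be checked at the byte level. This is a small sketch using only the standard library; the variable names are arbitrary.

```python
import codecs

text = 'Straße'
with_bom = text.encode('utf-8-sig')
without_bom = text.encode('utf-8')

# Encoding with utf-8-sig prepends the three BOM bytes 0xef, 0xbb, 0xbf.
starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)
same_payload = with_bom[len(codecs.BOM_UTF8):] == without_bom

# Decoding with utf-8-sig drops a leading BOM; plain utf-8 keeps it as U+FEFF.
stripped = with_bom.decode('utf-8-sig')
leftover = with_bom.decode('utf-8')

# Only a BOM at the very start is skipped, not one in the middle of the data.
middle = (without_bom + codecs.BOM_UTF8 + without_bom).decode('utf-8-sig')

print(starts_with_bom, same_payload)
print(repr(stripped), repr(leftover), repr(middle))
```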

The utf-8-sig encoding is most common in Windows environments. If you're sharing files with users on Mac or *nix systems, the standard utf-8 encoding is what they would expect to receive.
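If a file does need to go to users who expect plain utf-8, one possible approach is to read it with utf-8-sig and write it back with utf-8. The function name `rewrite_without_bom` and the file names below are hypothetical, and the sketch reads the whole file into memory, so it assumes the file is reasonably small.

```python
import codecs
import os
import tempfile


def rewrite_without_bom(src_path, dst_path):
    """Copy a UTF-8 text file, dropping a leading BOM if one is present."""
    # newline='' keeps the original line endings untouched in both directions.
    with open(src_path, encoding='utf-8-sig', newline='') as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'windows_export.csv')  # hypothetical file name
    dst = os.path.join(tmp, 'shared.csv')          # hypothetical file name

    payload = 'name,city\r\nAnna,Straße\r\n'.encode('utf-8')
    with open(src, 'wb') as f:
        f.write(codecs.BOM_UTF8 + payload)

    rewrite_without_bom(src, dst)

    with open(dst, 'rb') as f:
        result = f.read()

print(result.startswith(codecs.BOM_UTF8))
```

Because utf-8-sig also accepts input without a BOM, the same function can be run on files that are already plain utf-8 without changing their content.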

Answer 1: 4, accepted, 2020-08-20 15:53:15
