`io.open` not correctly opening UTF-16 files

Question

io.open is supposed to strip preambles when opening files in various encodings.

For instance, the following file encoded with UTF-8-SIG has its preamble stripped correctly before it is read into a string:

(Note: I'm not normally opening these files in binary mode. The first line of each log below only demonstrates the raw contents of the file that is about to be read.)

# Raw binary, so you can see that it's a proper UTF-8-SIG encoded file
import io; io.open(csv_file_path, 'br').readline()
'\xef\xbb\xbf"EventId","Rate","Attribute1","Attribute2","(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89"\r\n'

# Open file with encoding specified
import io; io.open(csv_file_path, encoding='UTF-8-SIG').readline()
u'"EventId","Rate","Attribute1","Attribute2","(\uff61\uff65\u03c9\uff65\uff61)\uff89"\n'

But while this UTF-16LE encoded file opens successfully, the preamble comes along with it:

# Raw binary, so you can see that it's a proper UTF-16LE encoded file
import io; io.open(csv_file_path, 'br').readline()
'\xff\xfe"\x00E\x00v\x00e\x00n\x00t\x00I\x00d\x00"\x00,\x00"\x00R\x00a\x00t\x00e\x00"\x00,\x00"\x00A\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x001\x00"\x00,\x00"\x00A\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x002\x00"\x00,\x00"\x00(\x00a\xffe\xff\xc9\x03e\xffa\xff)\x00\x89\xff"\x00\r\x00\n'

# Open file with encoding specified
import io; io.open(csv_file_path, encoding='UTF-16LE').readline()
u'\ufeff"EventId","Rate","Attribute1","Attribute2","(\uff61\uff65\u03c9\uff65\uff61)\uff89"\n'

This goes on to break file validation that expects the file contents to start right off with "EventId"...

Am I opening this file incorrectly?

Note that I'm not satisfied with having to manually strip out preambles after opening the file. I want to support arbitrary encodings, and I expect io.open (with the correct encoding supplied, as determined by chardet) to abstract away the need for a bunch of hard-coded preambles to skip if encountered at the beginning of the first line.

Answer 1

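According to this answer, you need to use UTF-16 rather than UTF-16LE. The UTF-16 codec reads the byte order mark to determine the byte order and then drops it, whereas UTF-16LE already fixes the byte order, so a leading BOM is decoded as an ordinary U+FEFF character.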

io.open(csv_file_path, encoding='UTF-16').readline()
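The difference can be checked with a small self-contained sketch. It writes a temporary UTF-16LE file with a BOM and reads it back under both codec names. The helper `bom_aware_encoding` is hypothetical (it is not part of `io` or chardet); it assumes the detected encoding name arrives as a string such as `'UTF-16LE'` and swaps in the BOM-consuming codec only when the file really starts with the matching BOM.

```python
import codecs
import io
import os
import tempfile

# Hypothetical lookup table: (BOM bytes, explicit codec name, BOM-consuming codec).
_BOM_AWARE = [
    (codecs.BOM_UTF8, "utf-8", "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le", "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16-be", "utf-16"),
    (codecs.BOM_UTF32_LE, "utf-32-le", "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32-be", "utf-32"),
]


def bom_aware_encoding(path, detected):
    """Return a codec name that strips the BOM if the file starts with one."""
    # codecs.lookup normalizes spellings such as 'UTF-16LE' to 'utf-16-le'.
    name = codecs.lookup(detected).name
    with io.open(path, "rb") as f:
        head = f.read(4)
    for bom, explicit_name, replacement in _BOM_AWARE:
        if name == explicit_name and head.startswith(bom):
            return replacement
    return detected


text = u'"EventId","Rate"\n'
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
try:
    # Write a UTF-16LE file that begins with a BOM, like the one in the question.
    with io.open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))

    # Explicit byte order: the BOM is decoded as a U+FEFF character.
    with io.open(path, encoding="utf-16-le") as f:
        explicit = f.readline()

    # Generic codec: the BOM selects the byte order and is dropped.
    with io.open(path, encoding="utf-16") as f:
        generic = f.readline()

    # Helper applied to a detected name such as the one chardet might report.
    with io.open(path, encoding=bom_aware_encoding(path, "UTF-16LE")) as f:
        helped = f.readline()
finally:
    os.remove(path)

print(repr(explicit))
print(repr(generic))
print(repr(helped))
```

Checking for the BOM before switching codecs matters because a BOM-less file read with `utf-16` falls back to the platform's native byte order, which may not match the detected one.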

Answer 1
2 · Accepted · 2014-10-17 23:21:40
