Python - Reading a text file in a strange UTF-16 format

Question

I'm trying to read a text file into Python, but it seems to use some very strange encoding. I tried the usual approach:

file = open('data.txt','r')

lines = file.readlines()

for line in lines[0:1]:
    print line,
    print line.split()

Output:

0.0200197   1.97691e-005

['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']

Printing the line works fine, but once I split the line so that I can convert the pieces to floats, the result looks crazy. Of course, when I try to convert those strings to floats, I get an error. Any idea how I can convert these back into numbers?

I put the sample datafile here if you would like to try to load it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they cannot handle this file either.

Answer 1

I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16-LE encoding looks like the ASCII encoding with an extra '\x00' after each character.
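The byte layout described above can be checked directly. This is a minimal sketch written for Python 3 (where bytes and text are separate types), not the Python 2 used elsewhere in this thread; the sample string is the first value from the question's output.

```python
# Python 3 sketch: how ASCII text looks once encoded as UTF-16-LE.
text = "0.0200197"

ascii_bytes = text.encode("ascii")
utf16_bytes = text.encode("utf-16-le")

print(ascii_bytes)   # the plain ASCII bytes
print(utf16_bytes)   # the same bytes, each followed by a zero byte

# Even positions hold the ASCII bytes, odd positions hold the zero high bytes.
assert utf16_bytes[0::2] == ascii_bytes
assert set(utf16_bytes[1::2]) == {0}
assert len(utf16_bytes) == 2 * len(ascii_bytes)
```

The encoded bytes match the first element of the list printed in the question.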

To fix this, just decode the data:

print line.decode('utf-16-le').split()

Or do the same thing at the file level with the io or codecs module:

import io
file = io.open('data.txt','r', encoding='utf-16-le')

* This is a bit of an oversimplification: each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably don't care about these details.
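For readers who do care about the footnote, here is a small Python 3 sketch of the two cases. The two sample characters are arbitrary choices for illustration and do not come from the question's data.

```python
# Python 3 sketch: byte counts for BMP and non-BMP characters in UTF-16-LE.
bmp_char = "\u00e9"          # U+00E9, inside the Basic Multilingual Plane
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP

bmp_bytes = bmp_char.encode("utf-16-le")
non_bmp_bytes = non_bmp_char.encode("utf-16-le")

print(len(bmp_bytes))        # 2: one 16-bit code unit
print(len(non_bmp_bytes))    # 4: a surrogate pair, two 16-bit code units
print(non_bmp_bytes.hex())   # 3dd800de: surrogates D83D and DE00, little-endian
```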

Answer 2

Looks like UTF-16 to me.

>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'

You can work directly with the decoded Unicode strings:

>>> float(test_utf16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001

Or encode them to something different, if you prefer:

>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001

Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the UTF-16 encoded form. The UTF-16 representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
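The effect of splitting before decoding can be reproduced with a short Python 3 sketch. It assumes the two columns are separated by a single tab, which is a guess that happens to reproduce the list shown in the question; the real separator in the file is not known.

```python
# Python 3 sketch: splitting raw UTF-16-LE bytes versus splitting decoded text.
# Assumption: a single tab separates the two columns.
raw = "0.0200197\t1.97691e-005".encode("utf-16-le")

# Splitting the raw bytes removes the tab byte but leaves its zero byte,
# which ends up at the front of the second field.
broken = raw.split()
print(broken)

# Decoding first gives clean text fields that float() accepts.
fields = raw.decode("utf-16-le").split()
values = [float(field) for field in fields]
print(values)
```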

The io library in Python 2.6 and later can handle this for you, as can the older codecs library. io handles linefeeds better, so it's preferable if available.

Answer 3

This is really just @abarnert's suggestion, but I wanted to post it as an answer since it is the simplest solution and the one that I ended up using:

    import io
    import numpy as np

    file = io.open(filename,'r',encoding='utf-16-le')
    data = np.loadtxt(file,skiprows=8)

This demonstrates how you can create a file object with io.open using whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.
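The same idea can be tried without the original data file. The sketch below is Python 3 and standard library only: it writes a tiny UTF-16-LE sample to a temporary file and parses it by hand instead of calling np.loadtxt. The tab separator and the second data row are made up for illustration.

```python
import os
import tempfile

# Write a small UTF-16-LE sample so the sketch is self-contained.
# The second row and the tab separator are hypothetical.
sample = "0.0200197\t1.97691e-005\n0.0400394\t2.5e-005\n"
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w", encoding="utf-16-le", newline="") as f:
    f.write(sample)

# Opening with the matching encoding yields ordinary text lines,
# so split() and float() behave as expected.
with open(path, "r", encoding="utf-16-le") as f:
    rows = [[float(x) for x in line.split()] for line in f if line.strip()]

print(rows)
```

In Python 3 the built-in open is the same function as io.open, so the file object opened for reading here is the kind of object the answer passes to np.loadtxt.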

Answer 4

This piece of code does what is needed:

file_handle=open(file_name,'rb')
file_first_line=file_handle.readline()
file_handle.close()
print file_first_line
if '\x00' in file_first_line:
    file_first_line=file_first_line.replace('\x00','')
    print file_first_line

If you call 'file_first_line.split()' before the replacement, the output contains '\x00'. I simply replaced '\x00' with an empty string, and it worked.
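One caveat about this approach: removing zero bytes recovers the text only when every character is ASCII. The Python 3 sketch below illustrates the difference; the non-ASCII sample character is an arbitrary choice and is not taken from the question's file.

```python
# Python 3 sketch: stripping zero bytes versus decoding UTF-16-LE.
ascii_line = "0.0200197".encode("utf-16-le")

# For pure ASCII content, removing zero bytes happens to give the right bytes.
stripped_ascii = ascii_line.replace(b"\x00", b"")
print(stripped_ascii)

# U+0100 is encoded as the bytes 00 01, so its low byte is the zero byte.
other = "\u0100".encode("utf-16-le")
stripped_other = other.replace(b"\x00", b"")
print(stripped_other)             # a single 0x01 byte, not the original character

# Decoding round-trips both cases.
print(other.decode("utf-16-le") == "\u0100")
```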

4 answers

Answer 1: 16, accepted, 2013-10-11 23:50:06

Answer 2: 2, 2013-10-11 23:48:58

Answer 3: 1, 2015-09-03 15:41:21

Answer 4: 0, 2017-01-31 12:09:14
