Python - read text file with weird utf-16 format

I'm trying to read a text file into Python, but it seems to use some very strange encoding. I tried the usual approach:

file = open('data.txt','r')

lines = file.readlines()

for line in lines[0:1]:
    print line,
    print line.split()

Output:

0.0200197   1.97691e-005

['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']

Printing the line works fine, but when I split the line so that I can convert the fields to floats, the result looks crazy. Of course, trying to convert those strings to floats produces an error. Any idea how I can convert these back into numbers?

I put the sample data file here if you would like to try loading it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they can't deal with this crazy file either.

I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
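A small sketch of what that looks like at the byte level. It is written for Python 3 (the other snippets in this thread are Python 2), and the sample string is simply the first number from the question's output:

```python
# Encode an ASCII-only string as UTF-16-LE and inspect the raw bytes.
text = '0.0200197'
raw = text.encode('utf-16-le')

# Every ASCII character is followed by a null byte.
print(raw)

# Taking every second byte recovers the plain ASCII encoding.
ascii_bytes = raw[::2]
print(ascii_bytes)

# Decoding with the matching codec gives the original text back,
# which float() can parse.
decoded = raw.decode('utf-16-le')
print(float(decoded))
```

The `raw` value here is the same byte pattern as the first element of the list printed in the question.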

To fix this, just decode the data:

print line.decode('utf-16-le').split()

Or do the same thing at the file level with the io or codecs module:

import io
file = io.open('data.txt', 'r', encoding='utf-16-le')
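As a rough end-to-end sketch of the file-level approach in Python 3 (where io.open is the same function as the built-in open): the temporary file below stands in for data.txt, and its contents, including the second row and the single-space separator, are made up for illustration.

```python
import io
import os
import tempfile

# Build a small UTF-16-LE file to stand in for data.txt.
# The contents are hypothetical, modelled on the question's first line.
sample = '0.0200197 1.97691e-005\n0.0400394 2.5e-005\n'
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-16-le'))

# Let io.open do the decoding, so each line is already text
# by the time it is split and converted.
rows = []
with io.open(path, 'r', encoding='utf-16-le') as f:
    for line in f:
        rows.append([float(field) for field in line.split()])

os.remove(path)
print(rows)
```

Decoding at the file level means no later step ever sees the raw bytes, so split and float behave as they would on an ordinary text file.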

* This is a bit of an oversimplification: each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably don't care about these details.
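For the curious, the footnote can be checked with a short Python 3 sketch. The two characters are arbitrary picks for illustration: 'A' is in the BMP, and U+1F600 is outside it.

```python
# A BMP character is one 16-bit code unit, i.e. two bytes.
bmp = 'A'.encode('utf-16-le')

# A non-BMP character becomes a surrogate pair: two code units, four bytes.
non_bmp = '\U0001F600'.encode('utf-16-le')

print(len(bmp), len(non_bmp))

# The four bytes are the surrogates U+D83D and U+DE00, each little-endian.
print(non_bmp.hex())
```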

Looks like UTF-16 to me.

>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'

You can work directly off the Unicode strings:

>>> float(test_utf16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001

Or encode them to something different, if you prefer:

>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001

Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the UTF-16 encoded form. The UTF-16 representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
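A minimal Python 3 sketch of that effect, assuming the two columns are separated by a single space (the real file's separator is not shown in the question):

```python
# One line of the file as raw UTF-16-LE bytes.
raw_line = '0.0200197 1.97691e-005'.encode('utf-16-le')

# Splitting the raw bytes removes the space byte (0x20) but leaves its
# trailing null byte stuck to the front of the second field.
bad_fields = raw_line.split()
print(bad_fields)

# Decoding first and splitting afterwards gives clean text fields.
good_fields = raw_line.decode('utf-16-le').split()
print(good_fields)
print([float(field) for field in good_fields])
```

The misaligned second field in `bad_fields` is why decoding has to happen before any splitting.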

The io library (Python 2.6 and later) can handle this for you, as can the older codecs library. io handles linefeeds better, so it's preferable if available.

This is really just @abarnert's suggestion, but I wanted to post it as an answer since it is the simplest solution and the one that I ended up using:

    file = io.open(filename,'r',encoding='utf-16-le')
    data = np.loadtxt(file,skiprows=8)

This demonstrates how you can create a file object with io.open using whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.

This piece of code will do what is needed:

file_handle=open(file_name,'rb')
file_first_line=file_handle.readline()
file_handle.close()
print file_first_line
if '\x00' in file_first_line:
    file_first_line=file_first_line.replace('\x00','')
    print file_first_line

If you call file_first_line.split() before replacing, the output contains '\x00'. I just tried replacing '\x00' with an empty string, and it worked.
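One caveat worth checking before relying on this: stripping null bytes only gives the right result when every character in the file is ASCII. A quick Python 3 sketch, with sample strings chosen purely for illustration:

```python
# ASCII-only text: removing the null bytes happens to leave valid ASCII.
ascii_raw = '0.0200197'.encode('utf-16-le')
stripped_ascii = ascii_raw.replace(b'\x00', b'')

# 'é' (U+00E9) encodes as b'\xe9\x00'; stripping leaves a lone 0xE9 byte,
# which is no longer UTF-16 and is not valid UTF-8 either.
accented_raw = 'caf\xe9'.encode('utf-16-le')
stripped_accented = accented_raw.replace(b'\x00', b'')

# U+0100 encodes as b'\x00\x01'; stripping deletes half of the character.
macron_raw = '\u0100'.encode('utf-16-le')
stripped_macron = macron_raw.replace(b'\x00', b'')

print(stripped_ascii, stripped_accented, stripped_macron)

# Decoding with the right codec handles all three cases.
print(accented_raw.decode('utf-16-le'), macron_raw.decode('utf-16-le'))
```

For a purely numeric file like the one in the question the replacement is harmless, but decoding as UTF-16-LE is the safer general fix.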
