
Python: Unicode problem

I am trying to decode a string that I read from a file:

file = open("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]

'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00 \x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00 \x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00 \x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00 \x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00 \x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00 \x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00 \x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00 \x002\x000\x001\x000\x00\t\x00A\x00d\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00 \x00A\x00v\x00g\x00.\x00 \x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00 \x00F\x00r\x00o\x00m\x00 \x00W\x00e\x00b\x00 \x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'

Adding "ignore" does not really help:

In [69]: data[2]
Out[69]: u'\最\愀\爀\搀\攀\渀\ \氀\愀\洀\瀀\ \瀀\漀\猀\琀\ऀ\ \⸀\㤀\㐀\ऀ\㠀\㠀\ \ऀ\ⴀ\ऀ\㌀\㈀\ \ऀ\㌀\㤀\ \ऀ\㌀\㤀\ \ऀ\㐀\㠀\ \ऀ\㔀\㤀\ \ऀ\㔀\㤀\ \ऀ\㜀\㈀\ \ऀ\㜀\㈀\ \ऀ\㌀\㤀\ \ऀ\㌀\㈀\ \ऀ\㈀\㘀\ \ऀ\ⴀ\ऀ\ⴀ\ऀ\ꌀ\㈀\⸀\㄀\㠀\ऀ\ⴀ\ऀ\㐀\㠀\ \਀'

In [70]: data[2].decode("utf-8", "replace")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)

/Users/oleg/ in ()

/opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)

This looks like UTF-16 data, so try:

data[0].rstrip("\n").decode("utf-16")

Edit (for your update): try decoding the whole file at once, that is:

data = open(...).read()
data.decode("utf-16")

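The problem is that a line break in UTF-16-LE is the two-byte sequence "\n\x00", but readlines() on a file opened in byte mode splits right after the "\n" byte. That leaves the "\x00" byte at the start of the next line, so every line after the first is misaligned by one byte and decodes to garbage.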
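To make the split problem concrete, here is a minimal sketch using made-up sample data (the `text` value is a stand-in, not the contents of the real file). It builds a UTF-16-LE byte string by hand, splits it on the newline byte the way a byte-oriented readlines() would, and then decodes the whole thing at once for comparison. It should run on both Python 2.7 and Python 3.

```python
import codecs

# Hypothetical sample standing in for the first two lines of the file.
text = u"Keyword\tCompetition\nlamp post\t0.94\n"

# BOM followed by two bytes per character, low byte first (UTF-16-LE).
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Splitting on the single byte b"\n" cuts each two-byte newline
# b"\n\x00" in half, so the next piece starts with the orphaned b"\x00".
byte_lines = raw.split(b"\n")
second = byte_lines[1]

# Decoding first and splitting afterwards keeps every character intact.
decoded_lines = raw.decode("utf-16").splitlines()
```

Here `second` begins with a stray zero byte and has an odd length, which is why decoding it on its own cannot work, while `decoded_lines` holds the two clean Unicode lines.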

This file is a UTF-16-LE encoded file, with an initial BOM.

import codecs

# "a" is a placeholder for the path to your file.
fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()
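If you want to check the BOM yourself before choosing a codec, something along these lines might help. It is only a sketch: `sniff_utf16_bom` is a hypothetical helper name, the temporary file and its contents are invented for the demonstration, and the check does not tell UTF-16-LE apart from UTF-32-LE (whose BOM starts with the same two bytes).

```python
import codecs
import os
import tempfile


def sniff_utf16_bom(path):
    """Return 'utf-16-le', 'utf-16-be', or None, judging by the first two bytes."""
    with open(path, "rb") as fh:
        head = fh.read(2)
    if head == codecs.BOM_UTF16_LE:
        return "utf-16-le"
    if head == codecs.BOM_UTF16_BE:
        return "utf-16-be"
    return None


# Write a throwaway UTF-16-LE file with a BOM (made-up contents).
fd, sample_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(sample_path, "wb") as fh:
    fh.write(codecs.BOM_UTF16_LE
             + u"Keyword\tCompetition\nlamp post\t0.94\n".encode("utf-16-le"))

detected = sniff_utf16_bom(sample_path)

# The plain "utf-16" codec reads the BOM and picks the byte order itself.
with codecs.open(sample_path, "r", "utf-16") as fp:
    lines = fp.readlines()

os.remove(sample_path)
```

Because codecs.open decodes before it splits, `lines` comes back as Unicode strings with the newlines in the right place and the BOM already stripped.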

EDIT

Since you mentioned Python 2.7, here is a 2.7 solution:

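# Read bytes and decode the whole file at once; decoding line by line
# would split each two-byte UTF-16 newline in half.
with open("./Downloads/lamp-post.csv", "rb") as file:
    data = file.read().decode("utf-16", "replace").splitlines()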

To ignore undecodable characters instead:

# Same approach, but silently drop anything that cannot be decoded.
with open("./Downloads/lamp-post.csv", "rb") as file:
    data = file.read().decode("utf-16", "ignore").splitlines()
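For what it's worth, here is a small sketch of how the "replace" and "ignore" error handlers differ. The input is invented: a short UTF-16-LE string with one stray byte appended, which mimics the half-newline left over when the data is split in byte mode.

```python
import codecs

good = codecs.BOM_UTF16_LE + u"lamp".encode("utf-16-le")
broken = good + b"\x00"  # one stray byte: an incomplete UTF-16 code unit

# "replace" substitutes U+FFFD for the part that cannot be decoded.
replaced = broken.decode("utf-16", "replace")

# "ignore" drops the undecodable part silently.
ignored = broken.decode("utf-16", "ignore")

# The default ("strict") raises instead of guessing.
try:
    broken.decode("utf-16")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Both handlers only hide the damage. If most of the output turns into replacement characters or disappears, the bytes are probably misaligned and the file should be decoded as a whole.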
