python: unicode problem

Question

I am trying to decode a string I read from a file:

file = open("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]

'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00 \x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00 \x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00 \x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00 \x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00 \x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00 \x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00 \x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00 \x002\x000\x001\x000\x00\t\x00A\x00d\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00 \x00A\x00v\x00g\x00.\x00 \x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00 \x00F\x00r\x00o\x00m\x00 \x00W\x00e\x00b\x00 \x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'

Adding "ignore" does not really help either:

In [69]: data[2]
Out[69]: u'\最\愀\爀\搀\攀\渀\ \氀\愀\洀\瀀\ \瀀\漀\猀\琀\ऀ\　\⸀\㤀\㐀\ऀ\㠀\㠀\　\ऀ\ⴀ\ऀ\㌀\㈀\　\ऀ\㌀\㤀\　\ऀ\㌀\㤀\　\ऀ\㐀\㠀\　\ऀ\㔀\㤀\　\ऀ\㔀\㤀\　\ऀ\㜀\㈀\　\ऀ\㜀\㈀\　\ऀ\㌀\㤀\　\ऀ\㌀\㈀\　\ऀ\㈀\㘀\　\ऀ\ⴀ\ऀ\ⴀ\ऀ\ꌀ\㈀\⸀\㄀\㠀\ऀ\ⴀ\ऀ\㐀\㠀\　\਀'

In [70]: data[2].decode("utf-8", "replace")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)

/Users/oleg/ in ()

/opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)

In [71]:

Answer 1

This looks like UTF-16 data. So try

data[0].rstrip("\n").decode("utf-16")
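A rough way to check the UTF-16 guess is to look for the byte order mark and then decode one line. The sketch below is written for Python 3 bytes objects rather than the Python 2 str used in the question, and first_line is a shortened, hypothetical stand-in for data[0]:

```python
import codecs

# Hypothetical, shortened version of the first line from the question.
first_line = b"\xff\xfeK\x00e\x00y\x00\n"

# b"\xff\xfe" is the UTF-16 little-endian byte order mark (BOM).
has_le_bom = first_line.startswith(codecs.BOM_UTF16_LE)

# Dropping the lone trailing b"\n" leaves an even number of bytes.
# The "utf-16" codec then consumes the BOM and decodes the rest.
decoded = first_line.rstrip(b"\n").decode("utf-16")
```

This only works for the first line, because it is the only one that still carries the BOM and starts on a two-byte boundary.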

Edit (for your update): Try to decode the whole file at once, that is

data = open(...).read()
data.decode("utf-16")

The problem is that a line break in UTF-16-LE is the two bytes "\n\x00", but readlines() splits right after the "\n" byte, leaving the "\x00" byte at the start of the next line.
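To make the split problem concrete, here is a small sketch. It assumes Python 3 bytes semantics, and the sample text is made up to resemble the CSV in the question:

```python
import codecs

# Made-up sample resembling two rows of the tab-separated file.
text = u"Keyword\tCompetition\nlamp post\t0.94\n"

# Build the raw bytes explicitly as BOM + UTF-16-LE.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Splitting the raw bytes at b"\n" (what a byte-level readlines() does)
# cuts every line break in half.
chunks = raw.splitlines(True)
second_starts_with_nul = chunks[1].startswith(b"\x00")

# Decoding the whole buffer first and splitting afterwards keeps
# every two-byte code unit intact.
lines = raw.decode("utf-16").splitlines()
```

Here second_starts_with_nul is True, which is the misalignment described above, while lines holds the two rows as clean text.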

Answer 2

This file is a UTF-16-LE encoded file, with an initial BOM.

import codecs

fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()
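The same idea can be tried end to end on a throwaway file. This sketch swaps in io.open, which accepts an encoding argument on Python 2.6+ and Python 3 and which I assume behaves like codecs.open for this purpose. The sample text and the temporary file are hypothetical stand-ins for the real CSV:

```python
import io
import os
import tempfile

# Hypothetical sample standing in for the CSV from the question.
sample = u"Keyword\tCompetition\nlamp post\t0.94\n"

# Write it as UTF-16; the "utf-16" codec prepends a BOM when encoding.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-16"))

# Opening with an explicit encoding decodes before splitting into lines,
# so no line ends up with a stray NUL byte.
with io.open(path, "r", encoding="utf-16") as fp:
    lines = fp.readlines()

os.remove(path)
```

Because the decoder reads the BOM, there is no need to state the byte order explicitly.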

Answer 3

EDIT

Since you posted 2.7, this is the 2.7 solution:

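# Decode the whole file before splitting; decoding raw UTF-16 line by line misaligns the bytes.
file = open("./Downloads/lamp-post.csv", "rb")
data = file.read().decode("utf-16", "replace").splitlines(True)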

Ignoring undecodable characters:

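# Decode the whole file before splitting; decoding raw UTF-16 line by line misaligns the bytes.
file = open("./Downloads/lamp-post.csv", "rb")
data = file.read().decode("utf-16", "ignore").splitlines(True)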
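For reference, this is roughly how the two error handlers differ on input that stops in the middle of a two-byte unit. The sketch assumes Python 3, and the three-byte input is a contrived example, not data from the file:

```python
# 'a' encoded as UTF-16-LE, followed by one leftover byte (half a code unit).
truncated = b"a\x00b"

# "replace" substitutes U+FFFD for the undecodable tail.
replaced = truncated.decode("utf-16-le", "replace")

# "ignore" silently drops it.
ignored = truncated.decode("utf-16-le", "ignore")

# The default "strict" handler raises instead.
try:
    truncated.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Neither handler repairs misaligned data; they only decide whether the damage shows up as a replacement character, disappears, or raises an error.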

solution1
15 ACCEPTED 2011-01-19 13:10:53

solution2
7 2011-02-13 11:42:21

solution3
3 2011-01-19 13:08:16