UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 4

Question

I am using Python 2.7.3 and trying to read a text, count the words in it, and write the words along with their counts to a text file. The input file (XML) contains the following text:

But what my right hon. Friend the Member for Chingford (Mr. Tebbit) did not knowߞneither did I-was Mr. Lynn's record as a politician.

I keep getting the notorious error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 4: ordinal not in range(128), which I believe is the result of my failure to decode/encode this character. The relevant code is:

import codecs, sys
sys.stdout = codecs.lookup('utf-8')[-1](sys.stdout)  
f = open(fullfile, 'rU')#, 'rU')#read as unicode
Sraw = f.read()
Sraw = Sraw.decode('utf8','ignore').encode('utf8','ignore')# modified, doesn't help

The program dies when I try to append the words to a list (or print them, eg):

words =(nltk.wordpunct_tokenize(sentence.strip()))
dwords.extend(words)

I know that decode is used to convert strings to unicode and encode is supposed to do the opposite and tried to change my code accordingly, but can't figure out how to fix this. Any advice is greatly appreciated.

Answer 1

Use the unidecode package, which transliterates non-ASCII characters to their closest ASCII equivalents:

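from unidecode import unidecode
# unidecode expects unicode text, so decode the raw bytes first
# (assuming the file is UTF-8)
Sraw = unidecode(f.read().decode('utf-8'))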

This will do the trick.

Answer 2

'U' is not for unicode support; it's for universal newlines:

In addition to the standard fopen() values mode may be 'U' or 'rU'. Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.
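
To see that this setting only affects line endings, here is a minimal sketch. It uses io.open (available in Python 2.6+ and Python 3) rather than the 'U' flag, on the assumption that its default newline=None applies the same universal-newlines translation; the temporary file and its contents are made up for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample: three lines, each with a different line ending.
raw = b"one\r\ntwo\rthree\n"

fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, "wb") as f:
    f.write(raw)

# Text mode with the default newline=None translates every line ending
# to "\n". Decoding is controlled separately, by the encoding argument.
with io.open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()

os.remove(path)
print(lines)
```

All three lines come back ending in '\n', regardless of how they were terminated on disk.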

If your file is encoded with UTF-8, you need to open it with codecs.open and give it the correct encoding:

import codecs

with codecs.open(filename, mode='r', encoding='utf-8') as f:
    for line in f:
        pass  # do stuff with line, which is already decoded to unicode
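
Applied to the task in the question (count the words, write the counts to a file), a minimal sketch might look like the following. It uses io.open, which accepts an encoding argument in the same way and exists in both Python 2.6+ and Python 3. The file names, the sample sentence, and the regex tokenizer standing in for nltk.wordpunct_tokenize are all assumptions made for illustration.

```python
import io
import os
import re
import tempfile
from collections import Counter

# Hypothetical input: a UTF-8 file containing the non-ASCII character
# U+07DE, whose UTF-8 encoding starts with the byte 0xdf.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")
out_path = os.path.join(workdir, "counts.txt")
with io.open(in_path, "wb") as f:
    f.write(u"did not know\u07deneither did I".encode("utf-8"))

# Decode while reading: every line arrives as unicode text, not bytes.
counts = Counter()
with io.open(in_path, "r", encoding="utf-8") as f:
    for line in f:
        # Regex stand-in for the tokenizer: runs of word characters
        # or runs of punctuation.
        counts.update(re.findall(r"\w+|[^\w\s]+", line, re.UNICODE))

# Encode while writing: io.open converts the text back to UTF-8 bytes.
with io.open(out_path, "w", encoding="utf-8") as f:
    for word, n in counts.most_common():
        f.write(u"%s\t%d\n" % (word, n))
```

The point of the design is that bytes are decoded once at the input boundary and encoded once at the output boundary, so everything in between handles unicode text only.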

I know that decode is used to convert strings to unicode and encode is supposed to do the opposite

That is the right way round. Think of it like this:

Decode means "take bytes, and convert them into characters"
Encode means "take characters, and return the bytes"

The mapping between each character and its bytes is the "encoding", and this is where you specify utf-8 or another encoding. When encoding, Python looks up each character in the correct table to get its byte value; when decoding, it looks up the bytes and converts them to the correct characters. In Python 2, mixing a byte string with unicode triggers an implicit decode with the ascii codec, which is where the error in the question comes from.
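
A short round trip makes the direction of each call concrete. This is a sketch using the character from the question (U+07DE); the variable names are illustrative. The last step decodes the UTF-8 bytes with the ascii codec explicitly, which is what Python 2 does implicitly when a byte string is mixed with unicode.

```python
# The character from the question, written as an escape so the source
# file itself stays pure ASCII.
text = u"know\u07deneither"

# encode: unicode text -> bytes, using the table named by the encoding.
data = text.encode("utf-8")

# decode: bytes -> unicode text, using the same table in reverse.
assert data.decode("utf-8") == text

# Decoding the same bytes as ASCII fails at the first byte above 0x7f,
# which here is 0xdf at position 4.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```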

2 answers

solution1
2 ACCEPTED 2014-08-04 11:04:43

solution2
1 2014-08-04 11:22:54