Python gets the wrong encoding for UTF-8 characters?

Question

I'm trying to fetch text with special characters from a website, and the string Python returns is full of "\x" escapes. However, it seems that the encoding is wrong. For example, when fetching:

th = urllib2.urlopen('http://norse.ulver.com/dct/zoega/th.html')

the <h1> line of the webpage should contain the letter "Þ", which has byte number C39E and Unicode code point DE according to http://www.fileformat.info/info/charset/UTF-8/list.htm

Instead, I get

'<h1>\xc3\x9e</h1>'

with the byte number split in two, so that when writing the line to a file and then opening it with a Unicode encoding, I get "Ãž" instead of "Þ".

How can I force Python to encode such a character as \uC39E or \xde instead of \xc3\x9e?

Answer 1

That's the correct UTF-8 byte encoding of U+00DE, which takes two bytes to represent (\xc3 and \x9e). You need to decode it to Unicode to see the Unicode codepoint:

>>> '<h1>\xc3\x9e</h1>'.decode('utf8')
u'<h1>\xde</h1>'

The above is a Unicode string showing the correct Unicode codepoint. Printing it on a UTF-8 console:

>>> print '<h1>\xc3\x9e</h1>'.decode('utf8')
<h1>Þ</h1>
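
Since the question mentions writing the line to a file, here is a minimal sketch of the same decode step followed by a file round trip. It assumes the fetched bytes are exactly the ones shown in the question; the temporary path and the variable names are hypothetical. It uses io.open with an explicit encoding so that the file is written and read as UTF-8 rather than with a platform default, and it should behave the same under Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Bytes as they arrive from the HTTP response (copied from the question).
raw = b'<h1>\xc3\x9e</h1>'

# Decode the UTF-8 bytes into text; the two bytes C3 9E become one code point, U+00DE.
text = raw.decode('utf-8')

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'th.html')

# Write and read back with an explicit encoding.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()

print(len(raw), len(text))   # byte count vs. code point count
print(round_trip == text)
```

The byte string is one element longer than the decoded text because "Þ" occupies two bytes in UTF-8 but is a single code point.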

If you use the wrong encoding to decode, you get different Unicode codepoints, in this case U+00C3 and U+017E. \xc3 is the escape form used in a Unicode string for codepoints < U+0100, whereas \u017e is the form used for codepoints < U+10000:

>>> '<h1>\xc3\x9e</h1>'.decode('cp1252')
u'<h1>\xc3\u017e</h1>'
>>> print '<h1>\xc3\x9e</h1>'.decode('cp1252')
<h1>Ãž</h1>
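
As a rough sketch of how the mis-decoding arises and how it can sometimes be undone, the snippet below decodes the same bytes both ways and then reverses the wrong decode. The variable names are hypothetical. The repair step assumes the text was mis-decoded as cp1252 and that every original byte has a cp1252 mapping, which is not guaranteed for arbitrary UTF-8 input.

```python
# Bytes from the question: the UTF-8 encoding of U+00DE inside an <h1> tag.
raw = b'<h1>\xc3\x9e</h1>'

wrong = raw.decode('cp1252')   # each byte read as its own character
right = raw.decode('utf-8')    # the two bytes read as one character

# Undo the wrong decode: re-encode with the codec that was wrongly applied
# to recover the original bytes, then decode them as UTF-8.
repaired = wrong.encode('cp1252').decode('utf-8')

print([hex(ord(c)) for c in wrong[4:6]])   # code points from the wrong decode
print([hex(ord(c)) for c in right[4:5]])   # code point from the correct decode
print(repaired == right)
```

Decoding with the right codec in the first place is the more reliable fix; the round trip is only useful when the mis-decoded text is all that is left.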
