I'm trying to fetch text with special characters from a website, and the string Python returns is full of "\x" escapes. However, the encoding seems to be wrong. For example, when fetching:
th = urllib2.urlopen('http://norse.ulver.com/dct/zoega/th.html')
the line at level <h1> of the webpage should contain the letter "Þ", whose UTF-8 encoding is the byte sequence C3 9E and whose Unicode code point is U+00DE, according to http://www.fileformat.info/info/charset/UTF-8/list.htm
Instead, I get
'<h1>\xc3\x9e</h1>'
with the byte number split in two, so that when I write the line to a file and then open it with a Unicode encoding, I get "Ãž" instead of "Þ".
How can I force Python to encode such a character as \xc39e or \xde instead of \xc3\x9e?
That's the correct UTF-8 byte encoding of U+00DE: it takes two bytes (\xc3 and \x9e) to represent it. You need to decode it to Unicode to see the single code point:
>>> '<h1>\xc3\x9e</h1>'.decode('utf8')
u'<h1>\xde</h1>'
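If the goal is to see the single \xde escape rather than the two raw bytes, the repr of the decoded string already shows it. A minimal Python 3 sketch of the same decode (the thread above uses Python 2; this is the modern equivalent, not the asker's exact code):

```python
# The HTTP response body arrives as bytes; decode with the page's charset.
raw = b'<h1>\xc3\x9e</h1>'   # what urlopen(...).read() would return for that line
text = raw.decode('utf-8')   # two UTF-8 bytes -> one code point, U+00DE

# ascii() escapes non-ASCII characters, so the single \xde escape is visible.
print(ascii(text))           # '<h1>\xde</h1>'
```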
The above is a Unicode string showing the correct Unicode code point. Printing it on a UTF-8 console:
>>> print '<h1>\xc3\x9e</h1>'.decode('utf8')
<h1>Þ</h1>
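The same applies when writing to a file: open it with an explicit encoding and write the decoded text, and a reader that uses the same encoding will see "Þ" rather than mojibake. A sketch in Python 3 (the filename is made up):

```python
# Decode the fetched bytes once, then write text through an explicitly
# UTF-8-encoded file object instead of writing raw bytes.
text = b'<h1>\xc3\x9e</h1>'.decode('utf-8')

with open('th.html', 'w', encoding='utf-8') as f:
    f.write(text)
```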
If you use the wrong encoding to decode, you get different Unicode code points, in this case U+00C3 and U+017E. \xc3 is the escape form used in a Unicode string for code points below U+0100, whereas \u017e is the form used for code points below U+10000:
>>> '<h1>\xc3\x9e</h1>'.decode('cp1252')
u'<h1>\xc3\u017e</h1>'
>>> print '<h1>\xc3\x9e</h1>'.decode('cp1252')
<h1>Ãž</h1>
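If text has already been mis-decoded this way (so you are holding "Ãž" instead of "Þ"), the damage can often be undone by re-encoding with the wrong codec to recover the original bytes, then decoding as UTF-8. A sketch, which only works when every garbled character round-trips through the wrong codec:

```python
# '<h1>Ãž</h1>' is the result of decoding UTF-8 bytes as cp1252.
garbled = '<h1>\xc3\u017e</h1>'

# Encode back to the original byte sequence, then decode it correctly.
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired)   # <h1>Þ</h1>
```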