Python gets the wrong encoding for UTF-8 characters?

I'm trying to fetch text with special characters from a website, so the string Python returns is full of "\x" escapes. However, the encoding seems to be wrong. For example, when fetching:

import urllib2
th = urllib2.urlopen('http://norse.ulver.com/dct/zoega/th.html')

the <h1> line of the webpage should contain the letter "Þ", which has the UTF-8 byte sequence C3 9E and the Unicode code point U+00DE according to http://www.fileformat.info/info/charset/UTF-8/list.htm

Instead, I get

'<h1>\xc3\x9e</h1>'

with the byte sequence split in two, so that when I write the line to a file and then open it with a Unicode encoding, I get "Ãž" instead of "Þ".

How can I force Python to encode such a character as \uC39E or \xde instead of \xc3\x9e?

That's the correct UTF-8 byte encoding of U+00DE, which takes two bytes to represent (\xc3 and \x9e). To see the Unicode code point, you need to decode the bytes to Unicode:

>>> '<h1>\xc3\x9e</h1>'.decode('utf8')
u'<h1>\xde</h1>'

The above is a Unicode string showing the correct Unicode code point. Printing it on a UTF-8 console:

>>> print '<h1>\xc3\x9e</h1>'.decode('utf8')
<h1>Þ</h1>
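
The same decode step can be sketched in Python 3, where the byte string and the decoded text are distinct types (bytes and str). This is only an illustration under that assumption: the answer above uses Python 2, and the scratch file path and the variable names are hypothetical. It also shows one way to write the decoded text to a file with an explicit encoding, which is the step the question got stuck on.

```python
import io
import os
import tempfile

raw = b'<h1>\xc3\x9e</h1>'      # the bytes as received from the server

text = raw.decode('utf-8')      # bytes -> text made of Unicode code points
thorn = text[4]                 # the single character between the tags

print(hex(ord(thorn)))          # the code point of that character
print(thorn.encode('utf-8'))    # its two-byte UTF-8 encoding

# Write the text out with an explicit encoding instead of relying on a default.
path = os.path.join(tempfile.mkdtemp(), 'th.html')  # hypothetical scratch file
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading it back with the same encoding returns the same text.
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

The point is that \xc3\x9e and \xde are not competing encodings: the first is how the character is stored as UTF-8 bytes, the second is its code point once decoded.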

If you use the wrong encoding to decode, you get different Unicode code points, in this case U+00C3 and U+017E. \xc3 is an escape code in a Unicode string for code points < U+0100, whereas \u017e is one for code points < U+10000:

>>> '<h1>\xc3\x9e</h1>'.decode('cp1252')
u'<h1>\xc3\u017e</h1>'
>>> print '<h1>\xc3\x9e</h1>'.decode('cp1252')
<h1>Ãž</h1>
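
As a rough Python 3 sketch of the same mix-up (again an assumption, since the answer is written for Python 2, and the variable names are illustrative), the snippet below decodes the same bytes both ways and compares the resulting code points. It also shows that, for this particular input, re-encoding the wrongly decoded text as cp1252 recovers the original bytes.

```python
raw = b'<h1>\xc3\x9e</h1>'

wrong = raw.decode('cp1252')    # each byte is mapped to its own cp1252 character
right = raw.decode('utf-8')     # the two bytes are combined into one character

print([hex(ord(c)) for c in wrong[4:6]])   # two code points between the tags
print([hex(ord(c)) for c in right[4:5]])   # one code point between the tags

# Undo the wrong decode, then decode properly.
repaired = wrong.encode('cp1252').decode('utf-8')
```

This repair only works when every byte survives the wrong decode unchanged, which is not guaranteed for arbitrary input, so decoding correctly in the first place is the safer route.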
