Encoding issue of a character in utf-8

Question

I extract a link from a web page with the Beautiful Soup library, using a.get('href'). The link contains the character ®, but in the value I get back it has become Â®. How can I encode it properly? I have already added # -*- coding: utf-8 -*- at the top of my script.

r = requests.get(url)

soup = BeautifulSoup(r.text)

Answer 1

Do not use r.text; leave the decoding to BeautifulSoup:

soup = BeautifulSoup(r.content)

r.content gives you the response body as raw bytes, without decoding. r.text, on the other hand, is the response body already decoded to Unicode.

What happens is that the server did not include the character set in the response headers. In that case, requests follows HTTP RFC 2616, section 3.7.1: text/* responses are expected to use the ISO-8859-1 (Latin-1) character set by default.
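The header-based fallback described above can be sketched roughly as follows. This is a simplified illustration using only the standard library, not the actual code in requests; guess_encoding is a hypothetical helper name.

```python
from email.message import Message


def guess_encoding(content_type):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper: falls back to ISO-8859-1 for text/* types
    that carry no charset parameter, as RFC 2616 section 3.7.1 describes.
    """
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


print(guess_encoding("text/html; charset=UTF-8"))  # utf-8
print(guess_encoding("text/html"))                 # ISO-8859-1
print(guess_encoding("image/png"))                 # None
```

The second call is the situation in the question: a text/html response with no charset parameter ends up being treated as Latin-1.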

For your HTML page that default is wrong, so you got incorrect results: r.text decoded the UTF-8 bytes as Latin-1, resulting in mojibake:

>>> print u'®'.encode('utf8').decode('latin1')
Â®
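The same round trip in Python 3 syntax, step by step, might look like the sketch below. It also shows that text garbled this way can usually be repaired by reversing the wrong decode; the variable names are only illustrative.

```python
original = "®"

# UTF-8 encodes the single character as two bytes.
utf8_bytes = original.encode("utf-8")    # b'\xc2\xae'

# Decoding those two bytes as Latin-1 yields two characters: 'Â' and '®'.
garbled = utf8_bytes.decode("latin-1")   # 'Â®'

# Reversing the wrong decode recovers the original bytes, which can
# then be decoded with the correct codec.
repaired = garbled.encode("latin-1").decode("utf-8")   # '®'
```

Repairing after the fact works here because Latin-1 maps every byte to exactly one character, but passing the raw bytes to the parser avoids the problem in the first place.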

An HTML page can declare its own encoding, in the form of a <meta> tag in the HTML <head>. BeautifulSoup will use that tag and decode the bytes for you.
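To illustrate the idea of reading the declared encoding from the markup, here is a rough sketch using only the standard library. sniff_meta_charset is a hypothetical helper, not part of BeautifulSoup, and the regular expression is much cruder than what a real parser does.

```python
import re

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(html_bytes):
    """Return the charset declared in a <meta> tag, or None (hypothetical helper)."""
    # Only look near the start of the document, where <head> lives.
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii") if match else None


# Sample page, assumed for illustration only.
page = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><a href="/brand®">link</a></body></html>'
).encode("utf-8")

charset = sniff_meta_charset(page)
text = page.decode(charset or "latin-1")
print(charset)  # utf-8
```

Because the declaration itself is plain ASCII, it can be found in the raw bytes before the right codec is known.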

Even if the <meta> tag is missing, BeautifulSoup includes other methods to auto-detect encodings.
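One simple form of auto-detection is to try a few candidate codecs in order and keep the first that decodes cleanly. The sketch below shows only that idea; decode_with_fallback and its candidate list are assumptions for illustration, and BeautifulSoup's own detection is more elaborate.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate codec in turn; return (text, codec_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so this last resort never fails.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback("®".encode("utf-8"))
print(used)  # utf-8
```

Trying UTF-8 first is a reasonable default because bytes in a legacy 8-bit encoding rarely happen to form valid UTF-8.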

solution1
5 ACCEPTED 2014-07-16 21:05:53
