简体   繁体   中英

Encoding issue of a character in utf-8

I get a link from a web page by using beautiful soup library through a.get('href') . In the link there is a strange character ® but when I get it became ® . How can I encode it properly? I have already added at the beginning of the page # -*- coding: utf-8 -*-

r = requests.get(url)

soup = BeautifulSoup(r.text)

Do not use r.text ; leave decoding to BeautifulSoup :

soup = BeautifulSoup(r.content)

r.content gives you the response in bytes, without decoding. r.text on the other hand, is the response decoded to unicode .

What happens is that the server did not include the character-set in the response headers. At that moment, requests follows the HTTP RFC 2261, section 3.7.1 : text/ responses by default are expected to use the ISO-8859-1 (Latin 1) character set.

For your HTML page, that default is wrong, and you got incorrect results; r.text decoded the bytes as Latin-1, resulting in a Mojibake :

>>> print u'®'.encode('utf8').decode('latin1')
®

HTML can itself include the correct encoding in the HTML page itself , in the form of a <meta> tag in the HTML header. BeautifulSoup will use that header and decode the bytes for you.

Even if the <meta> header tag is missing, BeautifulSoup includes other methods toauto-detect encodings .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM