BeautifulSoup fails to parse an HTML page correctly with the html5lib parser, but works normally with html.parser. According to the docs, html5lib should be more lenient than html.parser, so why do I get garbled text when I use it to parse an HTML page?
The following is a small runnable example. (After replacing html5lib with html.parser, the Chinese output is normal.)
#_*_coding:utf-8_*_
import requests
from bs4 import BeautifulSoup
ss = requests.Session()
res = ss.get("http://tech.qq.com/a/20151225/050487.htm")
html = res.content.decode("GBK").encode("utf-8")
soup = BeautifulSoup(html, 'html5lib')
print str(soup)[0:800] # where you can see if the html is parsed normally or not
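Don't re-encode the content. The page declares its encoding as gb2312 in a meta tag, so once you convert the bytes to UTF-8 that declaration no longer matches them, and html5lib, which trusts the declaration, decodes the bytes with the wrong codec. Leave the decoding to BeautifulSoup by passing it the raw bytes: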
soup = BeautifulSoup(res.content, 'html5lib')
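To see why the mismatch produces garbage, here is a minimal sketch that uses only the standard library. The two-character sample string is a hypothetical stand-in for the page body, and the explicit decode only imitates a parser that trusts the declared charset; it is not html5lib's actual code path.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text standing in for the page body.
text = u"\u4e2d\u6587"

gbk_bytes = text.encode("gbk")     # roughly what the server sends
utf8_bytes = text.encode("utf-8")  # what the question's code passes on

# Decoding with the declared charset works for the original bytes...
round_trip = gbk_bytes.decode("gb2312")

# ...but misreads the re-encoded ones, because they are UTF-8, not GB2312.
garbled = utf8_bytes.decode("gb2312", "replace")

print(round_trip == text)
print(garbled == text)
```

The first comparison holds and the second does not, which is the same kind of garbling the question shows.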
If you do re-encode, you also need to replace the meta header that's present in the source, because it still declares the original encoding:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
or manually decode and pass in Unicode:
soup = BeautifulSoup(res.content.decode('gbk'), 'html5lib')
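If you really do want to re-encode, the meta replacement could look roughly like the sketch below. `reencode_html` and `CHARSET_RE` are hypothetical names, not part of BeautifulSoup or requests, and the regular expression is a rough assumption that only handles a plain `charset=...` declaration like the one on this page.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical pattern: matches "charset=<name>" with an optional opening quote.
CHARSET_RE = re.compile(br'(charset\s*=\s*["\']?)[\w-]+', re.IGNORECASE)


def reencode_html(raw, source_encoding="gbk", target_encoding="utf-8"):
    """Re-encode raw HTML bytes and make the declared charset match."""
    data = raw.decode(source_encoding).encode(target_encoding)
    new_name = target_encoding.encode("ascii")
    # Rewrite only the first declaration; a callable avoids escape handling
    # in the replacement string.
    return CHARSET_RE.sub(lambda m: m.group(1) + new_name, data, count=1)


# Small stand-in for res.content, encoded the way the server sends it.
sample = (u'<meta http-equiv="Content-Type" '
          u'content="text/html; charset=gb2312"><p>\u4e2d\u6587</p>').encode("gbk")
fixed = reencode_html(sample)
print(b"charset=utf-8" in fixed)
```

The returned bytes can then be passed to BeautifulSoup(fixed, 'html5lib'), since the declaration and the actual encoding now agree. Passing the raw bytes or the decoded Unicode, as shown above, is still simpler.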