
BeautifulSoup fails to parse HTML with `html5lib`

BeautifulSoup fails to parse an HTML page with the html5lib option, but works normally with html.parser. According to the docs, html5lib should be more lenient than html.parser, so why do I get garbled text when I use it to parse an HTML page?

The following is a small executable example. (After replacing html5lib with html.parser, the Chinese output is normal.)

#_*_coding:utf-8_*_
import requests
from bs4 import BeautifulSoup

ss = requests.Session()
res = ss.get("http://tech.qq.com/a/20151225/050487.htm")
html = res.content.decode("GBK").encode("utf-8")
soup = BeautifulSoup(html, 'html5lib')
print(str(soup)[0:800])  # shows whether the HTML was parsed correctly or not

Don't re-encode your content. Leave the decoding to BeautifulSoup:

soup = BeautifulSoup(res.content, 'html5lib')

If you are going to re-encode, you need to replace the meta header that's present in the source, because it still declares the original charset:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

or manually decode and pass in Unicode:

soup = BeautifulSoup(res.content.decode('gbk'), 'html5lib')
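
The mismatch can be reproduced without a network request. The following is a minimal standard-library sketch, assuming the parser trusts the charset declared in the meta tag. The sample page and all variable names are hypothetical stand-ins for res.content, not taken from the real site:

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-in for res.content: GBK-encoded bytes whose
# meta tag declares gb2312, as on the page in the question.
text = u"\u817e\u8baf\u79d1\u6280"
page = (u'<html><head><meta http-equiv="Content-Type" '
        u'content="text/html; charset=gb2312"></head>'
        u'<body><p>' + text + u'</p></body></html>')
raw = page.encode("gbk")

# What the question does: re-encode to UTF-8 while the meta tag
# still says gb2312.
recoded = raw.decode("gbk").encode("utf-8")

# Decoding the UTF-8 bytes with the declared charset garbles the text.
garbled = recoded.decode("gb2312", "replace")

# Option 1: leave the bytes alone, so the declaration matches the data.
from_raw = raw.decode("gb2312")

# Option 2: if re-encoding, rewrite the declaration to match.
fixed = re.sub(br"charset=gb2312", b"charset=utf-8", recoded, flags=re.I)

print(text in garbled)                # False
print(text in from_raw)               # True
print(text in fixed.decode("utf-8"))  # True
```

Passing an already decoded Unicode string, as in the last line of the answer, sidesteps the question entirely, since there are no bytes left for the parser to decode.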


 