BeautifulSoup fails to parse html with `html5lib`

Question

BeautifulSoup fails to parse an HTML page with the html5lib option, but works normally with html.parser. According to the docs, html5lib should be more lenient than html.parser, so why do I get garbled text when using it to parse an HTML page?

The following is a small runnable example. (After replacing html5lib with html.parser, the Chinese output is normal.)

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

ss = requests.Session()
res = ss.get("http://tech.qq.com/a/20151225/050487.htm")
html = res.content.decode("GBK").encode("utf-8")
soup = BeautifulSoup(html, 'html5lib')
print str(soup)[0:800]  # shows whether the HTML was parsed correctly

Answer 1

Don't re-encode your content. Leave the decoding to BeautifulSoup:

soup = BeautifulSoup(res.content, 'html5lib')

If you are going to re-encode, you need to replace the meta header that's present in the source, because html5lib reads the declared charset and would otherwise decode your UTF-8 bytes as gb2312:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

or manually decode and pass in Unicode:

soup = BeautifulSoup(res.content.decode('gbk'), 'html5lib')
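For the re-encoding route, the following is a minimal sketch of what "replace the meta header" could look like, using only the standard library. The sample page, the regular expression, and the helper name `reencode_to_utf8` are illustrative assumptions rather than part of BeautifulSoup or html5lib, and the pattern only rewrites the first `charset=` declaration it finds.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-in for res.content: GBK bytes that declare gb2312,
# mirroring the page in the question.
SAMPLE = (
    u'<html><head>'
    u'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    u'</head><body><p>\u817e\u8baf\u79d1\u6280</p></body></html>'
).encode("gbk")

# Assumed pattern: the charset value of the first meta declaration.
CHARSET_RE = re.compile(br'(charset\s*=\s*["\']?)[\w-]+', re.IGNORECASE)


def reencode_to_utf8(raw, source_encoding="gbk"):
    """Re-encode raw bytes as UTF-8 and make the declared charset agree."""
    utf8 = raw.decode(source_encoding).encode("utf-8")
    # Without this substitution the bytes are UTF-8 but still claim gb2312.
    return CHARSET_RE.sub(br'\g<1>utf-8', utf8, count=1)


fixed = reencode_to_utf8(SAMPLE)

# The mismatch from the question: UTF-8 bytes read back as GBK.
garbled = SAMPLE.decode("gbk").encode("utf-8").decode("gbk", "replace")

print(fixed.decode("utf-8"))
```

The `fixed` bytes now declare the encoding they actually use, so they could be handed to `BeautifulSoup(fixed, 'html5lib')` in place of the re-encoded `html` from the question. The `garbled` value illustrates the failure mode when the declaration and the bytes disagree.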

Answer 1: accepted, 2015-12-25 14:34:45
