I am trying to parse an HTML document with BeautifulSoup in Python, but it stops parsing at certain special characters, as in this example:
from bs4 import BeautifulSoup
doc = '''
<html>
<body>
<div>And I said «What the %&#@???»</div>
<div>some other text</div>
</body>
</html>'''
soup = BeautifulSoup(doc, 'html.parser')
print(soup)
This code should output the whole document. Instead, it prints only
<html>
<body>
<div>And I said «What the %</div></body></html>
The rest of the document is apparently lost. Parsing stops at the character combination '&#'.
The question is: how can I either configure BeautifulSoup or preprocess the document to avoid such problems while losing as little text (which may be informative) as possible?
I am using bs4 version 4.6.0 with Python 3.6.1 on Windows 10.
Update: calling soup.prettify() does not help, because the soup is already broken by that point.
You need to use "html5lib" as the parser instead of "html.parser" when you create the BeautifulSoup object. html5lib is a separate package, so install it first (pip install html5lib). For example:
from bs4 import BeautifulSoup
doc = '''
<html>
<body>
<div>And I said «What the %&#@???»</div>
<div>some other text</div>
</body>
</html>'''
soup = BeautifulSoup(doc, 'html5lib')
# note the different parser: 'html5lib' instead of 'html.parser'
Now, if you print soup, it shows the whole document:
>>> print(soup)
<html><head></head><body>
<div>And I said «What the %&#@???»</div>
<div>some other text</div>
</body></html>
From the "Differences between parsers" section of the Beautiful Soup documentation:
Unlike html5lib, html.parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.
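If switching parsers is not an option, the question also asks about preprocessing. Here is a minimal sketch of that idea, using only the standard library: escape every "&" that does not start a complete character reference, so the parser sees "&amp;#@" instead of "&#@". The helper name escape_stray_ampersands and the regular expression are my own assumptions for illustration, not part of BeautifulSoup, and I have not checked this against every parser version.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Matches an "&" that does NOT begin a complete character reference
# such as "&amp;", "&#171;" or "&#xAB;" (the trailing ";" is required).
_STRAY_AMP = re.compile(
    r'&(?!(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)'
)

def escape_stray_ampersands(markup):
    """Replace each stray "&" with "&amp;"; leave complete references alone."""
    return _STRAY_AMP.sub('&amp;', markup)

line = '<div>And I said «What the %&#@???»</div>'
print(escape_stray_ampersands(line))
# <div>And I said «What the %&amp;#@???»</div>
```

The cleaned string can then be passed to BeautifulSoup(..., 'html.parser') as before. One known limitation of this sketch: references written without the closing semicolon (for example "&copy") are treated as stray and get escaped too, so the text is kept but shown literally.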