
Missing special characters and tags while parsing HTML using BeautifulSoup

I am trying to parse an HTML document using BeautifulSoup with Python.

However, it stops parsing at certain special characters, as in this example:

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''
soup = BeautifulSoup(doc,  'html.parser')
print(soup)

This code should output the whole document. Instead, it prints only:

<html>
<body>
<div>And I said «What the %</div></body></html>

The rest of the document is apparently lost. Parsing stopped at the character combination '&#'.

The question is: how can I either configure BeautifulSoup or preprocess the document to avoid such problems, while losing as little text (which may be informative) as possible?

I use bs4 version 4.6.0 with Python 3.6.1 on Windows 10.

Update: calling soup.prettify() does not help, because the soup is already broken by that point.

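You need to pass "html5lib" as the parser instead of "html.parser" when creating your BeautifulSoup object. Note that html5lib is a separate package and must be installed first (pip install html5lib). For example: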

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''

soup = BeautifulSoup(doc,  'html5lib')
#          different parser  ^

Now, if you print soup, it displays the whole document, with the stray '&' escaped as '&amp;':

>>> print(soup)
<html><head></head><body>
        <div>And I said «What the %&amp;#@???»</div>
        <div>some other text</div>

</body></html>

From the "Differences between parsers" section of the Beautiful Soup documentation:

Unlike html5lib, html.parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.
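
If switching parsers is not an option, the other route mentioned in the question, preprocessing the document, might look roughly like the sketch below. It escapes every '&' that does not start a well-formed character reference, so that no text is dropped. The helper name escape_stray_ampersands is hypothetical (it is not part of bs4), and I have not checked this against every bs4 / html.parser version, so treat it as a starting point rather than a guaranteed fix.

```python
import re

# Matches "&" only when it does NOT start a well-formed character reference:
# named (&amp;), decimal (&#169;) or hexadecimal (&#xA9;), each ending in ";".
_STRAY_AMPERSAND = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)


def escape_stray_ampersands(markup):
    """Hypothetical helper: turn each stray "&" into "&amp;".

    Well-formed references such as "&lt;" or "&#169;" are left untouched.
    """
    return _STRAY_AMPERSAND.sub('&amp;', markup)


doc = '<div>And I said «What the %&#@???»</div>'
print(escape_stray_ampersands(doc))
# <div>And I said «What the %&amp;#@???»</div>
```

The cleaned string can then be passed to BeautifulSoup(cleaned, 'html.parser') as before. Two limitations to keep in mind: the pattern only checks the shape of a reference, not whether the named entity really exists, and legacy references written without a trailing semicolon (such as a bare "&amp") would be escaped a second time.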
