
Beautiful Soup failing to parse this HTML

We're using Beautiful Soup to parse many websites successfully, but a few are giving us problems. An example is this page:

http://www.designsponge.com/2013/04/biz-ladies-how-to-use-networking-to-improve-your-search-engine-rankings.html

We're feeding the exact page source to Beautiful Soup, but it returns a truncated HTML string and raises no errors.

Code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(site_html)  # no parser named, so Beautiful Soup picks one itself
print(str(soup.html))

Result:

<html class="no-js" lang="en"> <!--&lt;![endif]--> </html>

I'm trying to determine what's tripping it up, but nothing jumps out at me in the HTML source. Does anyone have any insight?

Try a different parser. The page parses fine with the html5lib parser:

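>>> from bs4 import BeautifulSoup
>>> # r is assumed to be the HTTP response for the page (e.g. from requests.get)
>>> soup = BeautifulSoup(r.content, 'html5lib')
>>> len(soup.find_all('li'))
97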

Not all parsers handle broken HTML the same way.
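
To see which parser copes with a given page, one option is to run the same markup through each installed parser and compare how many elements each one finds. The following is only a rough sketch: the helper name count_tags_by_parser and the sample markup are made up for illustration, and it assumes the optional lxml and html5lib packages may or may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parser names Beautiful Soup recognises; lxml and html5lib are optional installs.
PARSERS = ("html.parser", "lxml", "html5lib")


def count_tags_by_parser(markup, tag_name, parsers=PARSERS):
    """Return {parser: number of <tag_name> elements found}.

    Hypothetical helper for illustration; parsers that are not
    installed are skipped rather than treated as errors.
    """
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is unavailable.
            continue
        counts[parser] = len(soup.find_all(tag_name))
    return counts


# Made-up sample with unclosed <li> tags, standing in for the real page source.
sample = "<html><body><ul><li>one<li>two</ul></body></html>"
for parser, count in count_tags_by_parser(sample, "li").items():
    print(parser, count)
```

If one parser reports far fewer elements than the others on the real page, that parser is probably the one giving up on the broken markup.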
