HTML content is not correct when using the Python Beautiful Soup and requests packages

Question

The HTML content I get after parsing a webpage with JSoup differs from what I get with BeautifulSoup, as shown below. Has anyone run into the same issue, and if so, what did you do to fix it?

Check the third line in each block:

======= JSoup =======

<div class="col-full">
 <p><strong>Index Notifications</strong></p>
 <p></p><br>
<p> <br /> <b> March 28, 2014</b>
<br >
<br >

======= BeautifulSoup =======

<div class="col-full">
<p><strong>Index Notifications</strong></p>
<p><p> <br>
<b> March 28, 2014</b>
<br>
<br>

Answer 1

When parsing broken HTML, different parsers will try to repair the broken tags differently; there are no hard and fast rules on how to handle such errors.

BeautifulSoup can make use of different parsers, and each will handle your content differently:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.wisdomtree.com/etfs/index-notices.aspx'
>>> html = requests.get(url).content
>>> BeautifulSoup(html, 'html.parser').find('div', class_='col-full')
<div class="col-full">
<p><strong>Index Notifications</strong></p>
<p><p> <br>
<b> March 28, 2014</b>
<br> <br>
# ... cut ...
>>> BeautifulSoup(html, 'lxml').find('div', class_='col-full')
<div class="col-full">
<p><strong>Index Notifications</strong></p>
<p></p><p> <br/>
<b> March 28, 2014</b>
<br/> <br/>
# ... cut ...
>>> BeautifulSoup(html, 'html5lib').find('div', class_='col-full')
<div class="col-full">

            <p><strong>Index Notifications</strong></p>
            <p></p><p> <br/>
<b> March 28, 2014</b>
<br/>  <br/>
# ... cut ...

The html5lib parser is the slowest, but will generally parse broken HTML the way most browsers would. Both lxml and html5lib parsed this specific section of the document pretty much like JSoup did.
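If you want to compare the parsers without fetching the page, a minimal sketch along these lines may help. The `BROKEN_HTML` fragment and the `parse_with_each` helper are made up for illustration (they only imitate the unclosed `<p>` from the question), and the sketch assumes that `lxml` and `html5lib` may or may not be installed, so it skips any parser BeautifulSoup cannot find.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical fragment imitating the broken markup in the question:
# an unclosed <p> immediately followed by another <p>.
BROKEN_HTML = (
    '<div class="col-full">'
    "<p><strong>Index Notifications</strong></p>"
    "<p><p> <br><b> March 28, 2014</b><br>"
    "</div>"
)


def parse_with_each(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: repaired markup of the div} for each installed parser."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages; skip if missing.
            continue
        div = soup.find("div", class_="col-full")
        results[name] = str(div)
    return results


if __name__ == "__main__":
    for name, markup in parse_with_each(BROKEN_HTML).items():
        print(f"--- {name} ---")
        print(markup)
```

Whichever parser gives the tree you expect can then be passed explicitly as the second argument to `BeautifulSoup`, as in the session above, so the result does not depend on which parsers happen to be installed.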

Answer 1 (3, accepted, 2014-05-23 17:19:30)
