
Incorrect HTML content while using Python Beautiful Soup and the requests package

The HTML content I get after parsing a webpage with JSoup differs from what I get with BeautifulSoup, as seen below. Has anyone had the same issue, and can you please let me know what was done to fix it?

Check the third line in each block:

======= JSoup =======

<div class="col-full">
 <p><strong>Index Notifications</strong></p>
 <p></p><br>
<p> <br /> <b> March 28, 2014</b>
<br >
<br >

======= BeautifulSoup =======

<div class="col-full">
<p><strong>Index Notifications</strong></p>
<p><p> <br>
<b> March 28, 2014</b>
<br>
<br>

When parsing broken HTML, different parsers will try to repair the broken tags differently; there are no hard and fast rules on how to handle such errors.

BeautifulSoup can make use of different parsers, and each will handle your content differently:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.wisdomtree.com/etfs/index-notices.aspx'
>>> html = requests.get(url).content
>>> BeautifulSoup(html, 'html.parser').find('div', class_='col-full')
<div class="col-full">
<p><strong>Index Notifications</strong></p>
<p><p> <br>
<b> March 28, 2014</b>
<br> <br>
# ... cut ...
>>> BeautifulSoup(html, 'lxml').find('div', class_='col-full')
<div class="col-full">
<p><strong>Index Notifications</strong></p>
<p></p><p> <br/>
<b> March 28, 2014</b>
<br/> <br/>
# ... cut ...
>>> BeautifulSoup(html, 'html5lib').find('div', class_='col-full')
<div class="col-full">

            <p><strong>Index Notifications</strong></p>
            <p></p><p> <br/>
<b> March 28, 2014</b>
<br/>  <br/>
# ... cut ...

The html5lib parser is the slowest, but will generally parse broken HTML the same way most browsers would. Both lxml and html5lib parsed this specific section of the document pretty much like JSoup did.
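
If you want to compare the parsers without fetching the page, a small offline sketch along these lines may help. The `BROKEN_HTML` string and the `compare_parsers` helper are made up for illustration (the markup only imitates the nested `<p><p>` seen in the output above, it is not the real page source), and lxml and html5lib are treated as optional installs:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup imitating the problem area: a <p> opened inside an
# unclosed <p>, with no closing tags. This is NOT the actual page source.
BROKEN_HTML = '<div class="col-full"><p><p> <br><b> March 28, 2014</b><br></div>'


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each parser and return the repaired div.

    Maps each parser name to the serialized <div class="col-full">, or to
    None when that parser is not installed (or the div was not found).
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # lxml and html5lib are third-party packages and may be missing.
            results[name] = None
            continue
        div = soup.find("div", class_="col-full")
        results[name] = str(div) if div is not None else None
    return results


if __name__ == "__main__":
    for name, rendered in compare_parsers(BROKEN_HTML).items():
        print(name, "->", rendered if rendered is not None else "(unavailable)")
```

Once you know which parser gives the tree you expect, pass its name explicitly as the second argument to `BeautifulSoup` so the result does not depend on which parsers happen to be installed.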
