BeautifulSoup fails to parse the full HTML - is this because of dynamic HTML?

Question

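I am trying to scrape the table on this page.

In the browser debugger I can see that the table I want exists in the HTML. For example, you can see the peptide name:

I wrote the following code to extract the table: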

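import requests
from bs4 import BeautifulSoup

for i in range(1001, 1003):
    # Fetch the detail page for record i
    res = requests.get("https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=" + str(i))
    soup = BeautifulSoup(res.content, 'html.parser')
    # Collect every <table> element the parser recognizes
    table = soup.find_all('table')
    print(table)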

However, the output is:

[<table bgcolor="#DAD5BF" border="1" cellpadding="5" width="970"><tr><td align="center">\n\t      This page displays user query in tabular form.\n</td></tr>\n</table>, <table width="970px"><tr><td align="center"><br/><font color="black" size="5px">1001  details</font><br/></td></tr></table>]

Can someone explain why find_all fails to find all the tables (in particular the one I want), and how to fix this?

Answer 1

I'm not sure why it isn't showing up.

Since it is a table as well, I went ahead and used pandas' .read_html:

import pandas as pd

url = 'https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=antitb_1001'

tables = pd.read_html(url)
table = tables[-1]

Output:

print (table)
                           0                                                  1
0        Primary information                                                NaN
1                         ID                                        antitb_1001
2               Peptide Name                                          Polydim-I
3                   Sequence                             AVAGEKLWLLPHLLKMLLTPTP
4    N-terminal Modification                                               Free
5    C-terminal Modification                                               Free
6      Chemical Modification                                               None
7             Linear/ Cyclic                                             Linear
8                     Length                                                 22
9                  Chirality                                                  L
10                    Nature                                        Amphipathic
11                    Source                                            Natural
12                    Origin  Isolated from the venom of the Neotropical was...
13                   Species         Mycobacterium abscessus subsp. massiliense
14                    Strain  Mycobacterium abscessus subsp. massiliense iso...
15  Inhibition Concentartion                                   MIC = 60.8 μg/mL
16          In vitro/In vivo                                               Both
17                 Cell Line  Peritoneal macrophages, J774 macrophages cells...
18  Inhibition Concentartion   Treatment of infected macrophages with 7.6 μg...
19              Cytotoxicity  Non-cytotoxic, 10% cytotoxicity on J774 cells ...
20             In vivo Model   6 to 8 weeks old BALB/c and IFN-γKO (Knockout...
21               Lethal Dose  2 mg/kg/mLW shows 90% reduction in bacterial load
22           Immune Response                                                NaN
23       Mechanism of Action                               Cell wall disruption
24                    Target                                          Cell wall
25       Combination Therapy                                               None
26          Other Activities                                                NaN
27                 Pubmed ID                                           26930596
28       Year of Publication                                               2016
29             3-D Structure                 View in Jmol or Download Structure
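
To cover the range of records from the question, the URLs could be generated as in the sketch below. The helper name detail_url is hypothetical, and the "antitb_<number>" ID pattern is an assumption extrapolated from the single ID used in this answer; it has not been checked against other records.

```python
from urllib.parse import urlencode

BASE_URL = "https://webs.iiitd.edu.in/raghava/antitbpdb/display.php"


def detail_url(number):
    """Build the detail-page URL for one record number, e.g. 1001."""
    # Assumption: every record ID has the form "antitb_<number>",
    # extrapolated from the one ID shown in this answer.
    return BASE_URL + "?" + urlencode({"details": "antitb_%d" % number})


# Same range as the loop in the question: records 1001 and 1002.
urls = [detail_url(i) for i in range(1001, 1003)]
print(urls[0])
```

Each URL can then be passed to pd.read_html in the same way as the single URL above.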

Answer 2

FYI (in case you want to know the root cause of the problem): the target table has invalid markup:

<table class ="tab" cellpadding= "5" ... STYLE="border-spacing: 0px;border-style: line ;
 <tr bgcolor="#DAD5BF"></tr>

Note that the start tag is never closed: <table ... (it should be <table ...>). In addition, the ancestor is a <div>, but the closing tag is </p>.
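
A rough way to spot this kind of defect in a raw response is to look for a start tag that runs into the next "<" before any ">" appears. The following is only a heuristic sketch: the function name find_unclosed_start_tags is hypothetical, the sample string is an abbreviated stand-in for the markup quoted above, and attribute values that legitimately contain "<" or ">" would confuse it.

```python
import re

# A start tag whose attributes run into the next "<" before any ">" appears.
UNCLOSED_START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*(?=<)")


def find_unclosed_start_tags(html):
    """Return the names of start tags that are not terminated with '>'."""
    return [m.group(1).lower() for m in UNCLOSED_START_TAG.finditer(html)]


# Abbreviated stand-in for the quoted markup; the elided attributes are left out.
sample = (
    '<div>\n'
    '<table class ="tab" cellpadding= "5" STYLE="border-spacing: 0px;border-style: line ;\n'
    ' <tr bgcolor="#DAD5BF"></tr>\n'
    '</p>'
)

print(find_unclosed_start_tags(sample))  # ['table']
```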

This is why BeautifulSoup cannot recognize it as a table, and so soup.find_all('table') does not return it.

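However, modern browsers have built-in mechanisms to "repair" broken markup, so in the browser the table does not look "broken": a </div> is added to the ancestor div, and the p tag is converted into an empty node <p></p>.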
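
Because the browser debugger shows the repaired DOM rather than the bytes the server sent, it can help to compare the raw response with what the parser returned. The sketch below uses a hypothetical helper, diagnose, to separate the two usual causes: text that never arrives in the response (likely added by JavaScript) and text that arrives but is lost during parsing (likely broken markup). The return labels are made up for illustration.

```python
def diagnose(raw_html, parsed_tables, needle):
    """Classify why `needle` is missing from the parsed tables.

    raw_html      -- the response body as text, e.g. res.text
    parsed_tables -- each table the parser found, converted to a string
    needle        -- text that is visible in the browser, e.g. "Peptide Name"
    """
    if any(needle in table for table in parsed_tables):
        return "found"
    if needle in raw_html:
        # The server sent the text, so it was lost while parsing.
        return "markup"
    # The text never arrived in the response; it is probably added by JavaScript.
    return "dynamic"


# Tiny offline stand-ins for a response body and the tables a parser returned.
raw = '<table class="tab" <tr><td>Peptide Name</td></tr></p>'
parsed = ['<table><tr><td>This page displays user query in tabular form.</td></tr></table>']
print(diagnose(raw, parsed, "Peptide Name"))
```

With the code from the question, a call would look like diagnose(res.text, [str(t) for t in soup.find_all('table')], "Peptide Name"); a result of "markup" matches the explanation given in this answer.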
