Full HTML is not being parsed with BeautifulSoup - is this because of dynamic HTML?
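I am trying to scrape the table on this page.
In the browser's debugger I can see that the table I want is present in the HTML. For example, you can see the peptide name:
I wrote the following code to extract the table: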
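import requests
from bs4 import BeautifulSoup

for i in range(1001, 1003):
    # Fetch the details page for entry i
    res = requests.get("https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=" + str(i))
    soup = BeautifulSoup(res.content, 'html.parser')
    # Collect every <table> element the parser recognized
    table = soup.find_all('table')
    print(table)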
However, the output is:
[<table bgcolor="#DAD5BF" border="1" cellpadding="5" width="970"><tr><td align="center">\n\t This page displays user query in tabular form.\n</td></tr>\n</table>, <table width="970px"><tr><td align="center"><br/><font color="black" size="5px">1001 details</font><br/></td></tr></table>]
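Can someone explain why find_all fails to find all the tables (in particular the one I want), and how to fix this?
I'm not sure why it isn't showing up.
Since it is a table anyway, I went ahead and used pandas' .read_html: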
import pandas as pd
url = 'https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details=antitb_1001'
tables = pd.read_html(url)
table = tables[-1]
Output:
print(table)
    0                         1
0   Primary information       NaN
1   ID                        antitb_1001
2   Peptide Name              Polydim-I
3   Sequence                  AVAGEKLWLLPHLLKMLLTPTP
4   N-terminal Modification   Free
5   C-terminal Modification   Free
6   Chemical Modification     None
7   Linear/ Cyclic            Linear
8   Length                    22
9   Chirality                 L
10  Nature                    Amphipathic
11  Source                    Natural
12  Origin                    Isolated from the venom of the Neotropical was...
13  Species                   Mycobacterium abscessus subsp. massiliense
14  Strain                    Mycobacterium abscessus subsp. massiliense iso...
15  Inhibition Concentartion  MIC = 60.8 μg/mL
16  In vitro/In vivo          Both
17  Cell Line                 Peritoneal macrophages, J774 macrophages cells...
18  Inhibition Concentartion  Treatment of infected macrophages with 7.6 μg...
19  Cytotoxicity              Non-cytotoxic, 10% cytotoxicity on J774 cells ...
20  In vivo Model             6 to 8 weeks old BALB/c and IFN-γKO (Knockout...
21  Lethal Dose               2 mg/kg/mLW shows 90% reduction in bacterial load
22  Immune Response           NaN
23  Mechanism of Action       Cell wall disruption
24  Target                    Cell wall
25  Combination Therapy       None
26  Other Activities          NaN
27  Pubmed ID                 26930596
28  Year of Publication       2016
29  3-D Structure             View in Jmol or Download Structure
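The question looped over several entry numbers, so the read_html approach can be wrapped in a small helper. This is only a sketch under a few assumptions: the names `BASE_URL`, `details_url` and `fetch_details` are made up for illustration, other entries are assumed to follow the same `antitb_<number>` URL pattern as `antitb_1001`, and the details table is assumed to be the last table on every page.

```python
BASE_URL = "https://webs.iiitd.edu.in/raghava/antitbpdb/display.php?details="


def details_url(number):
    """Build the details URL for one entry, e.g. 1001 -> ...details=antitb_1001.

    Assumption: every entry uses the same 'antitb_<number>' ID format.
    """
    return f"{BASE_URL}antitb_{number}"


def fetch_details(number):
    """Return the last table on the entry's page as a DataFrame.

    Assumption: the details table is always the last table on the page.
    """
    # pandas is imported here so that details_url() works without it installed.
    import pandas as pd

    tables = pd.read_html(details_url(number))
    return tables[-1]


# Example usage (needs network access and pandas):
# for number in range(1001, 1003):
#     print(fetch_details(number))
```

If a page has no parsable table, `pd.read_html` raises a `ValueError`, so a real loop would probably want to catch that per entry.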
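FYI, in case you want to know the root cause: the target table has invalid markup:
<table class ="tab" cellpadding= "5" ... STYLE="border-spacing: 0px;border-style: line ;
<tr bgcolor="#DAD5BF"></tr>
Note that the start tag is never closed: <table ... (it should be <table ...>). Also, the ancestor is a <div> while the closing tag is </p>.
This is why BeautifulSoup does not recognize it as a table, and therefore soup.find_all('table') does not return it.
However, modern browsers have built-in mechanisms to "fix" broken markup, so in the browser the table does not look broken: a </div> is added to the ancestor div, and the p tag is converted to an empty node <p></p>.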
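One rough way to spot this kind of problem is to compare how often `<table` appears in the raw text with how many table start tags the parser actually reports. The sketch below uses Python's standard `html.parser` module (the parser behind BeautifulSoup's 'html.parser' option); the names `TagCounter` and `raw_vs_parsed` are hypothetical. A mismatch is only a hint, since how malformed input is handled differs between parsers and Python versions.

```python
import re
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count the start tags of one kind that the parser reports."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.seen = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == self.wanted:
            self.seen += 1


def raw_vs_parsed(html, tag="table"):
    """Return (occurrences of '<tag' in the raw text, start tags parsed)."""
    raw = len(re.findall(rf"<{re.escape(tag)}\b", html, flags=re.IGNORECASE))
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return raw, counter.seen


# Example usage (res is the requests response from the question's code):
# raw, parsed = raw_vs_parsed(res.text)
# if raw != parsed:
#     print("Some <table> tags in the source were not parsed as tags")
```

For well-formed HTML both numbers agree; when the raw count is higher, inspecting the page source around the missing tag is a reasonable next step.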