Using pandas to scrape HTML: can it be used to scrape tables from a web page?

Question

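I have done some basic web scraping with BeautifulSoup and urllib. However, I recently came across this link, which says that all you need to do to scrape a page like this one is run: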

import pandas as pd
tables = pd.read_html("https://apps.sandiego.gov/sdfiredispatch/")
print(tables[0])

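I thought this was too good to be true, since I so often find myself fighting with BeautifulSoup and urllib2.

I tried to pull the table out of this page:

url = "http://crdd.osdd.net/raghava/ahtpdb/display.php?details=1001"
tables = pd.read_html(url)
print(tables[0])

The output I get is: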

                              0
0  Detailed description of 1001 ID

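I have also been trying a few other approaches, such as:

import requests

url = "http://crdd.osdd.net/raghava/ahtpdb/display.php?details=1001"
response = requests.get(url)
print(response.content)

or something like this:

import urllib2  # Python 2 only; on Python 3 use urllib.request.urlopen
from bs4 import BeautifulSoup

web_page = 'http://crdd.osdd.net/raghava/ahtpdb/display.php?details=1001'
page = urllib2.urlopen(web_page)
soup = BeautifulSoup(page, 'html.parser')
print(soup.get_text())

I know there are plenty of examples here of web scraping with various methods. As you can see, I have been following those examples; I just cannot get any of them to work for my particular problem. I would be grateful if someone could show me how to adapt these snippets to what I need.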

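Edit 1: As a side note, I tried the same code on another web page, https://dbaasp.org/peptide-card?id=3, but I suspect that one is even more complicated.

Edit 2: Following Rafi's suggestion, something unusual is going on. I have attached the web page I am trying to scrape, along with its URL; Rafi, you can see that my URL is slightly different from the one you used. Then, when I run your suggestion on my URL:

url = "http://crdd.osdd.net/raghava/ahtpdb/srcbr.php?details=1001"
table = pd.read_html(url)
print(table[0])
print(table[1])
print(table[2])
print(table[3])
print(table[4])
print(table[5])

The output I get looks like this (truncated):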

                                                  0   \
0   Browse SOURCE in AHTPDB This page gives statis...
1                            Browse SOURCE in AHTPDB
2  This page gives statistics of SOURCE fields an...
3  Following table enlists the number of entries ...
4  Following table enlists the number of entries ...
5                                               Milk
6                                                834
7  google.load("visualization", "1", {packages:["...

                                                  1   \
0                            Browse SOURCE in AHTPDB
1                                                NaN
2                                                NaN
3  Following table enlists the number of entries ...
4                                                NaN
5                                             Casein
6                                                723
7                                                NaN

                                                  2   \
0  This page gives statistics of SOURCE fields an...
1                                                NaN
2                                                NaN
3                                               Milk
4                                                NaN
5                                             Bovine
6                                                477
7                                                NaN

                                                  3   \
0  Following table enlists the number of entries ...
1                                                NaN
2                                                NaN
3                                             Casein
4                                                NaN
5                                            Cereals
6                                                419
7                                                NaN

                                                  4        5       6   \
0  Following table enlists the number of entries ...     Milk  Casein
1                                                NaN      NaN     NaN
2                                                NaN      NaN     NaN
3                                             Bovine  Cereals    Fish
4                                                NaN      NaN     NaN
5                                               Fish     Pork   Human
6                                                384      333     215
7                                                NaN      NaN     NaN

        7        8        9   \
0   Bovine  Cereals     Fish
1      NaN      NaN      NaN
2      NaN      NaN      NaN
3     Pork    Human  Chicken
4      NaN      NaN      NaN
5  Chicken  Soybean      Egg
6      177      159       97
7      NaN      NaN      NaN

                         ...                             16     17     18  \
0                        ...                          723.0  477.0  419.0
1                        ...                            NaN    NaN    NaN
2                        ...                            NaN    NaN    NaN
3                        ...                          384.0  333.0  215.0
4                        ...                            NaN    NaN    NaN
5                        ...                            NaN    NaN    NaN
6                        ...                            NaN    NaN    NaN
7                        ...                            NaN    NaN    NaN

      19     20     21     22     23    24  \
0  384.0  333.0  215.0  177.0  159.0  97.0
1    NaN    NaN    NaN    NaN    NaN   NaN
2    NaN    NaN    NaN    NaN    NaN   NaN
3  177.0  159.0   97.0    NaN    NaN   NaN
4    NaN    NaN    NaN    NaN    NaN   NaN
5    NaN    NaN    NaN    NaN    NaN   NaN
6    NaN    NaN    NaN    NaN    NaN   NaN
7    NaN    NaN    NaN    NaN    NaN   NaN

                                                  25
0  google.load("visualization", "1", {packages:["...
1                                                NaN
2                                                NaN
3                                                NaN
4                                                NaN
5                                                NaN
6                                                NaN
7                                                NaN

[8 rows x 26 columns]
                         0
0  Browse SOURCE in AHTPDB
                                                   0
0  This page gives statistics of SOURCE fields an...
                                                  0   \
0  Following table enlists the number of entries ...
1  Following table enlists the number of entries ...
2                                               Milk
3                                                834
4  google.load("visualization", "1", {packages:["...

                                                  1       2        3       4   \
0  Following table enlists the number of entries ...    Milk   Casein  Bovine
1                                                NaN     NaN      NaN     NaN
2                                             Casein  Bovine  Cereals    Fish
3                                                723     477      419     384
4                                                NaN     NaN      NaN     NaN

        5      6        7        8        9   ...      12     13     14  \
0  Cereals   Fish     Pork    Human  Chicken  ...   834.0  723.0  477.0
1      NaN    NaN      NaN      NaN      NaN  ...     NaN    NaN    NaN
2     Pork  Human  Chicken  Soybean      Egg  ...     NaN    NaN    NaN
3      333    215      177      159       97  ...     NaN    NaN    NaN
4      NaN    NaN      NaN      NaN      NaN  ...     NaN    NaN    NaN

      15     16     17     18     19     20    21
0  419.0  384.0  333.0  215.0  177.0  159.0  97.0
1    NaN    NaN    NaN    NaN    NaN    NaN   NaN
2    NaN    NaN    NaN    NaN    NaN    NaN   NaN
3    NaN    NaN    NaN    NaN    NaN    NaN   NaN

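I don't understand how this resembles the screenshot I showed. Is it the 'details=1001' part that stops this method from working, because the URL does not end in a plain .php page?

Edit 3: This works:

import urllib  # Python 2; on Python 3 use urllib.request.urlopen
from bs4 import BeautifulSoup

url = 'http://crdd.osdd.net/raghava/ahtpdb/display.php?details=1001'
html = urllib.urlopen(url).read()
bs = BeautifulSoup(html, 'lxml')
# Note: tab is located here but not used below; the rows are
# collected from the whole document, not just from this table.
tab = bs.find("table",{"class":"tab"})
data = []
rows = bs.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    # Drop empty cells before storing the row
    data.append([ele for ele in cols if ele])

print(data)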

Answer 1

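You should experiment a little with the table index. For example, I looked at the site you gave and found a table there (url). I then ran the same piece of code you tried, with one small change:

import pandas as pd

url = "http://crdd.osdd.net/raghava/ahtpdb/srcbr.php"
tables = pd.read_html(url)
print(tables[4])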

The table I got looks fine (it comes with a header row, which is easy to remove later).
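As a rough illustration of handling that header row, the sketch below reads a small made-up table twice, once with the defaults and once with read_html's header parameter. The inline HTML is a hypothetical stand-in, not the real AHTPDB markup, and the sketch assumes pandas is installed together with an HTML parser such as lxml.

```python
import io

import pandas as pd

# Hypothetical table whose first row is a title row written with plain <td> cells.
SAMPLE_HTML = """
<table>
  <tr><td>Source</td><td>Entries</td></tr>
  <tr><td>Milk</td><td>834</td></tr>
  <tr><td>Casein</td><td>723</td></tr>
</table>
"""

# Default: the title row is kept as an ordinary data row.
raw = pd.read_html(io.StringIO(SAMPLE_HTML))[0]

# header=0 tells read_html to use the first row as the column names instead.
clean = pd.read_html(io.StringIO(SAMPLE_HTML), header=0)[0]

print(raw.shape)
print(clean.columns.tolist())
```

Whether header=0 is the right choice depends on the page; if the title row is not the first row, slicing the DataFrame afterwards may be simpler.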

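The reason is that in the sample code you copied, the page has only one table (or several tables, of which the one they need happens to be the first). That is why tables[0] gives them the table they want. In the case I show here, the site uses tables for layout, and the first table is not the one you are after. Here it is the fifth one, which is why tables[4] works in this case.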
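To make this concrete, here is a minimal sketch that lists every table read_html finds before picking one. The inline HTML is a made-up stand-in for a page that has one layout table and one data table (it is not the real AHTPDB markup), and the position of the data table is an assumption of the example. It also assumes pandas is installed together with an HTML parser such as lxml.

```python
import io

import pandas as pd

# Hypothetical page: the first table is layout only, the second holds the data.
SAMPLE_HTML = """
<html><body>
<table><tr><td>Browse SOURCE in AHTPDB</td></tr></table>
<table>
  <tr><th>Source</th><th>Entries</th></tr>
  <tr><td>Milk</td><td>834</td></tr>
  <tr><td>Casein</td><td>723</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> element it can parse.
tables = pd.read_html(io.StringIO(SAMPLE_HTML))

# Print the index and shape of each table to see which one is the data table.
for i, df in enumerate(tables):
    print(i, df.shape)

data = tables[1]  # in this sample the data table is the second one, not tables[0]

# Alternative: keep only the tables whose text matches a string or regex.
matched = pd.read_html(io.StringIO(SAMPLE_HTML), match="Milk")
```

On a real page the right index has to be found by inspection, as above; the match argument can help when the target table contains a distinctive word.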

Note: you may want to save it to CSV so that it is easier to read:

url = "http://crdd.osdd.net/raghava/ahtpdb/srcbr.php"
tables = pd.read_html(url)
tables[4].to_csv("path/to/file.csv")

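Following your update, try this:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://crdd.osdd.net/raghava/ahtpdb/display.php?details=1001'
html = urllib.request.urlopen(url).read()
# Name the parser explicitly to avoid bs4's "no parser specified" warning
bs = BeautifulSoup(html, 'html.parser')
# Pick the table by its CSS class rather than by position
tab = bs.find("table",{"class":"tab"})
print(tab)

You will need to clean it up, but all of the table's data should be available there.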
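One possible shape for that clean-up step is sketched below. The class name "tab" comes from the snippet above; the markup and cell values are invented placeholders, since the real page content is not shown here. It assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name "tab" is taken from the thread.
SAMPLE_HTML = """
<table class="tab">
  <tr><td>Field</td><td>Value</td></tr>
  <tr><td> ID </td><td>1001</td></tr>
  <tr><td>Source</td><td></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tab = soup.find("table", {"class": "tab"})

rows = []
for tr in tab.find_all("tr"):  # search inside the matched table only
    # get_text(strip=True) removes surrounding whitespace from each cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(cells)  # keep empty cells so the columns stay aligned

print(rows)
```

Unlike the loop in Edit 3, this version walks only the rows of the matched table and keeps empty cells. That is a design choice rather than a requirement; dropping empty cells gives shorter rows but can shift values into the wrong column.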

Solution 1
2, accepted, 2018-10-17 11:36:11