Scraping data from an HTML table with different classes

Question

I'm trying to build a web scraper to create a COVID-19 dataset for my data visualization project. I need this table from https://www.worldometers.info/coronavirus/

import requests
from bs4 import BeautifulSoup

url = "https://www.worldometers.info/coronavirus/"
page = requests.get(url,verify=True)

soup = BeautifulSoup(page.content,features="lxml")

rows = soup.select("tr")


for data in rows:
    print(data.text)

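I get the output I want, but each row (country) also shows the continent name, which I don't want in my dataset. Is there a way to fix this? I'm new to web scraping, so I'd appreciate all the help I can get.

Update: here is the HTML. The last td, the one specifying "Europe", is not needed in the dataset.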

<tr style="" role="row" class="odd">
<td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/uk/">UK</a></td>
<td style="font-weight: bold; text-align:right" class="sorting_1">211,364</td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right;">31,241 </td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right">N/A</td>
<td style="text-align:right;font-weight:bold;">179,779</td>
<td style="font-weight: bold; text-align:right">1,559</td>
<td style="font-weight: bold; text-align:right">3,114</td>
<td style="font-weight: bold; text-align:right">460</td>
<td style="font-weight: bold; text-align:right">1,631,561</td>
<td style="font-weight: bold; text-align:right">24,034</td>
<td style="display:none" data-continent="Europe">Europe</td>
</tr>

Answer 1

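Try the code below. The key BeautifulSoup functions are find and findAll. Read the full documentation and examples; you should manage to collect what you want.

Edit: the continent has a "data-continent" attribute, so you should loop over the rows that do not have this attribute. Note that the same remark applies to the "World" row, so I ignored it "manually". Here is the modified code: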

import requests
from bs4 import BeautifulSoup

url = "https://www.worldometers.info/coronavirus/"
page = requests.get(url,verify=True)
soup = BeautifulSoup(page.content,features="lxml")

# find the table with id: 'main_table_countries_today'
table = soup.find('table', {'id': 'main_table_countries_today'})
body = table.find('tbody')

# looping through all rows, without 'data-continent' attribute :
for row in body.findAll('tr', {'data-continent': None}):
    print('\nParsing a new line:')
    values = row.findAll('td')
    # looping through all cells inside the row, ignoring the 'World' one:
    if values[0].text != 'World':
        for val in values:
            print(val.text)

The result is:

Parsing a new line:

Parsing a new line:
USA
1,322,223
+438
78,622 
+7
223,749
1,019,852
16,978
3,995
238
8,638,846
26,099
North America

Parsing a new line:
Spain
262,783
+2,666
26,478 
+179
173,157
63,148
1,741
5,620
566
1,932,455
41,332
Europe

Parsing a new line:
Italy
217,185

30,201 

99,023
87,961
1,168
3,592
500
2,445,063
40,440
Europe
[...]
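
In the output above, the continent still appears as the last value of each row. In the HTML shown in the question, the data-continent attribute sits on the last td rather than on the tr, so one option is to filter cells instead of rows. A minimal sketch, assuming the markup matches the question's sample (shortened here); visible_values is a hypothetical helper name, and html.parser is used only to avoid depending on lxml:

```python
from bs4 import BeautifulSoup

# Sample row shortened from the question; the live page may differ.
SAMPLE = """
<table id="main_table_countries_today"><tbody>
<tr>
<td><a class="mt_a" href="country/uk/">UK</a></td>
<td>211,364</td>
<td></td>
<td>31,241 </td>
<td style="display:none" data-continent="Europe">Europe</td>
</tr>
</tbody></table>
"""


def visible_values(row):
    """Return the stripped text of every cell that has no data-continent attribute."""
    return [
        td.get_text(strip=True)
        for td in row.find_all("td")
        if not td.has_attr("data-continent")
    ]


soup = BeautifulSoup(SAMPLE, "html.parser")
table = soup.find("table", {"id": "main_table_countries_today"})
for row in table.find("tbody").find_all("tr"):
    print(visible_values(row))
```

For the sample row this prints ['UK', '211,364', '', '31,241']; the empty string is the blank "new cases" cell, which is kept so the columns stay aligned.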

Answer 2

Your code gets all tr tags regardless of their position. You need to specify the table. You are interested in the data in the first table body.

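import requests
from bs4 import BeautifulSoup

URL = "https://www.worldometers.info/coronavirus/"
response = requests.get(URL)

soup = BeautifulSoup(response.text, 'html.parser')
tbody = soup.find('tbody')  # select the first tbody in the page
rows = tbody.find_all('tr')

for row in rows:
    print(row.text)

Hope this helps.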
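
Taking the first tbody works only while the table of interest comes first in the page. As a rough sketch of an alternative, the rows can be scoped by the table id used in Answer 1, the hidden cell dropped with a CSS selector, and the result written out as CSV. The HTML below is a made-up miniature of the page (the "other" table is hypothetical, the figures are taken from this thread), so treat it as an illustration rather than a tested scraper for the live site:

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical miniature page: an unrelated table first, then the one we want.
HTML = """
<table id="other"><tbody><tr><td>ignore me</td></tr></tbody></table>
<table id="main_table_countries_today"><tbody>
<tr><td>UK</td><td>211,364</td><td data-continent="Europe">Europe</td></tr>
<tr><td>Italy</td><td>217,185</td><td data-continent="Europe">Europe</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Scope the search to one table by id, then keep only cells without data-continent.
records = [
    [td.get_text(strip=True) for td in tr.select("td:not([data-continent])")]
    for tr in soup.select("table#main_table_countries_today tbody tr")
]

# Write the rows as CSV text; to save a file instead, replace the buffer
# with open("covid.csv", "w", newline="").
buffer = io.StringIO()
csv.writer(buffer).writerows(records)
print(buffer.getvalue())
```

The csv module quotes values such as 211,364 automatically, so the thousands separators do not break the columns.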

Answer 3

Another solution.

from simplified_scrapy import SimplifiedDoc,utils
html = '''
<tr style="" role="row" class="odd">
<td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/uk/">UK</a></td>
<td style="font-weight: bold; text-align:right" class="sorting_1">211,364</td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right;">31,241 </td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right">N/A</td>
<td style="text-align:right;font-weight:bold;">179,779</td>
<td style="font-weight: bold; text-align:right">1,559</td>
<td style="font-weight: bold; text-align:right">3,114</td>
<td style="font-weight: bold; text-align:right">460</td>
<td style="font-weight: bold; text-align:right">1,631,561</td>
<td style="font-weight: bold; text-align:right">24,034</td>
<td style="display:none" data-continent="Europe">Europe</td>
</tr>
'''
doc = SimplifiedDoc(html)
rows = doc.selects('tr').selects('td')
for data in rows:
  print(data.notContains('display:none',attr="style").text)

Result:

['UK', '211,364', '', '31,241', '', 'N/A', '179,779', '1,559', '3,114', '460', '1,631,561', '24,034']

More examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
