Unable to scrape table data from a website using BeautifulSoup

Question

I am following an online tutorial ( https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/ ) on scraping HTML tables. Following the tutorial, I was able to scrape the table data, but when I try to scrape data from this site ( https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11 ) I am unable to do so.

I previously tried Scrapy and got the same result.

Here is the code I used.

import urllib.request

from bs4 import BeautifulSoup

wiki = "https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "lxml")

all_tables = soup.find_all('table')

right_table = soup.find('table', class_='zebra-body-only')
print(right_table)

This is what I get when I run this code in the terminal:

<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
</tbody>
</table>

However, when I inspect the Mass Lottery website in Google Chrome, this is what I see:

<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
<tr class="odd">
<th>Draw #</th>
<th>Draw Date</th>
<th>Winning Number</th>
<th>Bonus</th>
</tr>
<tr><td>2107238</td>
<td>03/04/2019</td>
<td>01-04-05-16-23-24-27-32-34-41-42-44-47-49-52-55-63-65-67-78</td><td>No Bonus</td>
</tr>
<tr class="odd">
<td>2107239</td>
<td>03/04/2019</td>
<td>04-05-11-15-19-20-23-24-25-28-41-45-52-63-64-68-71-72-73-76</td><td>4x</td>
</tr> 
....(And so on)

I would like to be able to extract the data from this table.

Answer 1

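This happens because the site makes a separate call to load the results. The initial link loads only the page, not the results. Inspect the network requests with Chrome DevTools and you will be able to find the request you need to replicate to get the results.

This means that to get the results you only need to call that request; you do not have to load the web page at all.

Fortunately, the endpoint you have to call already returns nicely formatted JSON.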

GET https://www.masslottery.com/data/json/search/dailygames/history/15/201903.json?_=1555083561238
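
As a rough sketch of calling that endpoint directly from Python: the code below assumes the path follows the pattern .../history/<game_id>/<YYYYMM>.json (inferred only from the single URL above) and that `_` is a cache-busting millisecond timestamp. The names `build_history_url` and `fetch_history` are made up for illustration, and the structure of the returned JSON is not shown in this thread, so the sketch only decodes it.

```python
import json
import time
import urllib.request

BASE = "https://www.masslottery.com/data/json/search/dailygames/history"


def build_history_url(game_id, year_month, stamp_ms=None):
    """Build the results URL, e.g. game_id=15, year_month="201903".

    The path layout is an assumption inferred from one observed request.
    """
    if stamp_ms is None:
        # Assumed to be a cache-busting value: current time in milliseconds.
        stamp_ms = int(time.time() * 1000)
    return f"{BASE}/{game_id}/{year_month}.json?_={stamp_ms}"


def fetch_history(game_id, year_month):
    """Download and decode the JSON document (performs a network request)."""
    url = build_history_url(game_id, year_month)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Reproduces the request URL quoted above.
print(build_history_url(15, "201903", stamp_ms=1555083561238))
```

Calling `fetch_history(15, "201903")` would then return the decoded JSON, which you can inspect to find where the draw records live.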

I assume 1555083561238 is a timestamp.
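
One quick way to check that guess is to read the value as milliseconds since the Unix epoch. A small sketch, where the millisecond interpretation is itself an assumption:

```python
from datetime import datetime, timezone

stamp_ms = 1555083561238

# Split into whole seconds and leftover milliseconds to avoid float rounding.
seconds, millis = divmod(stamp_ms, 1000)
as_utc = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
    microsecond=millis * 1000
)

print(as_utc.isoformat())  # 2019-04-12T15:39:21.238000+00:00
```

The value decodes to 12 April 2019, a plausible capture time for the request, which suggests the parameter is a cache-busting timestamp rather than part of the query itself.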

Answer 2

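The page is dynamic, so it is rendered after you make the request. You can either a) use JC1's solution and access the JSON response, or b) use Selenium to simulate opening a browser, render the page, and then scrape the table: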

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11'

# Open a real browser so the page's JavaScript runs before the HTML is read
driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, "lxml")

all_tables = soup.find_all('table')

right_table = soup.find('table', class_='zebra-body-only')

As a side note: usually when I see <table> tags, I let pandas do the work for me (note that I can't access the site, so I could not test these):

from io import StringIO

import pandas as pd
from selenium import webdriver

url = 'https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11'

driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
driver.quit()

# read_html returns a list of DataFrames, one per <table> it finds;
# wrapping the string in StringIO avoids the deprecation warning in newer pandas
tables = pd.read_html(StringIO(page))

# choose the DataFrame you want from the list by its position
df = tables[0]

Answer 3

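Yes, I would save the data you get to a file and check whether what you are looking for is really in there:

with open('stuff.html', 'w') as f:
    f.write(response.text)

For Unicode, try:

import codecs
with codecs.open(fp, 'w', 'utf-8') as f:
    f.write(response.text)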

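If you do not see what you are looking for in there, you will have to work out the correct URL to load by checking the Chrome developer tools. This is usually hard.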
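
The save-and-inspect check described above can be sketched as follows. `save_and_check` is a hypothetical helper, and the HTML string stands in for `response.text`, using the empty table that the question's script actually received:

```python
import tempfile
from pathlib import Path


def save_and_check(html, marker, path):
    """Write the downloaded HTML to disk and report whether marker occurs in it."""
    Path(path).write_text(html, encoding="utf-8")
    return marker in html


# The table exactly as the question's script received it: an empty <tbody>.
static_html = (
    '<table cellspacing="0" class="zebra-body-only">\n'
    '<tbody id="target-area">\n'
    '</tbody>\n'
    '</table>'
)

out_path = Path(tempfile.gettempdir()) / "stuff.html"

# 2107238 is the first draw number visible in the browser.
found = save_and_check(static_html, "2107238", out_path)
print(found)  # False
```

A `False` here means the rows are not in the downloaded HTML at all, so no amount of parsing will recover them; they have to come from another request or from a rendered page.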

The easy way is to use Selenium. Make sure you wait until what you are looking for appears on the page (it is dynamic).
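
A minimal sketch of that waiting step, using Selenium's explicit waits. The selector assumes the loaded rows appear as <td> cells inside the tbody with id "target-area", as in the question's browser view; `fetch_rendered_html` and the 10-second timeout are illustrative choices, not something from the site:

```python
ROW_CELLS_CSS = "tbody#target-area tr td"  # assumed selector for loaded rows


def fetch_rendered_html(url, timeout=10):
    """Open url in Chrome and return the page source once data rows exist."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one <td> appears inside the results body;
        # raises TimeoutException if nothing shows up within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ROW_CELLS_CSS))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout
```

The returned string can then be handed to BeautifulSoup or pandas in the same way as `page` in Answer 2.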

3 answers

Answer 1
1 · Accepted · 2019-04-12 15:41:49

Answer 2
0 · 2019-04-12 15:45:50

Answer 3
0 · 2019-04-13 02:48:59
