Scraping Data from Website obscuring Data python
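I am trying to scrape individual batting data from a number of URLs. Here is one example ( https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020 )
The site seems to hide the data, or at least I cannot get at it with the following code: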
driver = webdriver.Chrome('/Users/gru/Documents/chromedriver')
driver.get('https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020')
html_page = driver.page_source
time.sleep(15)
soup = BeautifulSoup(html_page, 'lxml')
for j in soup.find_all('tr'):
    drounders=[]
    for h in j.find_all('td'):
        drounders.append(h.get_text())
    print(drounders)
These are the first few rows I expect:
   | Game Date  | Bat Team | Fld Team | Pitcher          | Result    | EV (MPH) | LA (°) | Dist (ft) | Direction    | Pitch (MPH) | Pitch Type
1  | 2020-08-12 |          |          | Carrasco, Carlos | strikeout |          |        |           |              |             |
2  | 2020-08-12 |          |          | Carrasco, Carlos | strikeout |          |        |           |              |             |
3  | 2020-08-12 |          |          | Carrasco, Carlos | force_out |          |        |           | Opposite     |             |
4  | 2020-08-11 |          |          | Allen, Logan     | force_out | 107.8    | -25    | 5         | Pull         | 94.0        | 4-Seam Fastball
5  | 2020-08-11 |          |          | Allen, Logan     | strikeout |          |        |           |              | 77.3        | Curveball
6  | 2020-08-11 |          |          | Hill, Cam        | sac_fly   | 100.5    | 42     | 345       | Straightaway | 91.6        | 4-Seam Fastball
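The only problem I see here is the Bat Team column, because it contains images instead of text. In this answer I scrape the link of the image in the Bat Team column and append it at the last position of each row. If you want to ignore it, remove img from the for loop.
Code: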
from selenium import webdriver
from bs4 import BeautifulSoup
import time

site = 'https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020'
finalData = []

# Here I am using Chrome's web driver
# (recent Selenium 4 releases take a Service object instead of executable_path)
driver = webdriver.Chrome(executable_path = 'chromedriver.exe')
# For the Firefox web driver
#driver = webdriver.Firefox(executable_path = 'geckodriver.exe')
driver.get(site)
time.sleep(10)  # give the page time to load the table

# Read page_source only after the wait, so the loaded rows are included
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find("div", id = "gamelogs_statcast")
trs = table.find_all("tr")

for trValue in trs:
    txt = str(trValue.text)
    img = str(trValue.find("img"))  # first <img> in the row; 'None' for the header row
    data = txt + img
    finalData.append(data)

print(finalData)
Output:
['Game DateBat TeamFld TeamPitcherResultEV (MPH)LA (°)Dist (ft)DirectionPitch (MPH)Pitch TypeNone', '1 2020-08-13 Burnes, Corbin field_out 104.1 24 400 Straightaway 95.7 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '2 2020-08-13 Burnes, Corbin walk 89.2 Slider <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '3 2020-08-13 Anderson, Brett hit_by_pitch 89.5 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>' ........]
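The output above joins every row into a single string, so the empty cells are lost and the columns no longer line up. If you need one value per column, here is a rough sketch that keeps each cell separate and puts the image URL into image-only cells. Note the assumptions: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without a browser, the class name RowCellParser is made up for this example, and sample_html is a hand-written stand-in modeled on the output above, not the real page source.

```python
from html.parser import HTMLParser


class RowCellParser(HTMLParser):
    """Collect table rows as lists of strings, one entry per <td>/<th> cell."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            # An image-only cell has no text, so keep the image URL instead.
            src = dict(attrs).get("src")
            if src:
                self._cell.append(src)

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Empty cells become "" so every row keeps the same length.
            self._row.append(" ".join(part for part in self._cell if part))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup, shortened to five columns for illustration.
sample_html = """
<table>
  <tr><th>Game Date</th><th>Bat Team</th><th>Pitcher</th><th>Result</th><th>EV (MPH)</th></tr>
  <tr><td>2020-08-13</td>
      <td><img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/></td>
      <td>Burnes, Corbin</td><td>walk</td><td></td></tr>
</table>
"""

parser = RowCellParser()
parser.feed(sample_html)
rows = parser.rows
for row in rows:
    print(row)
```

With the Selenium code above you would feed it driver.page_source instead of sample_html. That collects the rows of every table on the page, so you would probably still want to narrow it down to the gamelogs_statcast block first.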
Hope this helps. Let me know if you need anything else with this answer.