
Scraping data from a website that obscures its data (Python)

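I am trying to scrape individual at-bat data from a number of URLs. Here is an example ( https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020 )

The site seems to hide the data, or else I am unable to get at it with the following code: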

driver = webdriver.Chrome('/Users/gru/Documents/chromedriver')
driver.get('https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020')
html_page = driver.page_source
time.sleep(15)
soup = BeautifulSoup(html_page, 'lxml')
for j in soup.find_all('tr'):
    drounders=[]
    for h in j.find_all('td'):
        drounders.append(h.get_text())
    print(drounders)

These are the first few rows I expect to get:

#  Game Date   Bat Team  Fld Team  Pitcher           Result     EV (MPH)  LA (°)  Dist (ft)  Direction     Pitch (MPH)  Pitch Type
1  2020-08-12                      Carrasco, Carlos  strikeout
2  2020-08-12                      Carrasco, Carlos  strikeout
3  2020-08-12                      Carrasco, Carlos  force_out                               Opposite
4  2020-08-11                      Allen, Logan      force_out  107.8     -25     5          Pull          94.0         4-Seam Fastball
5  2020-08-11                      Allen, Logan      strikeout                                             77.3         Curveball
6  2020-08-11                      Hill, Cam         sac_fly    100.5     42      345        Straightaway  91.6         4-Seam Fastball

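The only problem I see here is the Bat Team column, because that column contains images instead of text. In this answer I scrape the image tag (with its link) from the Bat Team column and append it at the end of each row. If you want to ignore it, remove the img line from the for loop.

Code: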

from selenium import webdriver
from bs4 import BeautifulSoup
import time


site = 'https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020'
finalData = []
# executable_path works with Selenium 3; Selenium 4 takes the driver path through a Service object instead
driver = webdriver.Chrome(executable_path = 'chromedriver.exe') # Here I am using Chrome's web driver
#For Firefox Web driver
#driver = webdriver.Firefox(executable_path = 'geckodriver.exe') 
driver.get(site)
time.sleep(10)  # give the page's JavaScript time to render the table
soup = BeautifulSoup(driver.page_source, 'html.parser')  # read the HTML only after waiting
driver.quit()  # the HTML is captured, so the browser can be closed
table = soup.find("div", id = "gamelogs_statcast")
trs = table.find_all("tr")
for trValue in trs:
    txt = str(trValue.text)          # all cell text of the row
    img = str(trValue.find("img"))   # first <img> tag in the row, or 'None' if there is none
    data = txt + img
    finalData.append(data)

print(finalData)

Output:

['Game DateBat TeamFld TeamPitcherResultEV (MPH)LA (°)Dist (ft)DirectionPitch (MPH)Pitch TypeNone', '1 2020-08-13   Burnes, Corbin field_out 104.1 24 400 Straightaway 95.7 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '2 2020-08-13   Burnes, Corbin walk     89.2 Slider <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '3 2020-08-13   Anderson, Brett hit_by_pitch     89.5 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>' ........]
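In the output above every row comes back as one long string, so the columns run together. If you would rather have one value per cell, here is a rough sketch of the idea. It is not the code from the answer: RowCellParser and parse_rows are names I made up, the sample HTML is a hand-written simplification (the real page markup may differ), and I use only the standard library's html.parser so it runs without extra installs. The same per-cell logic should carry over to BeautifulSoup by looping over the td elements of each tr.

```python
from html.parser import HTMLParser


class RowCellParser(HTMLParser):
    """Collect every <tr> as a list of cell values (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces / image URL of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            # Image-only cells (team logos) carry their value in src
            src = dict(attrs).get("src")
            if src:
                self._cell.append(src)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the non-empty pieces of the cell into one value
            self._row.append(" ".join(p for p in self._cell if p))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = RowCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Simplified, hand-written sample; the real table has more columns
sample = """
<table>
  <tr><th>Game Date</th><th>Bat Team</th><th>Pitcher</th></tr>
  <tr><td>2020-08-13</td>
      <td><img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/></td>
      <td>Burnes, Corbin</td></tr>
</table>
"""
for row in parse_rows(sample):
    print(row)
```

With a live page you would pass driver.page_source to parse_rows instead of the sample string. Empty cells come back as empty strings, so every row keeps the same number of columns as the header.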

Hope this helps. Let me know if you need any further help with this answer.
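One more note on timing. driver.page_source is a snapshot of the page at the moment it is read, so it has to be read after the wait. The snippet in the question reads it before time.sleep(15), which may be why the rows look hidden there. A fixed sleep also either wastes time or is too short on a slow connection. Below is a minimal polling sketch as an alternative. wait_for_rows is a made-up helper, the marker default is the table-team-logo class seen in the output above, and the timeout and poll values are arbitrary assumptions. Selenium ships its own WebDriverWait for the same purpose, which is probably the better choice in real code.

```python
import time


def wait_for_rows(driver, marker="table-team-logo", timeout=30.0, poll=0.5):
    """Poll driver.page_source until `marker` appears, then return the HTML.

    `driver` is any object with a `page_source` attribute, for example a
    Selenium WebDriver. Raises TimeoutError if the marker never shows up.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # re-read the rendered page on every pass
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout} s")
        time.sleep(poll)
```

Usage would be html = wait_for_rows(driver) right after driver.get(site), and then BeautifulSoup(html, 'html.parser') as before.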

