
Scraping data from a website that obscures its data (Python)

I am trying to scrape the individual batted-ball data from individual player URLs. Here is an example: https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020

The site seems to hide the data, or at least I can't get it with the following code:

driver = webdriver.Chrome('/Users/gru/Documents/chromedriver')
driver.get('https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020')
html_page = driver.page_source
time.sleep(15)
soup = BeautifulSoup(html_page, 'lxml')
for j in soup.find_all('tr'):
    drounders=[]
    for h in j.find_all('td'):
        drounders.append(h.get_text())
    print(drounders)

Here are the first few expected rows

   Game Date   Bat Team  Fld Team  Pitcher           Result     EV (MPH)  LA (°)  Dist (ft)  Direction     Pitch (MPH)  Pitch Type
1  2020-08-12                      Carrasco, Carlos  strikeout
2  2020-08-12                      Carrasco, Carlos  strikeout
3  2020-08-12                      Carrasco, Carlos  force_out                               Opposite
4  2020-08-11                      Allen, Logan      force_out  107.8     -25     5          Pull          94.0         4-Seam Fastball
5  2020-08-11                      Allen, Logan      strikeout                                             77.3         Curveball
6  2020-08-11                      Hill, Cam         sac_fly    100.5     42      345        Straightaway  91.6         4-Seam Fastball

The only problem I see here is the Bat Team column, because it contains an image rather than text. In this answer I scrape the image tag from the Bat Team column and append it at the end of each row. If you don't need it, remove img from the for loop.
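If a short label is more useful than the whole image tag, the logo URL can be reduced to its file name. The following is only a minimal sketch: it assumes the logo `src` values look like `https://www.mlbstatic.com/team-logos/112.svg`, and `logo_id` is a hypothetical helper name, not part of the code in this answer. Whether the number is a stable team identifier is also an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def logo_id(src):
    """Return the file name of a logo URL without its extension.

    Hypothetical helper: assumes the URL path ends in something
    like '/team-logos/112.svg'.
    """
    # urlparse drops any query string; PurePosixPath.stem drops '.svg'
    return PurePosixPath(urlparse(src).path).stem


print(logo_id("https://www.mlbstatic.com/team-logos/112.svg"))
```

The result is a string, so it can be stored next to the other cell values of the row.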

Code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time


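site = 'https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020'
finalData = []
# Here I am using Chrome's web driver.
# Note: newer Selenium 4 releases take the driver path through a Service object instead of executable_path.
driver = webdriver.Chrome(executable_path = 'chromedriver.exe')
# For the Firefox web driver:
# driver = webdriver.Firefox(executable_path = 'geckodriver.exe')
driver.get(site)
time.sleep(10)  # wait for the page to finish loading before reading its source
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # the HTML has been captured, so the browser can be closed
table = soup.find("div", id = "gamelogs_statcast")
trs = table.find_all("tr")
for trValue in trs:
    txt = str(trValue.text)         # text of every cell in the row
    img = str(trValue.find("img"))  # Bat Team logo tag; 'None' when the row has no image
    data = txt + img
    finalData.append(data)

print(finalData)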

Output:

['Game DateBat TeamFld TeamPitcherResultEV (MPH)LA (°)Dist (ft)DirectionPitch (MPH)Pitch TypeNone', '1 2020-08-13   Burnes, Corbin field_out 104.1 24 400 Straightaway 95.7 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '2 2020-08-13   Burnes, Corbin walk     89.2 Slider <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '3 2020-08-13   Anderson, Brett hit_by_pitch     89.5 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>' ........]
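The output above joins all cells of a row into one string. If one value per column is needed, as in the expected rows from the question, each `th`/`td` can be read separately. This is a rough sketch rather than a tested scraper for the live page: `rows_from_html` is a hypothetical helper, and `SAMPLE_HTML` is a hand-written stand-in whose structure is assumed, not copied from the site. With Selenium, `driver.page_source` would be passed in instead.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the rendered page (structure is an assumption).
SAMPLE_HTML = """
<div id="gamelogs_statcast">
  <table>
    <tr><th>Game Date</th><th>Bat Team</th><th>Pitcher</th><th>Result</th></tr>
    <tr>
      <td>2020-08-13</td>
      <td><img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/></td>
      <td>Burnes, Corbin</td>
      <td>walk</td>
    </tr>
  </table>
</div>
"""


def rows_from_html(html, container_id="gamelogs_statcast"):
    """Return one list of cell values per table row inside the container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:  # table not present (e.g. page not rendered yet)
        return []
    rows = []
    for tr in container.find_all("tr"):
        cells = []
        for cell in tr.find_all(["th", "td"]):
            img = cell.find("img")
            # Image cells contribute the logo URL, text cells their text
            cells.append(img.get("src", "") if img is not None else cell.get_text(strip=True))
        rows.append(cells)
    return rows


for row in rows_from_html(SAMPLE_HTML):
    print(row)
```

Keeping the cells separate means empty columns stay empty strings, so every row has the same length as the header.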

Hope this helps. Let me know if you need any other help with this answer.
