Scraping data from a website that obscures its data (Python)

Question

I am trying to scrape the individual batted-ball data from individual player URLs. Here is an example: https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020

The site seems to hide the data, or at least I can't get it using:

driver = webdriver.Chrome('/Users/gru/Documents/chromedriver')
driver.get('https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020')
html_page = driver.page_source
time.sleep(15)
soup = BeautifulSoup(html_page, 'lxml')
for j in soup.find_all('tr'):
    drounders=[]
    for h in j.find_all('td'):
        drounders.append(h.get_text())
    print(drounders)

Here are the first few expected rows:

#  Game Date   Bat Team  Fld Team  Pitcher           Result     EV (MPH)  LA (°)  Dist (ft)  Direction     Pitch (MPH)  Pitch Type
1  2020-08-12                      Carrasco, Carlos  strikeout
2  2020-08-12                      Carrasco, Carlos  strikeout
3  2020-08-12                      Carrasco, Carlos  force_out                               Opposite
4  2020-08-11                      Allen, Logan      force_out  107.8     -25     5          Pull          94.0         4-Seam Fastball
5  2020-08-11                      Allen, Logan      strikeout                                             77.3         Curveball
6  2020-08-11                      Hill, Cam         sac_fly    100.5     42      345        Straightaway  91.6         4-Seam Fastball

Answer 1

The only problem I see here is the Bat Team column, because it contains an image rather than text. In this answer I scrape the image tag (which carries the logo link) from the Bat Team column and append it at the end of each row. If you don't need it, remove img from the for loop.

Code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time


site = 'https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020'
finalData = []
# Chrome's web driver. executable_path is the Selenium 3 style;
# Selenium 4 takes a Service object instead.
driver = webdriver.Chrome(executable_path='chromedriver.exe')
# For Firefox's web driver:
# driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get(site)
time.sleep(10)  # give the page's JavaScript time to render the table
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
table = soup.find("div", id="gamelogs_statcast")
trs = table.find_all("tr")
for trValue in trs:
    txt = trValue.text
    img = str(trValue.find("img"))  # 'None' for rows without a logo, such as the header
    finalData.append(txt + img)

print(finalData)

Output:

['Game DateBat TeamFld TeamPitcherResultEV (MPH)LA (°)Dist (ft)DirectionPitch (MPH)Pitch TypeNone', '1 2020-08-13   Burnes, Corbin field_out 104.1 24 400 Straightaway 95.7 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '2 2020-08-13   Burnes, Corbin walk     89.2 Slider <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '3 2020-08-13   Anderson, Brett hit_by_pitch     89.5 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>' ........]
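
Each row above comes back as one fused string with the logo tag appended. If separate cells are easier to work with, the rows can be split cell by cell, putting the logo URL wherever a cell holds only an image. The following is a minimal sketch, not the method used in the code above. It uses Python's standard-library html.parser instead of BeautifulSoup so that it runs without extra installs. SAMPLE_HTML, RowCollector and parse_rows are hypothetical names, and the sample markup assumes a reduced set of columns rather than the real page structure.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, loosely shaped like the rows in the output above.
SAMPLE_HTML = """
<div id="gamelogs_statcast"><table>
<tr><th></th><th>Game Date</th><th>Bat Team</th><th>Pitcher</th><th>Result</th></tr>
<tr><td>1</td><td>2020-08-13</td>
    <td><img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/></td>
    <td>Burnes, Corbin</td><td>field_out</td></tr>
<tr><td>2</td><td>2020-08-13</td>
    <td><img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/></td>
    <td>Burnes, Corbin</td><td>walk</td></tr>
</table></div>
"""


class RowCollector(HTMLParser):
    """Collect table rows as lists of cells; an <img> contributes its src."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            # Use the image URL in place of the missing cell text.
            self._cell.append(dict(attrs).get("src") or "")

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return every <tr> in the given HTML as a list of cell strings."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

With Selenium, the same function could be fed driver.page_source once the table has rendered. Note that it collects every tr on the page, not only those inside the gamelogs_statcast div, so some filtering may still be needed.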

Hope this helps. Let me know if you need any further help with this answer.
