
python - scraping tables by navigating different options in drop down list

I'm trying to scrape data from this site: https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx

The website sets the default year to 2018 (the most recent year), and I want to scrape the table for every available year.

A very similar question was asked 4 years ago, but its solution doesn't seem to work for me:

scraping a response from a selected option in dropdown list

When I run it, it just prints the table for the default year, regardless of the parameter I pass.

I can't reach other years through the URL, because the URL doesn't change when I select an option in the drop-down box. So I tried using webdriver and XPath.

Here is the code I tried:

from selenium import webdriver
from bs4 import BeautifulSoup as BSoup

url = "https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx"

driver = webdriver.Chrome("/Applications/chromedriver")
driver.get(url)

year = 2017
driver.find_element_by_xpath("//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@value='"+str(year)+"']").click()
page = driver.page_source
bs_obj = BSoup(page, 'html.parser')

header_row = bs_obj.find_all('table')[0].find('thead').find('tr').find_all('th')
body_rows = bs_obj.find_all('table')[0].find('tbody').find_all('tr')
footer_row = bs_obj.find_all('table')[0].find('tfoot').find('tr').find_all('td')

headings = []
footings = []

for heading in header_row:
    headings.append(heading.get_text())

for footing in footer_row:
    footings.append(footing.get_text())

body = []

for row in body_rows:
    cells = row.find_all('td')
    row_temp = []
    for i in range(len(cells)):
        row_temp.append(cells[i].get_text())
    body.append(row_temp)

driver.quit()
print(headings)
print(body)
print(footings)

I expected this to print the table for 2017, the year I specified, but it actually prints the table for 2018 (the default year). Can anyone suggest how to solve this?

Edit: I just found out that what I see with "Inspect" differs from what I get from "Page Source". Specifically, the page source still has "2018" as the selected option (which is not what I want), whereas Inspect shows "2017" as selected. I'm still stuck on how to get what "Inspect" shows rather than the page source.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup as BSoup
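
url = "https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx"
# Path to the local chromedriver binary. Recent Selenium 4 releases no longer
# accept the path as a positional argument; there it is passed via a Service object.
driver = webdriver.Chrome("/Applications/chromedriver")
year = 2017
driver.get(url)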
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@value='"+str(year)+"']"))
)
element.click()

# Selecting a year makes the page refresh the table. Wait until the option
# marked as selected in the DOM shows the requested year before reading the
# page source; this replaces a fixed sleep.
WebDriverWait(driver, 3).until(
    EC.text_to_be_present_in_element(
        (By.XPATH, "//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@selected='selected']"),
        str(year)
    )
)
page = driver.page_source
bs_obj = BSoup(page, 'html.parser')

header_row = bs_obj.find_all('table')[0].find('thead').find('tr').find_all('th')
body_rows = bs_obj.find_all('table')[0].find('tbody').find_all('tr')
footer_row = bs_obj.find_all('table')[0].find('tfoot').find('tr').find_all('td')

headings = []
footings = []

for heading in header_row:
    headings.append(heading.get_text())

for footing in footer_row:
    footings.append(footing.get_text())

body = []

for row in body_rows:
    cells = row.find_all('td')
    row_temp = []
    for i in range(len(cells)):
        row_temp.append(cells[i].get_text())
    body.append(row_temp)

driver.quit()
print(headings)
print(body)

Output

['순위', '팀명', 'AVG', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'TB', 'RBI', 'SAC', 'SF']
[['1', 'KIA', '0.302', '144', '5841', '5142', '906', '1554', '292', '29', '170', '2414', '868', '55', '56'], ['2', '두산', '0.294', '144', '5833', '5102', '849', '1499', '270', '20', '178', '2343', '812', '48', '47'], ['3', 'NC', '0.293', '144', '5790', '5079', '786', '1489', '277', '19', '149', '2251', '739', '62', '48'], ['4', '넥센', '0.290', '144', '5712', '5098', '789', '1479', '267', '30', '141', '2229', '748', '21', '42'], ['5', '한화', '0.287', '144', '5665', '5030', '737', '1445', '261', '16', '150', '2188', '684', '85', '38'], ['6', '롯데', '0.285', '144', '5671', '4994', '743', '1425', '250', '17', '151', '2162', '697', '76', '32'], ['7', 'LG', '0.281', '144', '5614', '4944', '699', '1390', '216', '20', '110', '1976', '663', '76', '55'], ['8', '삼성', '0.279', '144', '5707', '5095', '757', '1419', '255', '36', '145', '2181', '703', '58', '55'], ['9', 'KT', '0.275', '144', '5485', '4937', '655', '1360', '274', '17', '119', '2025', '625', '62', '45'], ['10', 'SK', '0.271', '144', '5564', '4925', '761', '1337', '222', '15', '234', '2291', '733', '57', '41']]
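
Since the goal is to collect several years, it may help to turn the printed lists into records keyed by column heading and tagged with the season. The following is only a sketch: `rows_to_records` and the `year` key are hypothetical names, not part of the original code, and the sample data is the first two rows of the 2017 output above.

from typing import Dict, List


def rows_to_records(headings: List[str], body: List[List[str]], year: int) -> List[Dict[str, object]]:
    """Pair each row's cells with the column headings and tag the season."""
    records = []
    for row in body:
        # Guard against rows that do not line up with the header.
        if len(row) != len(headings):
            raise ValueError("row has %d cells, expected %d" % (len(row), len(headings)))
        record = dict(zip(headings, row))
        record["year"] = year  # hypothetical extra key for multi-year data
        records.append(record)
    return records


headings = ['순위', '팀명', 'AVG', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'TB', 'RBI', 'SAC', 'SF']
body = [
    ['1', 'KIA', '0.302', '144', '5841', '5142', '906', '1554', '292', '29', '170', '2414', '868', '55', '56'],
    ['2', '두산', '0.294', '144', '5833', '5102', '849', '1499', '270', '20', '178', '2343', '812', '48', '47'],
]
records = rows_to_records(headings, body, 2017)
print(records[0]['팀명'], records[0]['AVG'], records[0]['year'])

Records from different seasons can then be appended to one list without losing track of which year each row came from.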

After the click, you have to wait for the table to refresh before reading the page source. Also read my comments in the code. A fixed sleep is not the best way to do this.

Edit:

I have edited the code to wait until the text of the selected option is the year. The code no longer uses sleep.
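
To cover every available year rather than just 2017, the same wait-after-select idea can be put in a loop. This is a rough sketch that has not been run against the live site. It assumes Selenium 4 (`find_element(By.XPATH, ...)`), and it assumes that choosing a season replaces the table rows in the DOM, so that a reference to an old row goes stale. The names `SELECT_NAME`, `season_select_xpath`, `scrape_seasons`, and the `parse` callback are hypothetical; only the select name itself comes from the code above.

SELECT_NAME = ("ctl00$ctl00$ctl00$cphContents$cphContents$cphContents"
               "$ddlSeason$ddlSeason")  # name attribute used in the code above


def season_select_xpath():
    """XPath of the season drop-down."""
    return "//select[@name='" + SELECT_NAME + "']"


def scrape_seasons(driver, parse, timeout=10):
    """Return {year: parse(page_source)} for every option in the drop-down.

    `driver` is a Selenium WebDriver that has already loaded the stats page;
    `parse` is any function that turns the page HTML into table data.
    """
    # Imported here so the XPath helper above works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.support.ui import WebDriverWait

    xpath = season_select_xpath()
    row_locator = (By.CSS_SELECTOR, "table tbody tr")
    years = [option.get_attribute("value")
             for option in Select(driver.find_element(By.XPATH, xpath)).options]

    results = {}
    for year in years:
        # Re-find the drop-down each time: a refresh can invalidate old references.
        select = Select(driver.find_element(By.XPATH, xpath))
        if select.first_selected_option.get_attribute("value") != year:
            old_row = driver.find_element(*row_locator)
            select.select_by_value(year)
            # Assumption: the refresh replaces the rows, so the old row goes stale.
            WebDriverWait(driver, timeout).until(EC.staleness_of(old_row))
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located(row_locator))
        results[year] = parse(driver.page_source)
    return results

A call such as `scrape_seasons(driver, parse_table)` would then reuse the BeautifulSoup parsing from the code above, wrapped in a function (here called `parse_table`, also a hypothetical name). If the site updates the rows in place instead of replacing them, the staleness wait would time out, and waiting on the selected option's text, as in the code above, is the safer choice.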
