简体   繁体   中英

Scraping tables from website using Selenium and Python

I'm new to python and was trying to scrape data from a website. So far, i can successfully scrape the data however the output is given in a linear format instead of the table its pulled from. Could someone give me some pointers on how to get the correct output?

import datetime
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_experimental_option( "prefs",{'profile.managed_default_content_settings.javascript': 2})
chrome_options.add_argument("-headless")
chrome = webdriver.Chrome('chromedriver', options = chrome_options)

today = datetime.date.today()
first = today.replace(day=1)
lastMonth = first - datetime.timedelta(days=1)

date_range = pd.date_range(start='2019-01-01', end=lastMonth, freq='MS')
date_range.strftime("%B/%Y")

date_list = [i.strftime('%B/%Y') for i in date_range]

base_url = "https://www.horseracing.net/tipsters/sky-sports-racing/alex-hammond/"

length = len(date_list)
i = 41 # set to 41 for testing purposes, I didnt want to loop all the tables until i sorted the output. 


while i < length:
    page_url = str(base_url)+""+str(date_list[i])
    chrome.get(page_url)
    EL2 = chrome.find_element(By.CLASS_NAME, 'main-table-container')
    print(EL2.text)
    i += 1

Current Output

Date
Pos
Tip
Track
Stake
Type
Odds
Result
30th
5th
Clipsham Gold
Yarmouth
1.0 WIN
Nap
9/4
LOST
29th
2nd
Cavendish
Bath
1.0 WIN
Nap
9/4
LOST
28th
2nd
King Of War
Brighton
1.0 WIN
Nap
2/1
LOST
27th
1st
The Vollan
Southwell
1.0 WIN
Nap
11/8
WON

Desired Output (all the sites appended into one dataframe, so I can export to an .xlsx)

screenshot

To create a DataFrame using Pandas from the This Month's Tips table within the website you need to induce WebDriverWait for the visibility_of_all_elements_located() and you can use the following Locator Strategies :

  • Using CSS_SELECTOR :

     driver.get('https://www.horseracing.net/tipsters/sky-sports-racing/alex-hammond/') WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.cookie-accept-button.cookie-button-link"))).click() WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.close-modal"))).click() date = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-left-section']//span[@class='tiping-row-text' and contains(., '/')]")))] pose = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-left-section']//div[@class='tiping-row-cell position-cell']//span")))] tip = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-left-section']//div[@class='tiping-row-cell position-cell']//following::div[1]/span")))] track = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//span[text()='Track']//following::span[1]")))] stake = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//span[text()='Stake']//following::span[1]")))] types = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//span[text()='Type']//following::span[1]")))] odd = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//span[text()='Odds']//following::span[1]")))] result = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//div[@class='tiping-row-cell mobile-cell-hide']/span")))] df = pd.DataFrame(data=list(zip(date, pose, tip, track, stake, types, odd, result)), columns=['Date', 'Pos', 'Tip', 'Track', 'Stake', 'Type', 'Odds', 'Result']) print(df)
  • Note : You have to add the following imports :

     import pandas as pd from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as EC
  • Console Output:

     Date Pos Tip Track Stake Type Odds Result 0 08/07/22 3rd Documenting York 1.0 WIN Nap 12/1 LOST 1 07/07/22 4th Zechariah Newmarket 1.0 WIN Nap 9/4 LOST 2 06/07/22 10th Lord Cherry Yarmouth 1.0 WIN Nap 9/2 LOST 3 05/07/22 3rd Landen Calling Uttoxeter 1.0 WIN Nap 13/8 LOST 4 04/07/22 PU De Barley Basket Worcester 1.0 WIN Nap 4/1 LOST 5 03/07/22 2nd Caius Marcius Market Rasen 1.0 WIN Nap 9/2 LOST 6 02/07/22 6th Chairman Sandown 1.0 WIN Nap 100/30 LOST 7 01/07/22 3rd Rum Cocktail Doncaster 1.0 WIN Nap 11/4 LOST

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM