
Removing <br> tag for proper alignment while webscraping using selenium and python

I want to remove the <br> HTML tag while scraping the page, but replace() doesn't seem to work. I'm not sure whether there is another or better way to do this with Selenium and Python. Thank you in advance.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("drivers/chromedriver")

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Hampshire")

driver.find_element_by_id("city").send_keys("Moultonborough")
driver.find_element_by_id("name").send_keys("Moultonborough Academy")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

courses_subheading = driver.find_elements_by_tag_name("th.header")

print(courses_subheading[0].text, "     ", courses_subheading[1].text, "     ", courses_subheading[2].text, "     ", courses_subheading[3].text, "     ", courses_subheading[4].text)

I tried this:

for i in courses_subheading:
    courses_subheading.replace("<br>", " ")

but I get an error: AttributeError: 'list' object has no attribute 'replace'

Currently, the output looks like this:

Course
Weight     Title     Notes     Max
Credits       OK
Through       Disability
Course

but I want it to look like this:

Course Weight     Title     Notes     Max Credits     OK     Through     Disability Course

You don't need to remove the <br> tags from the HTML: the text attribute of each header element contains only text, with every <br> rendered as a line break. To print the table headers (e.g. Title, Notes, etc.), induce WebDriverWait for visibility_of_all_elements_located(), using either of the following Locator Strategies:

  • Using css_selector :

     driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
     Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
     driver.find_element_by_css_selector("input#city").send_keys("Moultonborough")
     driver.find_element_by_css_selector("input#name").send_keys("Moultonborough Academy")
     driver.find_element_by_css_selector("input[value='Search']").click()
     WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='hsCode']"))).click()
     print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#approvedCourseTable_1 th.header")))])
  • Using xpath :

     driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
     Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
     driver.find_element_by_xpath("//input[@id='city']").send_keys("Moultonborough")
     driver.find_element_by_xpath("//input[@id='name']").send_keys("Moultonborough Academy")
     driver.find_element_by_xpath("//input[@value='Search']").click()
     WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@name='hsCode']"))).click()
     print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='approvedCourseTable_1']//th[@class='header']")))])
  • Console Output:

     ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']
  • Note : You have to add the following imports:

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC
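
The console output above still contains a \n wherever the page has a <br>. The AttributeError in the question comes from calling replace() on the list itself rather than on each string in it. A minimal sketch of the per-string version follows, assuming the header texts have already been collected as in the console output; the name flatten_headers is only illustrative and not part of Selenium:

```python
# Header texts copied from the console output above; in a live session they
# would come from [el.text for el in elements].
raw_headers = ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']


def flatten_headers(texts):
    """Replace the line break left by each <br> with a space, one string at a time."""
    # str.replace exists on each string, not on the list that holds them.
    return [text.replace("\n", " ") for text in texts]


print(flatten_headers(raw_headers))
```

Searching for "<br>" in these strings would find nothing, because the text attribute holds the rendered text, not the markup.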

For completeness, if you really want to get rid of the line breaks left by the <br> tags, you can use the following (I've fixed your locator, using an XPath expression):

import re
courses_subheading = driver.find_elements_by_xpath("(//tr[th[@class='header']])[1]/th")
headers = [re.sub(r'\s+', ' ', el.text) for el in courses_subheading]
print(headers)

Output:

['Course Weight', 'Title', 'Notes', 'Max Credits', 'OK Through', 'Disability Course']
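
To get from this list to the single line requested in the question, the cleaned headers can be joined with a fixed gap. This is only a sketch: format_header_row and its gap parameter are hypothetical names, and the five-space gap is taken from the spacing used in the question's print call.

```python
# Cleaned header texts, as in the output above.
headers = ['Course Weight', 'Title', 'Notes', 'Max Credits', 'OK Through', 'Disability Course']


def format_header_row(cells, gap=5):
    """Join header cells into one line, separated by `gap` spaces."""
    # " ".join(cell.split()) collapses any run of whitespace (including \n)
    # inside a cell, so the function also accepts the uncleaned texts.
    cleaned = [" ".join(cell.split()) for cell in cells]
    return (" " * gap).join(cleaned)


print(format_header_row(headers))
```

Because split() without arguments splits on any whitespace, this does the same normalization as the re.sub call without needing the re module.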

