Removing <br> tag for proper alignment while webscraping using selenium and python
I want to remove the <br> HTML tag while web scraping the page, but replace doesn't seem to work. I'm not sure if there is another or better way to do it using Selenium and Python. Thank you in advance.
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome("drivers/chromedriver")
driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Hampshire")
driver.find_element_by_id("city").send_keys("Moultonborough")
driver.find_element_by_id("name").send_keys("Moultonborough Academy")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()
courses_subheading = driver.find_elements_by_tag_name("th.header")
print(courses_subheading[0].text, " ", courses_subheading[1].text, " ", courses_subheading[2].text, " ", courses_subheading[3].text, " ", courses_subheading[4].text)
I tried this:
for i in courses_subheading:
    courses_subheading.replace("<br>", " ")
but I get an error:
AttributeError: 'list' object has no attribute 'replace'
Currently, it looks like this:
Course
Weight Title Notes Max
Credits OK
Through Disability
Course
But I want it like this:
Course Weight Title Notes Max Credits OK Through Disability Course
Instead of removing the <br> tags, you can easily avoid them. To print the table headers, e.g. Title, Notes, etc., you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:
Using css_selector:
driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
driver.find_element_by_css_selector("input#city").send_keys("Moultonborough")
driver.find_element_by_css_selector("input#name").send_keys("Moultonborough Academy")
driver.find_element_by_css_selector("input[value='Search']").click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='hsCode']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#approvedCourseTable_1 th.header")))])
Using xpath:
driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
driver.find_element_by_xpath("//input[@id='city']").send_keys("Moultonborough")
driver.find_element_by_xpath("//input[@id='name']").send_keys("Moultonborough Academy")
driver.find_element_by_xpath("//input[@value='Search']").click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@name='hsCode']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='approvedCourseTable_1']//th[@class='header']")))])
Console Output:
['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']
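The "\n" characters in this output mark where each <br> was rendered, so getting the single line asked for in the question is a matter of normalizing whitespace in the strings. A minimal sketch, with the list above copied in as a literal so it runs without a browser (the names `headers` and `single_line` are just illustrative):

```python
# Header texts copied from the console output above.
headers = ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']

# str.split() with no arguments splits on any whitespace, including "\n",
# so joining, splitting, and re-joining flattens everything onto one line.
single_line = " ".join(" ".join(headers).split())
print(single_line)
```

In a live session the literal list would be replaced by the list comprehension passed to print() in either snippet above.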
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
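As for the AttributeError in the question: the find_elements_* calls return a Python list of elements, and replace is a string method, so it has to be called on each element's .text rather than on the list. The text also contains a newline where the <br> was, not the literal tag. A small sketch of both points, using plain strings as hypothetical stand-ins for the .text values:

```python
# Hypothetical stand-ins for the strings returned by each element's .text.
texts = ["Course\nWeight", "Title", "Notes"]

# Calling .replace on the list itself reproduces the error from the question.
try:
    texts.replace("<br>", " ")
except AttributeError as exc:
    error_message = str(exc)

# Calling it on each string works; the break is a "\n", not "<br>".
cleaned = [text.replace("\n", " ") for text in texts]
print(error_message)
print(cleaned)
```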
To complete, if you really want to remove the br tags, you can use (I've fixed your XPath expression):
import re
courses_subheading = driver.find_elements_by_xpath("(//tr[th[@class='header']])[1]/th")
headers = [re.sub(r'\s+', ' ', el.text) for el in courses_subheading]
print(headers)
Output:
['Course Weight', 'Title', 'Notes', 'Max Credits', 'OK Through', 'Disability Course']
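One caveat for newer installs: the find_elements_by_* helpers used in this thread were removed in Selenium 4.3, where driver.find_elements(By.XPATH, ...) is the supported form. Below is a rough sketch of the same cleanup in that style. The names header_texts, FakeCell, and FakeDriver are hypothetical and introduced only for illustration; the fake driver stands in for a real webdriver.Chrome() so the snippet runs without a browser.

```python
import re

# XPath taken from the answer above.
HEADER_XPATH = "(//tr[th[@class='header']])[1]/th"


def header_texts(driver):
    """Return header cell texts with runs of whitespace collapsed."""
    # By.XPATH is the string "xpath"; the literal keeps this sketch
    # free of a selenium import.
    cells = driver.find_elements("xpath", HEADER_XPATH)
    return [re.sub(r"\s+", " ", cell.text).strip() for cell in cells]


# Hypothetical stand-ins so the sketch runs without a browser.
class FakeCell:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def find_elements(self, by, value):
        return [FakeCell("Course\nWeight"), FakeCell("Title"), FakeCell("Max\nCredits")]


print(header_texts(FakeDriver()))
```

With a real session, header_texts(driver) would be called after the page has loaded, and By.XPATH can be passed in place of the "xpath" literal.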