Remove <br> tags while web scraping with Selenium and Python for correct alignment

Question

I want to remove the <br> HTML tag while web scraping the page, but replace doesn't seem to work. I'm not sure if there is another or better way to do it using Selenium and Python. Thank you in advance.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("drivers/chromedriver")

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Hampshire")

driver.find_element_by_id("city").send_keys("Moultonborough")
driver.find_element_by_id("name").send_keys("Moultonborough Academy")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

courses_subheading = driver.find_elements_by_tag_name("th.header")

print(courses_subheading[0].text, "     ", courses_subheading[1].text, "     ", courses_subheading[2].text, "     ", courses_subheading[3].text, "     ", courses_subheading[4].text)

I tried this:

for i in courses_subheading:
    courses_subheading.replace("<br>", " ")

but I get an error: AttributeError: 'list' object has no attribute 'replace'

Currently, it looks like this:

Course
Weight     Title     Notes     Max
Credits       OK
Through       Disability
Course

But I want it like this:

Course Weight     Title     Notes     Max Credits     OK     Through     Disability Course

Answer 1

Instead of removing the <br> tags, you can easily avoid them. To print the table headers, e.g. Title, Notes, etc., you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

Using css_selector:

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
driver.find_element_by_css_selector("input#city").send_keys("Moultonborough")
driver.find_element_by_css_selector("input#name").send_keys("Moultonborough Academy")
driver.find_element_by_css_selector("input[value='Search']").click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='hsCode']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#approvedCourseTable_1 th.header")))])

Using xpath:

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
driver.find_element_by_xpath("//input[@id='city']").send_keys("Moultonborough")
driver.find_element_by_xpath("//input[@id='name']").send_keys("Moultonborough Academy")
driver.find_element_by_xpath("//input[@value='Search']").click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@name='hsCode']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='approvedCourseTable_1']//th[@class='header']")))])

Console output:

 ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
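
As a side note on the AttributeError in the question: replace is a string method, so it has to be called on each element's text rather than on the list itself, and judging by the console output above, the text comes back with newline characters where the page has <br> tags, not the literal "<br>". A minimal sketch, assuming the header texts match that console output; header_texts is a hypothetical stand-in for [el.text for el in courses_subheading] so the snippet runs without a browser:

```python
# Hypothetical stand-in for [el.text for el in courses_subheading],
# copied from the console output shown in this answer.
header_texts = ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']

# str.replace belongs to each string, not to the list, and it returns
# a new string, so the results are collected into a new list.
cleaned = [text.replace("\n", " ") for text in header_texts]
print(cleaned)
```

Because strings are immutable, calling replace inside a plain for loop without storing the result would leave the original texts unchanged.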

Answer 2

For completeness, if you really want to remove the br tags, you can use the following (I've fixed your XPath expression):

import re
courses_subheading = driver.find_elements_by_xpath("(//tr[th[@class='header']])[1]/th")
headers = [re.sub(r'\s+', ' ', el.text) for el in courses_subheading]
print(headers)

Output:

['Course Weight', 'Title', 'Notes', 'Max Credits', 'OK Through', 'Disability Course']
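
To get from this list to the single aligned line the question asks for, the cleaned headers can be joined with a fixed separator. A small sketch, assuming the headers list equals the output above and that five spaces is the desired gap (the separator width is taken from the question's print call, not from this answer):

```python
# Assumed to be the list produced by the re.sub comprehension above.
headers = ['Course Weight', 'Title', 'Notes', 'Max Credits', 'OK Through', 'Disability Course']

# Join all headers on one line, separated by five spaces.
line = "     ".join(headers)
print(line)
```

Using join avoids indexing each header by hand, so the line still prints correctly if the table gains or loses a column.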

Answer 1: accepted, posted 2020-08-03 12:52:31
Answer 2: posted 2020-08-03 15:26:26
