Removing <br> tags for correct alignment when web scraping with Selenium and Python

Question

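I want to remove the <br> HTML tags while scraping a page, but replace does not seem to work. I am not sure whether there is another or better way to do this with Selenium and Python. Thanks in advance.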

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("drivers/chromedriver")

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Hampshire")

driver.find_element_by_id("city").send_keys("Moultonborough")
driver.find_element_by_id("name").send_keys("Moultonborough Academy")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

courses_subheading = driver.find_elements_by_tag_name("th.header")

print(courses_subheading[0].text, "     ", courses_subheading[1].text, "     ", courses_subheading[2].text, "     ", courses_subheading[3].text, "     ", courses_subheading[4].text)

I tried this:

for i in courses_subheading:
    courses_subheading.replace("<br>", " ")

But I get an error: AttributeError: 'list' object has no attribute 'replace'

Currently, the output looks like this:

Course
Weight     Title     Notes     Max
Credits       OK
Through       Disability
Course

But I want it to look like this:

Course Weight     Title     Notes     Max Credits     OK     Through     Disability Course

Answer 1

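Instead of removing the <br> tags, you can simply work around them. To print the table headers (Title, Notes, etc.), you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies: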

Using css_selector:

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
driver.find_element_by_css_selector("input#city").send_keys("Moultonborough")
driver.find_element_by_css_selector("input#name").send_keys("Moultonborough Academy")
driver.find_element_by_css_selector("input[value='Search']").click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='hsCode']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#approvedCourseTable_1 th.header")))])

Using xpath:

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
driver.find_element_by_xpath("//input[@id='city']").send_keys("Moultonborough")
driver.find_element_by_xpath("//input[@id='name']").send_keys("Moultonborough Academy")
driver.find_element_by_xpath("//input[@value='Search']").click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@name='hsCode']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='approvedCourseTable_1']//th[@class='header']")))])

Console output:

 ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
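
The list in the console output above still contains "\n" wherever a <br> was rendered, because .text returns the visible text of each header cell. As a minimal sketch (not part of the original answer), the strings can be flattened onto one line with plain string methods. The sample list is copied from the console output, and the five-space separator is an assumption taken from the print call in the question.

```python
# Sample data copied from the console output above; in a real run these
# strings would come from my_elem.text for each header cell.
headers = ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']

# Each <br> shows up as "\n" in .text, so replace it on every string
# (str.replace), not on the list itself.
flat = [h.replace("\n", " ") for h in headers]

# Join with five spaces, mirroring the separator used in the question.
line = "     ".join(flat)
print(line)
```

Calling replace on each string, rather than on the list, is what avoids the AttributeError reported in the question.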

Answer 2

For completeness, if you really want to remove the br tags, you can use the following (I have fixed your XPath expression):

import re
courses_subheading = driver.find_elements_by_xpath("(//tr[th[@class='header']])[1]/th")
headers = [re.sub(r'\s+', ' ', el.text) for el in courses_subheading]
print(headers)

Output:

['Course Weight', 'Title', 'Notes', 'Max Credits', 'OK Through', 'Disability Course']
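
For context on the error in the question: the find_elements_* calls return a Python list, and replace is a method of str, not of list, so it has to be applied to each element's .text. The sketch below illustrates this without a browser. FakeElement is a hypothetical stand-in for a Selenium WebElement (only its .text attribute is mimicked), and the sample values are assumptions modeled on the header texts shown above. It also shows str.split() with join as a regex-free alternative to re.sub.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is mimicked."""

    def __init__(self, text):
        self.text = text


# Assumed sample values, modeled on the header texts shown in the answers.
courses_subheading = [
    FakeElement("Course\nWeight"),
    FakeElement("Title"),
    FakeElement("Max\nCredits"),
]

# Calling replace on the list itself reproduces the error from the question.
error_message = None
try:
    courses_subheading.replace("<br>", " ")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# split() with no arguments splits on any run of whitespace (including "\n"),
# so joining the pieces with a single space collapses the line breaks.
headers = [" ".join(el.text.split()) for el in courses_subheading]
print(headers)
```

The split/join form also trims leading and trailing whitespace, which re.sub(r'\s+', ' ', ...) alone does not.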


Answer 1
0 Accepted 2020-08-03 12:52:31

Answer 2
0 2020-08-03 15:26:26
