Can't go on clicking on the next page button while scraping certain fields from a website
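I've written a script in Python with Selenium to scrape the links of different posts from different pages while clicking the next page button, and to get the title of each post from its inner page. Although the content I'm dealing with here is static, I used Selenium to see how it parses items while clicking through to the next page. I'm only after a solution that uses Selenium.
If I define an empty list and extend it with all the links, I can eventually parse all the titles by reusing those links once the clicking on the next page button is done, but that is not what I want.
What I intend to do instead is collect all the links from each page and parse the title of each post from its inner page while clicking on the next page button. In short, I want to do the two things at the same time.
I've tried with: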
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)

        try:
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();",elem)
            elem.click()
            time.sleep(2)
        except Exception:
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)
When I run the script above, it parses the titles of the posts linked from the first page, but then it fails with this error:
raise TimeoutException(message, screen, stacktrace)
when it reaches this line:
elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
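How can I scrape the title of each post from its inner page, using the links collected from the listing page, and then click the next page button to repeat the process until it is done?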
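The next button can't be found because, by the time the loop over the inner links finishes, the driver is sitting on the last question's page rather than on the listing page, and that page has no next button.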
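If you would rather keep clicking the pager, one option is to remember the listing URL before visiting the inner pages and reload it afterwards. The sketch below assumes only that the driver exposes `current_url` and `get()`, as a Selenium WebDriver does; `titles_then_return` and `read_title` are hypothetical names used for illustration, not part of Selenium.

```python
def titles_then_return(driver, links, read_title):
    """Yield one title per link, then reload the listing page.

    `driver` is any object with `current_url` and `get(url)`, such as a
    Selenium WebDriver. `read_title` is a hypothetical callable that
    extracts the title from the page the driver is currently on.
    """
    listing_url = driver.current_url  # the page that holds the pager
    try:
        for link in links:
            driver.get(link)
            yield read_title(driver)
    finally:
        # Return to the listing so the "next" link can be located again.
        driver.get(listing_url)
```

In the question's `get_links`, `yield from get_info(items)` could then become `yield from titles_then_return(driver, items, read_title)`, so that the driver is back on the listing page before it waits for the next link.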
One way around this is to build each next-page URL yourself and load it directly:
urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(pageno)  # pageno starts at 2
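The same URL can also be assembled with the standard library, which avoids stray spaces in the query string. This is a minimal sketch; `listing_url` is a hypothetical helper name, and the `tab`, `page`, and `pagesize` parameters are simply the ones that appear in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://stackoverflow.com/questions/tagged/web-scraping"

def listing_url(page, pagesize=30, tab="newest"):
    """Return the listing URL for the given page number."""
    query = urlencode({"tab": tab, "page": page, "pagesize": pagesize})
    return f"{BASE_URL}?{query}"

print(listing_url(2))
# https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=30
```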
Try the code below.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    # Template for the listing pages; the page number is filled in below.
    urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'
    npage = 2  # page 1 is loaded from `url`
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        # Visiting the inner pages navigates away from the listing page.
        yield from get_info(items)
        # Load the next listing page directly instead of clicking the pager.
        driver.get(urlnext.format(npage))
        try:
            # The wait times out when no "next" link is present, which ends the loop.
            # Note: the links on that final page are therefore not collected.
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            npage += 1
            time.sleep(2)
        except Exception:
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)
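A variation on the loop above stops when a listing page yields no links, rather than when the next link is missing, so the final page is scraped as well. This is only a sketch under assumptions: `scrape_all`, `load_links`, and `load_title` are hypothetical names, and with Selenium the two callables would wrap `driver.get()` plus an explicit wait, with `load_links` returning an empty list when the wait times out.

```python
from itertools import count

def scrape_all(load_links, load_title, first_page=1):
    """Yield titles page by page until a listing page has no links.

    `load_links(page)` returns the post links on a listing page and
    `load_title(link)` returns the title found on one inner page.
    Both are hypothetical callables supplied by the caller.
    """
    for page in count(first_page):
        links = load_links(page)
        if not links:  # nothing on this page: assume we are past the end
            break
        for link in links:
            yield load_title(link)
```

Keeping the browser calls behind two small callables also makes the pagination logic easy to check with stand-in functions before pointing it at a real driver.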