Can't scrape titles from a website while clicking on the next page button
I've written a script in Python in combination with Selenium to scrape the links of different articles from different pages while clicking on the next-page button, and to get the title of each article from its inner page. Although the content I'm dealing with here is static, I used Selenium to see how it parses items while clicking the next-page button. I'm only after a solution related to Selenium.

If I define a blank list and extend all the links to it, then eventually I can parse all the titles by reusing those links from their inner pages once clicking on the next-page button is done, but that is not what I want.

What I intend to do instead is collect all the links from each page and parse the title of each post from its inner page while clicking the next-page button. In short, I wish to do the two things simultaneously.
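The interleaving I'm after can be sketched in plain Python, without Selenium. Here `fetch_page_links` and `fetch_title` are hypothetical stand-ins for the Selenium calls; the point is only the control flow, where each page's links are collected and their titles yielded before moving on to the next page:

```python
def fetch_page_links(page):
    # stand-in: pretend each listing page exposes two article links
    return [f"/questions/{page}-{i}" for i in range(2)]

def fetch_title(link):
    # stand-in: pretend each inner page has a title derived from its link
    return f"Title of {link}"

def get_titles(pages):
    for page in pages:                       # one iteration per listing page
        for link in fetch_page_links(page):  # collect this page's links...
            yield fetch_title(link)          # ...and yield each title now,
                                             # before "clicking next page"

print(list(get_titles([1, 2])))
```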
I've tried with:
```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".summary .question-hyperlink")))]
        yield from get_info(items)
        try:
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();", elem)
            elem.click()
            time.sleep(2)
        except Exception:
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    for item in get_links(link):
        print(item)
```
When I run the above script, it parses the titles of different articles by reusing the links from the first page, but then it throws this error:

```
raise TimeoutException(message, screen, stacktrace)
```

pointing at this line:

```python
elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".pager > a[rel='next']")))
```
How can I scrape the title of each post from its inner page, using the links collected from the landing page, and then click the next-page button to repeat the process until it is done?
The reason the next button can't be found is that, after traversing each inner link, the driver is no longer on the listing page at the end of that loop, so the next-button element isn't there.

You need to build each next-page URL yourself and navigate to it, like below:

```python
urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(pageno)  # where pageno starts from 2
```

Try the code below.
```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'
    npage = 2
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".summary .question-hyperlink")))]
        yield from get_info(items)
        driver.get(urlnext.format(npage))
        try:
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".pager > a[rel='next']")))
            npage = npage + 1
            time.sleep(2)
        except Exception:
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    for item in get_links(link):
        print(item)
```
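The URL-template logic in the code above can be factored into a small pure function, which also makes it easy to check without a browser. `next_page_url` is a hypothetical helper name, not part of the answer's code:

```python
def next_page_url(pageno, pagesize=30):
    # Build the listing URL for a given result page; for the pages after
    # the landing page, pageno starts at 2.
    return ("https://stackoverflow.com/questions/tagged/web-scraping"
            "?tab=newest&page={}&pagesize={}").format(pageno, pagesize)

print(next_page_url(2))
# → https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=30
```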