
Can't scrape titles from a website while clicking on the next page button

I've written a script in Python, in combination with Selenium, to scrape the links of different posts from different pages while clicking on the next page button, and to get the title of each post from its inner page. Although the content I'm trying to deal with here is static, I used Selenium to see how it parses items while clicking through the next pages. I'm only after a solution related to Selenium.

Website address: https://stackoverflow.com/questions/tagged/web-scraping

If I define a blank list and extend all the links to it, then eventually I can parse all the titles by reusing those links from their inner pages once the clicking on the next page button is done, but that is not what I want.

However, what I intend to do is collect all the links from each of the pages and parse the title of each post from its inner page while clicking on the next page button. In short, I wish to do the two things simultaneously.

I've tried with:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    driver.get(url)
    while True:
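        # Collect every post link on the current listing page.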
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)

        try:
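            # Scroll to the next-page button and click it; stop when it's missing.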
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();",elem)
            elem.click()
            time.sleep(2)
        except Exception:
            break

def get_info(links):
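    # Visit each collected link and yield the post's title.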
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)

When I run the above script, it parses the titles of different posts by reusing the links from the first page, but then breaks with raise TimeoutException(message, screen, stacktrace) when it hits the elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']"))) line.

How can I scrape the title of each post from its inner page, collecting the links from the first page, and then click on the next page button to repeat the process until it is done?

The reason you are getting no next button is that, by the time the loop has traversed each inner link, the driver is no longer on the listing page, so the next button can't be found.

You need to build each next URL like the one below and load it directly:

urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(pageno)  # where page will start from 2

Try the code below.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'
    npage = 2
    driver.get(url)
    while True:
        # Collect every post link on the current listing page.
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]

        # Check for a next-page link *before* get_info() navigates away,
        # so the last page of results still gets scraped.
        try:
            wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            has_next = True
        except TimeoutException:
            has_next = False

        yield from get_info(items)
        if not has_next:
            break

        # Load the next listing page by URL instead of clicking the pager,
        # since the driver is now on an inner post page.
        driver.get(urlnext.format(npage))
        npage += 1
        time.sleep(2)

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)

    for item in get_links(link):
        print(item)
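
As a side note, here is a minimal alternative sketch (not part of the original answer) that avoids hard-coding the URL template: it reads the href of the next-page link before visiting the inner pages, then loads that URL directly once get_info() has returned. It reuses the driver, wait, get_info() and imports defined above.

def get_links(url):
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]

        # Remember where the next page lives *before* navigating away.
        try:
            next_url = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']"))).get_attribute("href")
        except TimeoutException:
            next_url = None  # no next link: this is the last page

        yield from get_info(items)

        if next_url is None:
            break
        driver.get(next_url)
        time.sleep(2)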
