
Can't scrape titles from a website while clicking on the next page button

I've written a script in Python, in combination with Selenium, to scrape the links of different posts from different pages while clicking on the next page button, and to get the title of each post from its inner page. Although the content I'm dealing with here is static, I used Selenium to see how it parses items while clicking through the next pages. I'm only after a solution that uses Selenium.

Website address: https://stackoverflow.com/questions/tagged/web-scraping

If I define a blank list and extend all the links to it, then I can eventually parse all the titles by reusing those links from their inner pages once the clicking on the next page button is done, but that is not what I want (that collect-first approach is sketched below for contrast).
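For reference, a minimal sketch of that collect-first approach, reusing the same selectors and the driver/wait pair set up in my script further down (the function name get_links_collect_first is mine, not part of the original code):

def get_links_collect_first(url):
    # Gather every question link across all pages first.
    all_links = []
    driver.get(url)
    while True:
        all_links.extend(item.get_attribute("href") for item in wait.until(
            EC.visibility_of_all_elements_located(
                (By.CSS_SELECTOR, ".summary .question-hyperlink"))))
        try:
            # The driver never leaves the listing pages here,
            # so the next button is always findable until the last page.
            elem = wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, ".pager > a[rel='next']")))
            elem.click()
        except Exception:
            break
    # Only after all the clicking is done are the inner pages parsed.
    yield from get_info(all_links)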

However, what I intend to do is collect all the links from each page and parse the title of each post from its inner page while clicking on the next page button. In short, I wish to do the two things simultaneously.

I've tried with:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    driver.get(url)
    while True:
        # Grab the link of every question on the current listing page.
        items = [item.get_attribute("href") for item in wait.until(
            EC.visibility_of_all_elements_located(
                (By.CSS_SELECTOR, ".summary .question-hyperlink")))]
        # Visit each inner page and yield its title; note that this
        # navigates the driver away from the listing page.
        yield from get_info(items)

        try:
            elem = wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, ".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();", elem)
            elem.click()
            time.sleep(2)
        except Exception:
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)

When I run the above script, it parses the titles of different posts by reusing the links from the first page, but then it breaks, throwing raise TimeoutException(message, screen, stacktrace), when it hits the line elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']"))).
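For what it's worth, a minimal diagnostic (using Selenium's standard driver.current_url property) placed just before that failing wait shows what is going on: after get_info() has run, the driver is still sitting on the last inner question page, which has no .pager > a[rel='next'] element, so the wait times out.

# Placed just before the failing wait, purely for diagnosis:
print(driver.current_url)  # should print a /questions/<id>/... URL, not the tag listing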

How can I scrape the title of each post from its inner page, collecting the links from the current page and then clicking on the next page button, repeating the process until it is done?

The reason you are getting no next button is that after traversing each inner link inside that loop, the driver is left on the last inner page it visited, where the next button does not exist.

You need to build each next page URL explicitly, like below, and navigate to it instead.

urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(pageno)  # where page starts from 2

Try the code below.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    # Template for the listing pages; page numbering starts from 2.
    urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'
    npage = 2
    driver.get(url)
    while True:
        # Collect the question links on the current listing page.
        items = [item.get_attribute("href") for item in wait.until(
            EC.visibility_of_all_elements_located(
                (By.CSS_SELECTOR, ".summary .question-hyperlink")))]
        yield from get_info(items)
        # get_info() navigated away, so load the next listing page by its
        # URL instead of looking for a next button that is no longer there.
        driver.get(urlnext.format(npage))
        try:
            # A next button on the freshly loaded page means more pages follow.
            wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, ".pager > a[rel='next']")))
            npage = npage + 1
            time.sleep(2)
        except TimeoutException:
            # No next button: this is the last page, so scrape it
            # once more before stopping.
            items = [item.get_attribute("href") for item in wait.until(
                EC.visibility_of_all_elements_located(
                    (By.CSS_SELECTOR, ".summary .question-hyperlink")))]
            yield from get_info(items)
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)

    for item in get_links(link):
        print(item)
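Since the question explicitly asks to keep clicking the next page button, here is an alternative sketch that stays closer to that idea: remember the listing page's URL before visiting the inner pages, return to it, and only then click next. The listing_url variable is my own addition; everything else reuses the question's setup.

def get_links(url):
    driver.get(url)
    while True:
        # Remember the listing page so we can return after get_info().
        listing_url = driver.current_url
        items = [item.get_attribute("href") for item in wait.until(
            EC.visibility_of_all_elements_located(
                (By.CSS_SELECTOR, ".summary .question-hyperlink")))]
        yield from get_info(items)
        # Come back to the listing page, then click next as intended.
        driver.get(listing_url)
        try:
            elem = wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, ".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();", elem)
            elem.click()
            time.sleep(2)
        except TimeoutException:
            # No next button on the listing page: we are done, and the
            # last page has already been scraped above.
            break

Because the next button check happens only after the current page's items have been yielded, this variant also scrapes the last page before stopping.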
