
Can't scrape titles from a website while clicking on the next page button

I've written a script in Python, in combination with Selenium, to scrape the links of different posts from different pages while clicking on the next page button, and to get the title of each post from its inner page. Although the content I'm trying to deal with here is static, I used Selenium to see how it parses items while clicking through the next pages. I'm only after a solution related to Selenium.

Website address: https://stackoverflow.com/questions/tagged/web-scraping

If I define a blank list and extend all the links to it, then eventually I can parse all the titles by reusing those links from their inner pages once the clicking on the next page button is done, but that is not what I want.

However, what I intend to do is collect all the links from each of the pages and parse the title of each post from its inner page while clicking on the next page button. In short, I wish to do the two things simultaneously.

I've tried with:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    driver.get(url)
    while True:
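        # Collect every post link on the current listing page.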
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)

        try:
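            # Scroll to the next-page button and click it; stop when it's missing.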
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();",elem)
            elem.click()
            time.sleep(2)
        except Exception:
            break

def get_info(links):
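    # Visit each collected link and yield the post's title.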
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)

When I run the above script, it parses the titles of different posts by reusing the links from the first page, but then breaks with raise TimeoutException(message, screen, stacktrace) when it hits the elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']"))) line.

How can I scrape the title of each post from its inner page, collecting the links from the first page, and then click on the next page button to repeat the process until it is done?

The reason you are getting no next button is that, by the time the loop has traversed each inner link, the driver is no longer on the listing page, so the next button can't be found.

You need to build each next URL like the one below and load it directly:

urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(pageno)  # where page will start from 2

Try the code below.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'
    npage = 2
    driver.get(url)
    while True:
        # Collect every post link on the current listing page.
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]

        # Check for a next-page link *before* get_info() navigates away,
        # so the last page of results still gets scraped.
        try:
            wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            has_next = True
        except TimeoutException:
            has_next = False

        yield from get_info(items)
        if not has_next:
            break

        # Load the next listing page by URL instead of clicking the pager,
        # since the driver is now on an inner post page.
        driver.get(urlnext.format(npage))
        npage += 1
        time.sleep(2)

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)

    for item in get_links(link):
        print(item)
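
As a side note, here is a minimal alternative sketch (not part of the original answer) that avoids hard-coding the URL template: it reads the href of the next-page link before visiting the inner pages, then loads that URL directly once get_info() has returned. It reuses the driver, wait, get_info() and imports defined above.

def get_links(url):
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]

        # Remember where the next page lives *before* navigating away.
        try:
            next_url = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']"))).get_attribute("href")
        except TimeoutException:
            next_url = None  # no next link: this is the last page

        yield from get_info(items)

        if next_url is None:
            break
        driver.get(next_url)
        time.sleep(2)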
