
Scrapy does not scrape 'next page' data (it scraped only the first page)

I am not getting data from the next page (the first page works fine).

I tried a couple of approaches, shown below. In the first, I set robots_obey = false and download_delay = 8, and changed the user agent. In the second, I again tried changing the user agent to match the site, then tried overriding the request headers with that user agent, commenting the previous attempt out each time; robots_obey was again set to false. The platform is Python 3.6. The first method was tried on Windows 10 and Ubuntu 18; the second only on Windows.
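For reference, the lowercase names mentioned above map onto Scrapy's actual setting names; a minimal sketch of those overrides (which would go in settings.py, or per-spider via a custom_settings dict):

```python
# Sketch of the settings overrides described above, using Scrapy's
# real setting names (ROBOTSTXT_OBEY, DOWNLOAD_DELAY, USER_AGENT).
custom_settings = {
    'ROBOTSTXT_OBEY': False,   # "robots_obey = false"
    'DOWNLOAD_DELAY': 8,       # "download_delay = 8": seconds between requests
    # The user agent used in the attempts below:
    'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/79.0.3945.130 Safari/537.36'),
}
```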

Method 1

# -*- coding: utf-8 -*-
import scrapy


class ScrapeDfo2Spider(scrapy.Spider):
    name = 'scrape-dfo2'
    allowed_domains = ['canada.ca']
    start_urls = [
        'https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=']

    def parse(self, response):
        quotes = response.xpath('//*[@class="h5"]')
        for quote in quotes:
            title = quote.xpath('.//a/text()').extract_first()
            link = quote.xpath('.//a/@href').extract_first()

            yield {'Title': title,
                   'Link': link}

        next_page_url = response.xpath('//a[@rel="next"]/@href').extract()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))

Method 2

# -*- coding: utf-8 -*-
import scrapy


class ScrapeDfo2Spider(scrapy.Spider):
    name = 'scrape-dfo2'
    allowed_domains = ['canada.ca']
    # start_urls = ['https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=']

    def start_requests(self):
        yield scrapy.Request(url='https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=', callback=self.parse, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'})

    def parse(self, response):
        for quote in response.xpath('//*[@class="h5"]'):
            yield {
                'Title': quote.xpath('.//a/text()').get(),
                'Link': quote.xpath('.//a/@href').get(),
                'User-Agent': response.request.headers['User-Agent']}

        next_page_url = response.xpath('//a[@rel="next"]/@href').extract()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'})
    # def parse(self, response):
    #     quotes

I hope this helps you.

# -*- coding: utf-8 -*-
import scrapy


class CanadaSpider(scrapy.Spider):
    name = 'canada'
    allowed_domains = ['canada.ca']
    start_urls = ['https://www.canada.ca/en/news/advanced-news-search/news-results.html?start=&typ=newsreleases&end=&idx=0&dprtmnt=fisheriesoceans']

    page_count = 0

    def start_requests(self):
        for i in range(self.page_count, 690, 10):
            yield scrapy.Request(
                'https://www.canada.ca/en/news/advanced-news-search/news-results.html?start=&typ=newsreleases&end=&idx=%d&dprtmnt=fisheriesoceans' % i,
                callback=self.parse)

    def parse(self, response):

        quotes = response.xpath('//*[@class="h5"]')
        for quote in quotes:
            title = quote.xpath('.//a/text()').extract_first()
            link = quote.xpath('.//a/@href').extract_first()

            yield {'Title': title,
                   'Link': link}
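The search results advance ten items per page through the idx query parameter, so the start_requests loop above simply precomputes every page URL instead of following "next" links. A quick sketch of what that loop generates:

```python
# The idx parameter steps through result pages 10 items at a time;
# range(0, 690, 10) therefore yields one URL per page.
base = ('https://www.canada.ca/en/news/advanced-news-search/news-results.html'
        '?start=&typ=newsreleases&end=&idx=%d&dprtmnt=fisheriesoceans')
urls = [base % i for i in range(0, 690, 10)]
print(len(urls))   # 69 page URLs
print(urls[1])     # second page, with idx=10
```

Note the hard-coded upper bound of 690: if the site adds more results, the stop value has to be raised by hand, which is the trade-off against following rel="next" links.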

Some of the output is here:

{'Title': 'Canadian small businesses create innovative solutions to help reduce plastic pollution in our oceans', 'Link': 'https://www.canada.ca/en/fisheries-oceans/news/2020/06/canadian-small-businesses-create-innovative-solutions-to-help-reduce-plastic-pollution-in-our-oceans.html'}
2020-06-15 05:57:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.canada.ca/en/news/advanced-news-search/news-results.html?start=&typ=newsreleases&end=&idx=0&dprtmnt=fisheriesoceans>
{'Title': 'Government of Canada takes the fight against illegal fishing to outer space', 'Link': 'https://www.canada.ca/en/fisheries-oceans/news/2020/06/government-of-canada-takes-the-fight-against-illegal-fishing-to-outer-space.html'}
2020-06-15 05:57:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.canada.ca/en/news/advanced-news-search/news-results.html?start=&typ=newsreleases&end=&idx=0&dprtmnt=fisheriesoceans>
{'Title': 'Closed areas for shellfish harvesting on the North Shore', 'Link': 'https://www.canada.ca/en/fisheries-oceans/news/2020/06/closed-areas-for-shellfish-harvesting-on-the-north-shore.html'}

Answering my own question below (I have not yet tried the answer above).

The answer below is for Method 2. One key part is making sure that, at the end, the lines that handle pagination sit outside the for loop.

import scrapy

class ScrapeDfo2Spider(scrapy.Spider):
    name = 'scrape-dfo2'
    allowed_domains = ['www.canada.ca']
    # start_urls = ['https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=']

    def start_requests(self):
        yield scrapy.Request(url='https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=', callback=self.parse, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'})

    def parse(self, response):
        for quote in response.xpath('//*[@class="h5"]'):
            yield {
                'Title': quote.xpath('.//a/text()').get(),
                'Link': quote.xpath('.//a/@href').get(),
                'User-Agent': response.request.headers['User-Agent']}

        next_page_url = response.xpath('//a[@rel="next"]/@href').get()

        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'})
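The other change that matters here is switching the pagination XPath from .extract() to .get(): .extract() returns a list of matches, while Response.urljoin (a thin wrapper over the standard library's urllib.parse.urljoin) needs a single string, so handing it the list fails with a TypeError. A quick illustration using the standard library directly:

```python
from urllib.parse import urljoin  # Response.urljoin delegates to this

base = ('https://www.canada.ca/en/news/advanced-news-search/'
        'news-results.html?typ=newsreleases')

# .get() yields one href string (or None) -> urljoin works as expected:
href = 'news-results.html?typ=newsreleases&idx=10'
print(urljoin(base, href))  # absolute URL ending in idx=10

# .extract() yields a list -> urljoin rejects it:
try:
    urljoin(base, [href])
except TypeError as exc:
    print('urljoin needs a str, not a list:', exc)
```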


Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site or the original source.
