
Scrapy does not scrape 'next page' data (it scraped only the first page)

I am not getting data from the next page (the first page works fine).

I tried a couple of approaches, shown below. In the first, I set robots_obey = false and download_delay = 8, and changed the user agent. In the second, I again tried changing the user agent to match the site, then tried overriding the request headers with that user agent, commenting the previous attempt out each time; robots_obey was again set to false. The platform is Python 3.6. The first method was tried on Windows 10 and Ubuntu 18; the second only on Windows.
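For reference, the lowercase names mentioned above map onto Scrapy's actual setting names; a minimal sketch of those overrides (which would go in settings.py, or per-spider via a custom_settings dict):

```python
# Sketch of the settings overrides described above, using Scrapy's
# real setting names (ROBOTSTXT_OBEY, DOWNLOAD_DELAY, USER_AGENT).
custom_settings = {
    'ROBOTSTXT_OBEY': False,   # "robots_obey = false"
    'DOWNLOAD_DELAY': 8,       # "download_delay = 8": seconds between requests
    # The user agent used in the attempts below:
    'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/79.0.3945.130 Safari/537.36'),
}
```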

Method 1

# -*- coding: utf-8 -*-
import scrapy


class ScrapeDfo2Spider(scrapy.Spider):
    name = 'scrape-dfo2'
    allowed_domains = ['canada.ca']
    start_urls = [
        'https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=']

    def parse(self, response):
        quotes = response.xpath('//*[@class="h5"]')
        for quote in quotes:
            title = quote.xpath('.//a/text()').extract_first()
            link = quote.xpath('.//a/@href').extract_first()

            yield {'Title': title,
                   'Link': link}

        next_page_url = response.xpath('//a[@rel="next"]/@href').extract()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))

Method 2

# -*- coding: utf-8 -*-
import scrapy


class ScrapeDfo2Spider(scrapy.Spider):
    name = 'scrape-dfo2'
    allowed_domains = ['canada.ca']
    # start_urls = ['https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=']

    def start_requests(self):
        yield scrapy.Request(url='https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=', callback=self.parse, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'})

    def parse(self, response):
        for quote in response.xpath('//*[@class="h5"]'):
            yield {
                'Title': quote.xpath('.//a/text()').get(),
                'Link': quote.xpath('.//a/@href').get(),
                'User-Agent': response.request.headers['User-Agent']}

        next_page_url = response.xpath('//a[@rel="next"]/@href').extract()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'})
    # def parse(self, response):
    #     quotes

I hope this helps you.

# -*- coding: utf-8 -*-
import scrapy


class CanadaSpider(scrapy.Spider):
    name = 'canada'
    allowed_domains = ['canada.ca']
    start_urls = ['https://www.canada.ca/en/news/advanced-news-search/news-results.html?start=&typ=newsreleases&end=&idx=0&dprtmnt=fisheriesoceans']

    page_count = 0

    def start_requests(self):
        for i in range(self.page_count, 690, 10):
            yield scrapy.Request(
                'https://www.canada.ca/en/news/advanced-news-search/news-results.html?start=&typ=newsreleases&end=&idx=%d&dprtmnt=fisheriesoceans' % i,
                callback=self.parse)

    def parse(self, response):

        quotes = response.xpath('//*[@class="h5"]')
        for quote in quotes:
            title = quote.xpath('.//a/text()').extract_first()
            link = quote.xpath('.//a/@href').extract_first()

            yield {'Title': title,
                   'Link': link}
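The search results advance ten items per page through the idx query parameter, so the start_requests loop above simply precomputes every page URL instead of following "next" links. A quick sketch of what that loop generates:

```python
# The idx parameter steps through result pages 10 items at a time;
# range(0, 690, 10) therefore yields one URL per page.
base = ('https://www.canada.ca/en/news/advanced-news-search/news-results.html'
        '?start=&typ=newsreleases&end=&idx=%d&dprtmnt=fisheriesoceans')
urls = [base % i for i in range(0, 690, 10)]
print(len(urls))   # 69 page URLs
print(urls[1])     # second page, with idx=10
```

Note the hard-coded upper bound of 690: if the site adds more results, the stop value has to be raised by hand, which is the trade-off against following rel="next" links.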

Some of the output is here:

{'Title': 'Canadian small businesses create innovative solutions to help reduce plastic pollution in our oceans', 'Link': 'https://www.canada.ca/en/fisheries-oceans/news/2020/06/canadian-small-businesses-create-innovative-solutions-to-help-reduce-plastic-pollution-in-our-oceans.html'}
2020-06-15 05:57:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.canada.ca/en/news/advanced-news-search/news-results.html?start=&typ=newsreleases&end=&idx=0&dprtmnt=fisheriesoceans>
{'Title': 'Government of Canada takes the fight against illegal fishing to outer space', 'Link': 'https://www.canada.ca/en/fisheries-oceans/news/2020/06/government-of-canada-takes-the-fight-against-illegal-fishing-to-outer-space.html'}
2020-06-15 05:57:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.canada.ca/en/news/advanced-news-search/news-results.html?start=&typ=newsreleases&end=&idx=0&dprtmnt=fisheriesoceans>
{'Title': 'Closed areas for shellfish harvesting on the North Shore', 'Link': 'https://www.canada.ca/en/fisheries-oceans/news/2020/06/closed-areas-for-shellfish-harvesting-on-the-north-shore.html'}

Answering my own question below (I have not yet tried the answer above).

The answer below is for Method 2. One key part is making sure that, at the end, the lines that handle pagination sit outside the for loop.

import scrapy

class ScrapeDfo2Spider(scrapy.Spider):
    name = 'scrape-dfo2'
    allowed_domains = ['www.canada.ca']
    # start_urls = ['https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=']

    def start_requests(self):
        yield scrapy.Request(url='https://www.canada.ca/en/news/advanced-news-search/news-results.html?typ=newsreleases&dprtmnt=fisheriesoceans&start=&end=', callback=self.parse, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'})

    def parse(self, response):
        for quote in response.xpath('//*[@class="h5"]'):
            yield {
                'Title': quote.xpath('.//a/text()').get(),
                'Link': quote.xpath('.//a/@href').get(),
                'User-Agent': response.request.headers['User-Agent']}

        next_page_url = response.xpath('//a[@rel="next"]/@href').get()

        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'})
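The other change that matters here is switching the pagination XPath from .extract() to .get(): .extract() returns a list of matches, while Response.urljoin (a thin wrapper over the standard library's urllib.parse.urljoin) needs a single string, so handing it the list fails with a TypeError. A quick illustration using the standard library directly:

```python
from urllib.parse import urljoin  # Response.urljoin delegates to this

base = ('https://www.canada.ca/en/news/advanced-news-search/'
        'news-results.html?typ=newsreleases')

# .get() yields one href string (or None) -> urljoin works as expected:
href = 'news-results.html?typ=newsreleases&idx=10'
print(urljoin(base, href))  # absolute URL ending in idx=10

# .extract() yields a list -> urljoin rejects it:
try:
    urljoin(base, [href])
except TypeError as exc:
    print('urljoin needs a str, not a list:', exc)
```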


Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site or the original source.
