
How to determine if the generator returned from `yield scrapy.Request` has any data?

In the Scrapy Tutorial, the spider extracts the next-page link from `class="next"` and crawls it:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

For my case, I can't find the next-page links in the files downloaded from the webserver, but I know the format is `response.url` concatenated with `/page/[page number]/`. Requested pages which don't yield quotes still return a response, for example "No quotes found!". Since the number of next pages is normally less than 20, I could loop over all the possible URLs by replacing the last 3 lines of the spider with:

for page_num in range(2, 20):
    yield response.follow(f"/page/{page_num}/", callback=self.parse)

However, this forces the spider to request pages (such as http://quotes.toscrape.com/page/11 through 20) which don't yield quotes. How can I adjust my spider to terminate the `page_num` loop after requesting the first page which does not yield quotes (such as http://quotes.toscrape.com/page/11)?

Pseudo code:

    page_num = 2
    while (quotes are yielded from the response):
        yield response.follow(f"/page/{page_num}/", callback=self.parse)
        page_num += 1
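The stop-on-empty idea in the pseudo code above can be sketched outside Scrapy as a plain generator that keeps requesting page numbers until a page comes back empty. Here `fetch_quotes` is a hypothetical stand-in for the HTTP request plus `response.css('div.quote')` extraction:

```python
from typing import Callable, Iterator, List


def crawl_until_empty(fetch_quotes: Callable[[int], List[dict]],
                      start: int = 1) -> Iterator[dict]:
    """Yield quotes page by page, stopping at the first empty page."""
    page_num = start
    while True:
        quotes = fetch_quotes(page_num)  # stand-in for request + CSS extraction
        if not quotes:                   # empty list -> no more pages, stop
            return
        yield from quotes
        page_num += 1


# Fake site with quotes on pages 1-3 only (stand-in for the real webserver).
fake_site = {1: [{'text': 'a'}], 2: [{'text': 'b'}], 3: [{'text': 'c'}]}
results = list(crawl_until_empty(lambda n: fake_site.get(n, [])))
print(len(results))  # 3 quotes collected; the loop stopped at page 4
```

The same "check emptiness before continuing" shape is what the accepted answer below implements inside the spider's `parse` callback.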

You can use the result of `response.css('..')` as the condition for requesting the next page. In that case your code will look like this:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # get_pagenumber_from_url is a helper you need to define, e.g. by
        # parsing the trailing /page/<n>/ segment of response.url.
        page_num = get_pagenumber_from_url(response.url)

        quotes_sel = response.css('div.quote')
        # quotes_sel is a SelectorList; it is empty (and therefore falsy)
        # when the page has no quote data, so it works directly as a condition.
        for quote in quotes_sel:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        if quotes_sel:
            next_page_url = f"/page/{page_num + 1}/"
            yield response.follow(next_page_url, callback=self.parse)
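The answer leaves `get_pagenumber_from_url` undefined. One possible sketch, assuming URLs end in `/page/<n>/` as on quotes.toscrape.com, falling back to page 1 for the bare site root:

```python
import re


def get_pagenumber_from_url(url: str) -> int:
    """Extract the page number from a URL like .../page/7/ (default: 1)."""
    match = re.search(r'/page/(\d+)/?', url)
    return int(match.group(1)) if match else 1


print(get_pagenumber_from_url('http://quotes.toscrape.com/page/7/'))  # 7
print(get_pagenumber_from_url('http://quotes.toscrape.com/'))         # 1
```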
