
Scrapy wait for function to finish

Is there a way to wait for the loop to finish all the scrapy.Request calls and then do something? In this case I'd like to yield the payload after the for loop, since I don't want to yield the payload on each pagination (parse_stores); I want to yield it only in the parse function, after the for loop.

import scrapy
import json
import math


class UnitsSpider(scrapy.Spider):
    """ docstrings """
    name = 'units'

    def start_requests(self):

        urls = [{
            "url":
            "https://url.com/"
        }, {
            "url":
            "https://url2.com"
        }]

        for url in urls:
            yield scrapy.Request(
                url['url'],
                meta={
                    'playwright': True,
                }
            )

    def parse(self, response):

        url = response.url

        data = response.css('script#__NEXT_DATA__::text').get()
        json_data = json.loads(data)

        total_pages = math.ceil(
            json_data['props']['pageProps']['totalStores'] / 50)

        payload = {
            'base_url': url,
            'stores': 0,
            'stores_data': []
        }

        for page in range(0, total_pages):
            next_page = f'{url}{page + 1}'

            req = response.follow(url=next_page)

            yield scrapy.Request(url=req.url, callback=self.parse_stores, meta={
                'playwright': True}, cb_kwargs={'payload': payload})

        # here, after all the requests are done, I'd like to do something

    def parse_stores(self, response, payload):
        data = response.css('script#__NEXT_DATA__::text').get()
        json_data = json.loads(data)

        stores = json_data['props']['pageProps']['cityData']['stores']

        payload['stores'] += len(stores)
        # append stores to stores_data
        payload['url'] = response.url
        yield payload

If I understood correctly, yes, you can.

You want to yield the results when the scraper closes, instead of yielding after every iteration.

In your pipelines.py you need something like this:

    def close_spider(self, spider):
        # called once when the spider has finished all of its requests
        yield {"data": self.data}

Make sure you have defined the pipeline in the ITEM_PIPELINES setting to make it work.
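
For example, in settings.py (the module path and priority value are assumptions matching the sketch above):

ITEM_PIPELINES = {
    'myproject.pipelines.AggregateStoresPipeline': 300,
}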

More information: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
