
Scrapy start_requests() didn't yield all requests

    def start_requests(self):
        db = SeedUserGenerator()
        result = db.selectSeedUsers()
        db.closeDB()
        urls = []
        for name in result:
            urls.append(self.user_info_url.format(name))
        for url in urls:
            yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
        print('fin')

    def parse_user(self, response):
        # ... ignore some code here ...
        yield Request(url=next_url, priority=20, callback=self.parse_info)

    def parse_info(self, response):
        # ... ignore some code here ...
        yield Request(url=next_url, priority=30, callback=self.parse_user)

The program runs as follows:

  1. Several Requests are yielded from start_requests, but the function seems to be paused without ever printing the string fin.
  2. A response comes back and parse_user yields another Request, but the remaining Requests in start_requests cannot be yielded until that response has been processed, so parse_user and parse_info keep feeding each other in a loop.

It seems to be synchronous: until a Request from start_requests has been sent and its response processed, no other Requests can be yielded?

Does that mean Scrapy will never yield the remaining Requests in start_requests?

How can I make Scrapy finish running start_requests first?

I'm new to Python and Scrapy. Can Scrapy process a response and yield Requests at the same time?
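
If I strip Scrapy away, the pausing itself looks like ordinary generator behaviour: the body of start_requests only runs up to the next yield each time something asks for another request, and the final print only runs once the generator is exhausted. A toy example of my own (plain Python, no Scrapy) that seems to show the same thing:

    def start_requests():
        for i in range(3):
            yield 'request {}'.format(i)   # pauses here until next() is called again
        print('fin')                       # only runs once the generator is exhausted

    gen = start_requests()
    print(next(gen))    # request 0
    print(next(gen))    # request 1
    # 'fin' has not been printed yet: the generator is suspended, not finished
    for remaining in gen:
        print(remaining)                   # request 2, then the generator prints 'fin' and stops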

By the way, I'm using Python 3.6, Scrapy 1.5.1 and Twisted 20.3.0.

I solved my problem by referring to the source code of the Scrapy engine:

    def _next_request(self, spider):
        slot = self.slot
        if not slot:
            return

        if self.paused:
            return

        while not self._needs_backout(spider):
            if not self._next_request_from_scheduler(spider):
                break

        if slot.start_requests and not self._needs_backout(spider):
            try:
                request = next(slot.start_requests)
            except StopIteration:
                slot.start_requests = None
            except Exception:
                slot.start_requests = None
                logger.error('Error while obtaining start requests',
                             exc_info=True, extra={'spider': spider})
            else:
                self.crawl(request, spider)

        if self.spider_is_idle(spider) and slot.close_if_idle:
            self._spider_idle(spider)

Here Scrapy always tries to take requests from the scheduler's queues first, rather than from start_requests.

What's more, Scrapy never puts all of the requests from start_requests into the scheduler up front; it pulls them one at a time, and only when there is room.
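
Paraphrased in plain Python (my own simplification, not Scrapy's actual code), each pass of the engine loop does roughly this:

    def engine_tick(slot, spider):
        # 1. Drain the scheduler first: keep feeding already-queued requests
        #    to the downloader until it signals back-pressure.
        while not needs_backout(spider):
            if not next_request_from_scheduler(spider):
                break

        # 2. Only then pull a *single* request from the start_requests
        #    generator; it stays suspended until the next pass has room.
        if slot.start_requests and not needs_backout(spider):
            try:
                request = next(slot.start_requests)
            except StopIteration:
                slot.start_requests = None   # generator exhausted; this is when 'fin' finally prints
            else:
                crawl(request, spider)

So while the callbacks keep refilling the scheduler with higher-priority requests, the remaining seed requests are only trickled in, at most one per pass.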

So I changed my code like this:

    def start_requests(self):
        db = SeedUserGenerator()
        result = db.selectSeedUsers()
        db.closeDB()
        urls = []
        for name in result:
            urls.append(self.user_info_url.format(name))
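        # yield only the first request; the full url list rides along in meta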
        yield Request(url=urls[0], callback=self.parse_temp, dont_filter=True, priority=10, meta={'urls': urls})

    def parse_temp(self, response):
        urls = response.meta['urls']
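        # these Requests come out of a callback, so they all go straight into the scheduler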
        for url in urls:
            print(url)
            yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
        print('fin2')

Then Scrapy puts all of the requests into the scheduler's queue first.
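
This works because requests yielded from a callback such as parse_temp do not go through the one-per-pass path above: while a response is being processed, everything the callback yields is consumed and each Request is handed to the scheduler straight away. Roughly (again my own paraphrase, not Scrapy's real method names):

    def handle_spider_output(result, spider):
        # iterate over everything the callback yielded for this response;
        # every Request goes straight into the scheduler, so parse_temp
        # queues all the seed urls in one go
        for output in result:
            if isinstance(output, Request):
                crawl(output, spider)        # enqueue in the scheduler
            else:
                process_item(output, spider)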
