Scrapy does not consume all start_urls

I have been struggling with this for quite some time and have not been able to solve it. The problem is that I have a start_urls list of a few hundred URLs, but only part of these URLs are consumed by the start_requests() of my spider.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    
    #SETTINGS
    name = 'example'
    allowed_domains = []
    start_urls = []
                
    #set rules for links to follow        
    link_follow_extractor = LinkExtractor(allow=allowed_domains, unique=True)
    rules = (Rule(link_follow_extractor, callback='parse', process_request='process_request', follow=True),)

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        
        #urls to scrape
        self.start_urls = ['https://example1.com','https://example2.com']
        self.allowed_domains = ['example1.com','example2.com']          

    def start_requests(self):
                
        #create initial requests for urls in start_urls        
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, priority=1000, meta={'priority': 100, 'start': True})
    
    def parse(self, response):
        print("parse")

I have read multiple posts on Stack Overflow about this issue, and some threads on GitHub (going all the way back to 2015), but haven't been able to get it to work.

To my understanding, the problem is that while I am still creating my initial requests, other requests have already produced responses, which are parsed and generate new requests that fill up the queue. I confirmed that this is my problem: when I use a middleware to limit the number of pages downloaded per domain to 2, the issue seems to be resolved. This would make sense, as the first requests would only generate a few new requests, leaving space in the queue for the remainder of the start_urls list.
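For reference, this is a minimal sketch of the kind of per-domain limiting middleware I mean (the class name and the MAX_PAGES_PER_DOMAIN setting are illustrative, not my actual code):

from collections import defaultdict
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class DomainPageLimitMiddleware:
    # Downloader middleware sketch: drop requests once a domain hit its page limit.

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.page_counts = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        # MAX_PAGES_PER_DOMAIN is a made-up setting name for this sketch
        return cls(crawler.settings.getint('MAX_PAGES_PER_DOMAIN', 2))

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        if self.page_counts[domain] >= self.max_pages:
            raise IgnoreRequest(f'page limit reached for {domain}')
        self.page_counts[domain] += 1
        return None  # let the request continue through the middleware chain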

I also noticed that when I reduce the number of concurrent requests from 32 to 2, an even smaller part of the start_urls list is consumed. Increasing the number of concurrent requests to a few hundred is not possible, as this leads to timeouts.
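For completeness, these are the settings involved (the values shown are just what I have been experimenting with):

# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 32    # dropping this to 2 consumed even fewer start_urls
DOWNLOAD_TIMEOUT = 180      # pushing concurrency to a few hundred runs into timeouts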

It is still unclear to me why the spider shows this behavior and doesn't just continue consuming the start_urls. I would much appreciate it if someone could give me some pointers to a potential solution for this issue.

I was struggling with the same issue: my crawler would never get past page 1 of any of the start_urls I defined.

The documentation says that the CrawlSpider class uses its own parse() internally on every response, so you should never define a custom parse, at the risk of the spider no longer working. What the documentation doesn't mention is that the parse logic used by the CrawlSpider class doesn't handle the start_urls (even though it requires the start_urls to be parsed), so the spider works initially and then fails with a "there's no parse in the callback" error when it tries to crawl the next page/start_url.

Long story short, try this (it worked for me): add a parse function for the start_urls. Like mine, it doesn't really need to do anything:

def parse(self, start_urls):
    for i in range(1, len(start_urls)):
        print('Starting to scrape page: ' + str(i))
    self.start_urls = start_urls
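
For comparison, the hook that CrawlSpider documents for handling the responses of start_urls is parse_start_url(), which lets you leave the built-in parse() untouched. A minimal sketch (the spider name, start URL and rule are placeholders, not my real ones):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SketchSpider(CrawlSpider):
    # placeholder name/start URL, just to keep the sketch self-contained
    name = 'sketch'
    start_urls = ['https://www.meugameusado.com.br/playstation/playstation-3/jogos?pagina=1']
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        # CrawlSpider calls this for each response coming from start_urls,
        # so the internal parse() used by the rules is left alone
        self.logger.info('Start URL fetched: %s', response.url)
        return []

    def parse_item(self, response):
        yield {'url': response.url}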

And here follows my entire code (the user agent is defined in the project settings):

from urllib.request import Request
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PSSpider(CrawlSpider):
    name = 'jogos'
    allowed_domains = ['meugameusado.com.br']
    start_urls = ['https://www.meugameusado.com.br/playstation/playstation-3/jogos?pagina=1', 'https://www.meugameusado.com.br/playstation/playstation-4/jogos?pagina=1',
    'https://www.meugameusado.com.br/playstation/playstation-2/jogos?pagina=1', 'https://www.meugameusado.com.br/playstation/playstation-5/jogos?pagina=1',
    'https://www.meugameusado.com.br/playstation/playstation-vita/jogos?pagina=1'] 
    
    def parse(self, start_urls):
        for i in range(1, len(start_urls)):
            print('Starting to scrape page: ' + str(i))
        self.start_urls = start_urls

    rules = (
        Rule(LinkExtractor(allow=([r'/playstation/playstation-2/jogos?pagina=[1-999]',r'/playstation/playstation-3/jogos?pagina=[1-999]',
         r'/playstation/playstation-4/jogos?pagina=[1-999]', r'/playstation/playstation-5/jogos?pagina=[1-999]', r'/playstation/playstation-vita/jogos?pagina=[1-999]', 'jogo-'])
         ,deny=('/jogos-de-','/jogos?sort=','/jogo-de-','buscar?','-mega-drive','-sega-cd','-game-gear','-xbox','-x360','-xbox-360','-xbox-series','-nes','-gc','-gbc','-snes','-n64','-3ds','-wii','switch','-gamecube','-xbox-one','-gba','-ds',r'/nintendo*', r'/xbox*', r'/classicos*',r'/raridades*',r'/outros*'))
         ,callback='parse_item'
         ,follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.css('h1.nome-produto::text').get(),
            'price': response.css('span.desconto-a-vista strong::text').get(),
            'images': response.css('span > img::attr(data-largeimg)').getall(),
            'video': response.css('#playerVideo::attr("src")').get(),
            'descricao': response.xpath('//*[@id="descricao"]/h3[contains(text(),"ESPECIFICAÇÕES")]/preceding-sibling::p/text()').getall(),
            'especificacao1': response.xpath('//*[@id="descricao"]/h3[contains(text(),"GARANTIA")]/preceding-sibling::ul/li/strong/text()').getall(),
            'especificacao2': response.xpath('//*[@id="descricao"]/h3[contains(text(),"GARANTIA")]/preceding-sibling::ul/li/text()').getall(),
            'tags': response.xpath('//*[@id="descricao"]/h3[contains(text(),"TAGS")]/following-sibling::ul/li/a/text()').getall(),
            'url': response.url,
        }
