
Scrapy does not consume all start_urls

I have been struggling with this for quite some time and have not been able to solve it. The problem is that I have a start_urls list of a few hundred URLs, but only part of these URLs are consumed by the start_requests() of my spider.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    
    #SETTINGS
    name = 'example'
    allowed_domains = []
    start_urls = []
                
    #set rules for links to follow
    link_follow_extractor = LinkExtractor(allow=allowed_domains, unique=True)
    rules = (Rule(link_follow_extractor, callback='parse', process_request='process_request', follow=True),)

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

        #urls to scrape
        self.start_urls = ['https://example1.com', 'https://example2.com']
        self.allowed_domains = ['example1.com', 'example2.com']

    def start_requests(self):

        #create initial requests for urls in start_urls
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, priority=1000, meta={'priority': 100, 'start': True})
    
    def parse(self, response):
        print("parse")

I have read multiple posts on StackOverflow about this issue, and some threads on GitHub (going all the way back to 2015), but have not been able to get it to work.

To my understanding, the problem is that while I am still creating my initial requests, other requests have already generated responses, which are parsed and create new requests that fill up the queue. I confirmed that this is my problem: when I use a middleware to limit the number of pages downloaded per domain to 2, the issue seems to be resolved. This would make sense, as the first created requests would only generate a few new requests, leaving space in the queue for the remainder of the start_urls list.
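For illustration, the per-domain cap described above could be implemented as a downloader middleware roughly like the sketch below (the class name, the MAX_PAGES_PER_DOMAIN setting, and the default of 2 are my own placeholders, not code from the question); it would need to be registered in DOWNLOADER_MIDDLEWARES:

# Hypothetical per-domain page cap as a Scrapy downloader middleware.
from collections import defaultdict
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class PerDomainPageLimitMiddleware:
    """Drop further requests once a domain has been requested `max_pages` times."""

    def __init__(self, max_pages=2):
        self.max_pages = max_pages
        self.counts = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        # MAX_PAGES_PER_DOMAIN is an assumed custom setting, not a built-in one.
        return cls(max_pages=crawler.settings.getint('MAX_PAGES_PER_DOMAIN', 2))

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        if self.counts[domain] >= self.max_pages:
            raise IgnoreRequest(f'page limit reached for {domain}')
        self.counts[domain] += 1
        return None  # let the request continue through the download chain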

I also noticed that when I reduce the concurrent requests from 32 to 2, an even smaller part of the start_urls list is consumed. Increasing the number of concurrent requests to a few hundred is not possible, as this leads to timeouts.

It is still unclear to me why the spider shows this behavior and simply does not continue consuming the start_urls. I would much appreciate it if someone could give me some pointers towards a potential solution for this issue.
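One possible explanation (not stated in the original post) is that Scrapy's scheduler pops the newest requests first by default, i.e. it crawls roughly depth-first, so links extracted from the first responses keep jumping ahead of the remaining start requests. The Scrapy FAQ documents a breadth-first configuration; a sketch of those settings, with the concurrency settings mentioned above shown at their defaults:

# settings.py sketch, assuming the starvation comes from depth-first (LIFO) scheduling.
# The three queue settings are the breadth-first configuration from the Scrapy FAQ.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# Concurrency settings referred to above (these are Scrapy's defaults).
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8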

I was struggling with the same issue: my crawler would never get past page 1 of any of the start_urls I defined.

The documentation says that the CrawlSpider class uses its own parse method internally for every response, so you should never override parse or you risk the spider not working anymore. What the documentation doesn't mention is that the parser used by the CrawlSpider class doesn't parse the start_urls (even though it requires the start_urls to be parsed), so the spider works at first and then fails with a "there's no parse in the callback" error when it tries to crawl the next page/start_url.

Long story short, try doing this (it worked for me): add a parse function for the start_urls. Like mine, it doesn't really need to do anything:

def parse(self, start_urls):
    for i in range(1, len(start_urls)):
        print('Starting to scrape page: ' + str(i))
    self.start_urls = start_urls
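As an aside that is not part of the original answer: CrawlSpider also exposes a documented parse_start_url() hook that is called for the responses to the start_urls, which lets you handle them without overriding parse at all. A minimal sketch (spider name, domain, and rule pattern are placeholders):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['https://example.com/jogos?pagina=1']

    rules = (
        Rule(LinkExtractor(allow=(r'pagina=\d+',)), callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # Called for responses to the start_urls; reuse the same item callback.
        return self.parse_item(response)

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}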

And here follows my entire code (the user agent is defined in the project settings):

from urllib.request import Request
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PSSpider(CrawlSpider):
    name = 'jogos'
    allowed_domains = ['meugameusado.com.br']
    start_urls = ['https://www.meugameusado.com.br/playstation/playstation-3/jogos?pagina=1', 'https://www.meugameusado.com.br/playstation/playstation-4/jogos?pagina=1',
    'https://www.meugameusado.com.br/playstation/playstation-2/jogos?pagina=1', 'https://www.meugameusado.com.br/playstation/playstation-5/jogos?pagina=1',
    'https://www.meugameusado.com.br/playstation/playstation-vita/jogos?pagina=1'] 
    
    def parse(self, start_urls):
        for i in range(1, len(start_urls)):
            print('Starting to scrape page: ' + str(i))
        self.start_urls = start_urls

    rules = (
        Rule(
            LinkExtractor(
                allow=[r'/playstation/playstation-2/jogos?pagina=[1-999]', r'/playstation/playstation-3/jogos?pagina=[1-999]',
                       r'/playstation/playstation-4/jogos?pagina=[1-999]', r'/playstation/playstation-5/jogos?pagina=[1-999]',
                       r'/playstation/playstation-vita/jogos?pagina=[1-999]', 'jogo-'],
                deny=('/jogos-de-', '/jogos?sort=', '/jogo-de-', 'buscar?', '-mega-drive', '-sega-cd', '-game-gear', '-xbox', '-x360',
                      '-xbox-360', '-xbox-series', '-nes', '-gc', '-gbc', '-snes', '-n64', '-3ds', '-wii', 'switch', '-gamecube',
                      '-xbox-one', '-gba', '-ds', r'/nintendo*', r'/xbox*', r'/classicos*', r'/raridades*', r'/outros*')),
            callback='parse_item',
            follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.css('h1.nome-produto::text').get(),
            'price': response.css('span.desconto-a-vista strong::text').get(),
            'images': response.css('span > img::attr(data-largeimg)').getall(),
            'video': response.css('#playerVideo::attr("src")').get(),
            'descricao': response.xpath('//*[@id="descricao"]/h3[contains(text(),"ESPECIFICAÇÕES")]/preceding-sibling::p/text()').getall(),
            'especificacao1': response.xpath('//*[@id="descricao"]/h3[contains(text(),"GARANTIA")]/preceding-sibling::ul/li/strong/text()').getall(),
            'especificacao2': response.xpath('//*[@id="descricao"]/h3[contains(text(),"GARANTIA")]/preceding-sibling::ul/li/text()').getall(),
            'tags': response.xpath('//*[@id="descricao"]/h3[contains(text(),"TAGS")]/following-sibling::ul/li/a/text()').getall(),
            'url': response.url,
        }
