
Scrapy, limit on start_urls

I am wondering whether there is a limit on the number of start_urls I can assign to my spider. As far as I've searched, there seems to be no documentation of any limit on the size of the list.

Currently I have set up my spider so that the list of start_urls is read in from a CSV file. The number of URLs is around 1,000,000.
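For reference, the loading looks roughly like this (a minimal sketch; the filename urls.csv and the one-URL-per-row layout are placeholders):

import csv

from scrapy import Request, Spider


class MySpider(Spider):
    name = "spider"

    def start_requests(self):
        # Assumes a one-column CSV with one URL per row.
        with open("urls.csv", newline="") as f:
            for row in csv.reader(f):
                yield Request(url=row[0])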

There isn't a limit per se, but you probably want to impose one yourself, otherwise you might end up with memory problems.
What can happen is that all 1M URLs get queued in Scrapy's scheduler at once, and since Python objects are quite a bit heavier than plain strings, you'll end up running out of memory.
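To get a feel for the overhead, you can compare a bare URL string against a Request object. This is only a rough sketch using sys.getsizeof, which counts just the top-level object, so treat the numbers as lower bounds that vary by Python and Scrapy version:

import sys

from scrapy import Request

url = "https://example.com/some/long/product/page?id=12345"
req = Request(url)

print(sys.getsizeof(url))  # the raw string: tens of bytes
# The Request shell plus its attribute dict is already several times larger,
# before counting headers, meta and callback references it holds.
print(sys.getsizeof(req) + sys.getsizeof(req.__dict__))

Multiplied by a million scheduled requests, that overhead adds up to a significant share of available memory.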

To avoid this, you can feed your start URLs in batches using the spider_idle signal:

import logging

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class MySpider(Spider):
    name = "spider"
    batch_size = 10000

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.crawler = crawler
        self.urls = []  # read from file

    def start_requests(self):
        # Yield at most batch_size requests; the rest stay buffered in self.urls.
        batch, self.urls = self.urls[:self.batch_size], self.urls[self.batch_size:]
        for url in batch:
            yield Request(url)

    def parse(self, response):
        # parse the response here
        pass

    def idle_consume(self):
        """
        Every time the spider is about to close, check our urls
        buffer for anything left to crawl.
        """
        if not self.urls:
            return
        logging.info('Consuming batch')
        for req in self.start_requests():
            # engine.schedule() matches older Scrapy; on Scrapy 2.x
            # use self.crawler.engine.crawl(req) instead.
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
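
The trick here is DontCloseSpider: the spider_idle signal fires when the engine has no pending requests left, and raising DontCloseSpider from the handler tells Scrapy to stay alive instead of shutting down, so each idle period schedules the next batch. Once self.urls is drained, the handler returns normally and the spider closes as usual. One caveat: engine.schedule() comes from older Scrapy releases and may be deprecated or removed in yours; on Scrapy 2.x, self.crawler.engine.crawl(req) is the equivalent call.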
