
The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider

Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list: about 3,000,000 URLs in a file. So I build start_urls like this:

import codecs

def read_urls_from_file(file_path):
    # Lazily yield one stripped URL per line of the file.
    with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
        for line in f:
            try:
                url = line.strip()
                yield url
            except Exception:
                print u"read line:%s from file failed!" % line
                continue
    print u"file read finish!"

start_urls = read_urls_from_file(u"XXXX")

Meanwhile, my spider's callback functions are like this:

def parse(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.baidu.com", callback=self.just_test1)

def just_test1(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.163.com", callback=self.just_test2)

def just_test2(self, response):
    self.log("Visited %s" % response.url)
    return []

My questions are:

  1. What is the order of the URLs used by the downloader? Will the requests made by just_test1 and just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems the answer is No.)
  2. What decides the order? Why is the order like this, and how does it come about? How can we control it?
  3. Is this a good way to deal with so many URLs that are already in a file? What else could I do?

Thank you very much!!!

Thanks for the answers. But I am still a bit confused: by default, Scrapy uses a LIFO queue for storing pending requests.

  1. The requests made by the spiders' callback functions are given to the scheduler. Who does the same thing for the start_urls requests? The spider's start_requests() function only generates an iterator without producing the real requests.
  2. Will all the requests (from start_urls and from callbacks) be in the same request queue? How many queues are there in Scrapy?

First of all, please see this thread - I think you'll find all the answers there.

What is the order of the URLs used by the downloader? Will the requests made by just_test1 and just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems the answer is No.)

You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)

What decides the order? Why is the order like this, and how does it come about? How can we control it?

By default, there is no pre-defined order - you cannot know when the Requests from make_requests_from_url will arrive - it's asynchronous.

See this answer on how you may control the order. Long story short, you can override start_requests and give the yielded Requests an explicit priority (Scrapy's Request constructor accepts a priority argument, and requests with a higher priority value are scheduled earlier). For example, the priority can be derived from the line number where the URL was found.
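As a related side note (not part of the original answer): if what you actually want is a roughly first-in-first-out crawl instead of the default LIFO behaviour, Scrapy also lets you switch the scheduler queues in settings.py. This is only a sketch based on the settings documented in the Scrapy FAQ; the exact class paths assume a Scrapy version that ships the scrapy.squeues module, so check them against the version you run:

# settings.py - switch the scheduler from LIFO (depth-first) to FIFO (breadth-first)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'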

Is this a good way to deal with so many URLs that are already in a file? What else could I do?

I think you should read your file and yield the URLs directly in the start_requests method: see this answer.

So, you should do something like this:

import codecs
from scrapy.http import Request

def start_requests(self):
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            url = line.strip()
            if not url:
                # skip blank or otherwise unusable lines
                continue
            # requests with a higher priority are scheduled earlier,
            # so negate the line number to keep the file's order
            yield Request(url, priority=-index)
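For completeness, here is a minimal sketch of how the pieces above could fit into one spider, assuming a Scrapy version that exposes scrapy.Spider; the class name UrlFileSpider and the path urls.txt are made up for illustration:

# -*- coding: utf-8 -*-
import codecs
import scrapy

class UrlFileSpider(scrapy.Spider):
    name = "url_file_spider"     # hypothetical spider name
    file_path = u"urls.txt"      # hypothetical path to the 3,000,000-line URL file

    def start_requests(self):
        # read the file lazily and hand one Request per line to the scheduler
        with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
            for index, line in enumerate(f):
                url = line.strip()
                if not url:
                    continue
                yield scrapy.Request(url, priority=-index)

    def parse(self, response):
        self.log("Visited %s" % response.url)

You could then run it with the usual scrapy crawl url_file_spider command.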

Hope that helps.
