
The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider

Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list: about 3,000,000 URLs in a file. So I build start_urls like this:

import codecs

def read_urls_from_file(file_path):
    # Lazily yield one stripped URL per line of the file.
    with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
        for line in f:
            try:
                url = line.strip()
                yield url
            except Exception:
                print u"read line:%s from file failed!" % line
                continue
    print u"file read finish!"

start_urls = read_urls_from_file(u"XXXX")

Meanwhile, my spider's callback functions are like this:

def parse(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.baidu.com", callback=self.just_test1)

def just_test1(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.163.com", callback=self.just_test2)

def just_test2(self, response):
    self.log("Visited %s" % response.url)
    return []

My questions are:

  1. What is the order of the URLs used by the downloader? Will the requests made by just_test1 and just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems the answer is No.)
  2. What decides the order? Why is the order like this, and how does it come about? How can we control it?
  3. Is this a good way to deal with so many URLs that are already in a file? What else could I do?

Thank you very much!!!

Thanks for the answers. But I am still a bit confused: by default, Scrapy uses a LIFO queue for storing pending requests.

  1. The requests made by the spiders' callback functions are given to the scheduler. Who does the same thing for the start_urls requests? The spider's start_requests() function only generates an iterator without producing the real requests.
  2. Will all the requests (from start_urls and from callbacks) be in the same request queue? How many queues are there in Scrapy?

First of all, please see this thread - I think you'll find all the answers there.

What is the order of the URLs used by the downloader? Will the requests made by just_test1 and just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems the answer is No.)

You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)

What decides the order? Why is the order like this, and how does it come about? How can we control it?

By default, there is no pre-defined order - you cannot know when the Requests from make_requests_from_url will arrive - it's asynchronous.

See this answer on how you may control the order. Long story short, you can override start_requests and give the yielded Requests an explicit priority (Scrapy's Request constructor accepts a priority argument, and requests with a higher priority value are scheduled earlier). For example, the priority can be derived from the line number where the URL was found.
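As a related side note (not part of the original answer): if what you actually want is a roughly first-in-first-out crawl instead of the default LIFO behaviour, Scrapy also lets you switch the scheduler queues in settings.py. This is only a sketch based on the settings documented in the Scrapy FAQ; the exact class paths assume a Scrapy version that ships the scrapy.squeues module, so check them against the version you run:

# settings.py - switch the scheduler from LIFO (depth-first) to FIFO (breadth-first)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'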

Is this a good way to deal with so many URLs that are already in a file? What else could I do?

I think you should read your file and yield the URLs directly in the start_requests method: see this answer.

So, you should do something like this:

import codecs
from scrapy.http import Request

def start_requests(self):
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            url = line.strip()
            if not url:
                # skip blank or otherwise unusable lines
                continue
            # requests with a higher priority are scheduled earlier,
            # so negate the line number to keep the file's order
            yield Request(url, priority=-index)
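For completeness, here is a minimal sketch of how the pieces above could fit into one spider, assuming a Scrapy version that exposes scrapy.Spider; the class name UrlFileSpider and the path urls.txt are made up for illustration:

# -*- coding: utf-8 -*-
import codecs
import scrapy

class UrlFileSpider(scrapy.Spider):
    name = "url_file_spider"     # hypothetical spider name
    file_path = u"urls.txt"      # hypothetical path to the 3,000,000-line URL file

    def start_requests(self):
        # read the file lazily and hand one Request per line to the scheduler
        with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
            for index, line in enumerate(f):
                url = line.strip()
                if not url:
                    continue
                yield scrapy.Request(url, priority=-index)

    def parse(self, response):
        self.log("Visited %s" % response.url)

You could then run it with the usual scrapy crawl url_file_spider command.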

Hope that helps.
