The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider
Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list - about 3,000,000 URLs in a file. So I build start_urls like this:
import codecs

def read_urls_from_file(file_path):
    # The file is GB18030-encoded; yield one stripped URL per line.
    with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
        for line in f:
            try:
                url = line.strip()
                yield url
            except Exception:
                print u"read line:%s from file failed!" % line
                continue
    print u"file read finished!"

start_urls = read_urls_from_file(u"XXXX")
Meanwhile, my spider's callback functions are like this:
from scrapy import Request

def parse(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.baidu.com", callback=self.just_test1)

def just_test1(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.163.com", callback=self.just_test2)

def just_test2(self, response):
    self.log("Visited %s" % response.url)
    return []
My questions are:

1. What is the order of the URLs used by the downloader? Will the requests made by just_test1 and just_test2 be used by the downloader only after all of the start_urls are used? (I have made some tests, and it seems that the answer is No.)
2. What decides the order? Why and how is this the order? How can we control it?
3. Is this a good way to deal with so many URLs which are already in a file? What else?

Thank you very much!!!
Thanks for the answers. But I am still a bit confused. By default, Scrapy uses a LIFO queue for storing pending requests.

1. The requests made by the spider's callback functions will be given to the scheduler. Who does the same thing for the start_urls' requests? The spider's start_requests() function only generates an iterator without giving the real requests.
2. Will all the requests (start_urls' and callbacks') be in the same request queue? How many queues are there in Scrapy?

First of all, please see this thread - I think you'll find all the answers there.
the order of the urls used by downloader? Will the requests made by just_test1, just_test2 be used by downloader only after all the start_urls are used? (I have made some tests, it seems that the answer is No)
You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):
def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)
What decides the order? Why and how is this the order? How can we control it?
By default, there is no pre-defined order - you cannot know when Requests from make_requests_from_url will arrive - it's asynchronous.
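That said, "asynchronous" does not mean random: pending requests wait in the scheduler's queues, which are LIFO by default (this is why Scrapy tends to crawl depth-first, and it also answers the "how many queues" question: one scheduler, with in-memory queues plus optional on-disk queues when JOBDIR is set). If you prefer breadth-first order, the Scrapy FAQ suggests switching to FIFO queues; a minimal sketch of the relevant settings:

# settings.py - breadth-first (FIFO) crawl order, as suggested in the Scrapy FAQ
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'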
See this answer on how you may control the order. Long story short, you can override start_requests and mark the yielded Requests with a priority key (like yield Request(url, meta={'priority': 0})). For example, the value of priority can be the line number where the url was found.
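As an aside, the meta={'priority': ...} trick follows the linked answer; Scrapy's Request also takes a priority keyword argument directly, and the scheduler dequeues higher-priority requests first. A minimal sketch of that variant, negating the line index so URLs come out roughly in file order:

from scrapy import Request

def start_requests(self):
    for index, url in enumerate(self.start_urls):
        # Higher priority runs first, so earlier lines get higher priority.
        yield Request(url, priority=-index)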
Is this a good way to deal with so many urls which are already in a file? What else?
I think you should read your file and yield the urls directly in the start_requests method: see this answer.
So, you should do something like this:
import codecs
from scrapy import Request

def start_requests(self):
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            try:
                url = line.strip()
                # Earlier lines in the file get a lower index.
                yield Request(url, meta={'priority': index})
            except Exception:
                continue
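For completeness, here is a self-contained sketch of how the pieces might fit together; the spider name and file path are hypothetical placeholders:

import codecs
from scrapy import Spider, Request

class UrlFileSpider(Spider):
    name = "urls_from_file"   # hypothetical name
    file_path = "urls.txt"    # hypothetical path; the file is GB18030-encoded

    def start_requests(self):
        # Read the file lazily so 3,000,000 URLs are never all in memory.
        with codecs.open(self.file_path, "r", encoding="GB18030") as f:
            for index, line in enumerate(f):
                url = line.strip()
                if url:
                    # Negative priority keeps roughly the file order
                    # (the scheduler dequeues higher priority first).
                    yield Request(url, priority=-index, callback=self.parse)

    def parse(self, response):
        self.log("Visited %s" % response.url)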
Hope that helps.