How to pass multiple arguments to Scrapy spider (getting error running 'scrapy crawl' with more than one spider is no longer supported)?
Running more than one spider in a for loop
I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me the error ReactorNotRestartable.
feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }
}
With the items above, I try to run both spiders in a loop, like this:
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):

    name = None

    def __init__(self, **kwargs):
        this_feed = feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass

def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })
    for feed_name in feeds.keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()
The exception on the second loop iteration looks like this: the spider is opened, but then:
...
2015-11-22 00:00:00 [scrapy] INFO: Spider opened
2015-11-22 00:00:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-22 00:00:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-21 23:54:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Traceback (most recent call last):
  File "env/bin/start_crawler", line 9, in <module>
    load_entry_point('feed-crawler==0.0.1', 'console_scripts', 'start_crawler')()
  File "/Users/bling/py-feeds-crawler/feed_crawler/crawl.py", line 51, in start_crawler
    process.start() # the script will block here until the crawling is finished
  File "/Users/bling/py-feeds-crawler/env/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Do I have to invalidate the first MySpider somehow, or am I doing something wrong and need to change how this works? Thanks in advance.
It looks like you have to instantiate one process per spider; try:
def start_crawler():
    for feed_name in feeds.keys():
        process = CrawlerProcess({
            'USER_AGENT': CONFIG['USER_AGENT'],
            'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
        })
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()
The solution is to collect the spiders in the loop and start the process once, at the end. My guess is that this has to do with how the Reactor is allocated and released.
def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # disable for issues with boto
    })
    for feed_name in CONFIG['Feeds'].keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
    process.start()
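The underlying reason can be illustrated without Scrapy: Twisted's reactor is a one-shot event loop per process, so a second start raises the exception seen above. Below is a toy sketch (the `OneShotReactor` class is a hypothetical stand-in for illustration, not Twisted's actual implementation) of why queueing every spider first and starting once succeeds, while starting inside the loop fails:

```python
class ReactorNotRestartable(Exception):
    pass

class OneShotReactor:
    """Stand-in for Twisted's reactor: it can only be run once per process."""
    def __init__(self):
        self._has_run = False
        self._jobs = []

    def schedule(self, job):
        # like process.crawl(): only queues work, runs nothing yet
        self._jobs.append(job)

    def run(self):
        # like process.start(): a second call fails
        if self._has_run:
            raise ReactorNotRestartable()
        self._has_run = True
        return [job() for job in self._jobs]

reactor = OneShotReactor()

# collect-then-start: queue every "spider", then run once
for feed_name in ('nasa', 'xkcd'):
    reactor.schedule(lambda name=feed_name: 'crawled ' + name)
print(reactor.run())   # ['crawled nasa', 'crawled xkcd']

# a second run(), like process.start() inside the loop, fails
try:
    reactor.run()
except ReactorNotRestartable:
    print('ReactorNotRestartable')
```

This mirrors the accepted pattern: `process.crawl()` may be called many times, `process.start()` only once.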
Thanks @eLRuLL for your answer, it inspired me to find this solution.
You can also pass params in the crawl call and use them during parsing.
class MySpider(XMLFeedSpider):

    def __init__(self, name, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        self.name = name

def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })
    for feed_name in feeds.keys():
        process.crawl(MySpider, feed_name)
    process.start()
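The key point is that `process.crawl(MySpider, feed_name)` forwards any extra positional and keyword arguments to the spider's constructor. A minimal stand-in (a plain class, no Scrapy; the `start_urls` kwarg is an illustrative assumption) showing how the argument lands in `__init__`:

```python
class MySpider:
    """Plain stand-in for the XMLFeedSpider subclass above."""
    def __init__(self, name, **kwargs):
        self.name = name
        self.start_urls = kwargs.get('start_urls', [])

# roughly what process.crawl(MySpider, feed_name, ...) does internally:
spider = MySpider('nasa', start_urls=['https://www.nasa.gov/rss/dyn/breaking_news.rss'])
print(spider.name)        # nasa
print(spider.start_urls)  # ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
```

Each queued spider therefore gets its own `name` and settings, and `process.start()` still runs only once.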