
Running more than one spider in a for loop

I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me the error ReactorNotRestartable.

feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }    
}

With the feeds dict above, I try to run the two spiders in a loop, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):

    name = None

    def __init__(self, **kwargs):

        this_feed = feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass


def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
    })

    for feed_name in feeds.keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()  # blocks; on the second iteration the reactor cannot be restarted

The exception on the second loop iteration looks like this: the spider opens, but then:

...
2015-11-22 00:00:00 [scrapy] INFO: Spider opened
2015-11-22 00:00:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-22 00:00:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-21 23:54:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Traceback (most recent call last):
  File "env/bin/start_crawler", line 9, in <module>
    load_entry_point('feed-crawler==0.0.1', 'console_scripts', 'start_crawler')()
  File "/Users/bling/py-feeds-crawler/feed_crawler/crawl.py", line 51, in start_crawler
    process.start() # the script will block here until the crawling is finished
  File "/Users/bling/py-feeds-crawler/env/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Do I have to invalidate the first MySpider somehow, or am I doing something else wrong that I need to change for this to work? Thanks in advance.

It looks like you have to instantiate one process per spider; try:

def start_crawler():      

    for feed_name in feeds.keys():
        process = CrawlerProcess({
            'USER_AGENT': CONFIG['USER_AGENT'],
            'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
        })
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start() 
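
Note that Twisted's reactor is a process-wide singleton, so even a fresh CrawlerProcess per iteration can still raise ReactorNotRestartable once reactor.run() has been called in that process. A minimal sketch of a per-spider-process workaround, assuming the same feeds, CONFIG and MySpider as in the question:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess

def run_spider(feed_name):
    # each child process gets its own fresh Twisted reactor,
    # so reactor.run() is only ever called once per process
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })
    MySpider.name = feed_name
    process.crawl(MySpider)
    process.start()

def start_crawler():
    for feed_name in feeds.keys():
        p = Process(target=run_spider, args=(feed_name,))
        p.start()
        p.join()  # drop join() to let the feeds crawl in parallel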

The solution was to collect the spiders in the loop and start the process once, at the end. My guess is that this has to do with reactor allocation/deallocation.

def start_crawler():

    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None} # disable for issues with boto
    })

    for feed_name in CONFIG['Feeds'].keys():
        MySpider.name = feed_name
        process.crawl(MySpider)

    process.start()

Thanks to @eLRuLL for his answer, which inspired me to find this solution.
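
For completeness, Scrapy also documents running several spiders on one manually managed reactor via CrawlerRunner. A minimal sketch, again assuming the same feeds, CONFIG and MySpider as above:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def start_crawler():
    configure_logging()  # CrawlerRunner does not set up logging by itself
    runner = CrawlerRunner({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })

    for feed_name in feeds.keys():
        MySpider.name = feed_name
        runner.crawl(MySpider)

    d = runner.join()                    # fires when every crawl has finished
    d.addBoth(lambda _: reactor.stop())  # stop the reactor exactly once
    reactor.run()                        # blocks until all spiders are done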

You can also send params in crawl() and use them during parsing.

class MySpider(XMLFeedSpider):
    def __init__(self, name, **kwargs):
        super(MySpider, self).__init__(**kwargs)

        self.name = name


def start_crawler():      
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
    })

    for feed_name in feeds.keys():
        process.crawl(MySpider, feed_name)

    process.start()
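
A sketch of how the passed name can then drive parsing; parse_node and the title XPath below are illustrative, assuming the feeds dict from the question is in scope and that the RSS feeds wrap entries in <item> nodes:

class MySpider(XMLFeedSpider):
    iterator = 'iternodes'
    itertag = 'item'  # RSS 2.0 entries live in <item> nodes

    def __init__(self, name, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        self.name = name
        self.start_urls = feeds[name]['start_urls']

    def parse_node(self, response, node):
        # the param sent via process.crawl() is available on the instance
        yield {
            'feed': self.name,
            'title': node.xpath('title/text()').extract_first(),
        }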
