
Running multiple spiders in Scrapy for 1 website in parallel?

I want to crawl a website that has two parts, and my script is not as fast as I need it to be.

Is it possible to launch two spiders, one for scraping the first part and a second one for the second part?

I tried to have two different classes and run them with:

scrapy crawl firstSpider
scrapy crawl secondSpider

but I don't think this is a smart approach.

I read the documentation of scrapyd, but I don't know if it is a good fit for my case.
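
For reference, scrapyd exposes an HTTP API for scheduling spiders, so the two spiders could also be queued as separate jobs that scrapyd runs in parallel processes. A minimal sketch, assuming a scrapyd server on the default port 6800 and a deployed project named myproject (both placeholders):

import requests

# Schedule both spiders as separate scrapyd jobs; scrapyd runs them as parallel processes.
for spider in ("firstSpider", "secondSpider"):
    response = requests.post(
        "http://localhost:6800/schedule.json",            # default scrapyd endpoint
        data={"project": "myproject", "spider": spider},  # "myproject" is a placeholder project name
    )
    print(response.json())  # e.g. {"status": "ok", "jobid": "..."}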

I think what you are looking for is something like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
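
With this approach you run the script directly with Python (e.g. python run_spiders.py, a hypothetical file name) instead of scrapy crawl; CrawlerProcess starts the Twisted reactor for you and runs both spiders concurrently in the same process.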

You can read more at: running-multiple-spiders-in-the-same-process.

Or you can run it like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)

# process.spiders is the project's SpiderLoader in Scrapy 1.3.x (deprecated in 1.4)
for spider_name in process.spiders.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument passed to the spider

process.start()
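
The keyword arguments given to process.crawl() are forwarded to each spider's __init__, so with the default Spider implementation the value would be available inside the spider as self.query.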

A better solution (if you have multiple spiders) is to get the spiders dynamically and run them:

from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks


settings = project.get_project_settings()
runner = CrawlerRunner(settings)


@inlineCallbacks
def crawl():
    # Load every spider registered in the project and crawl them one after another
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for my_spider in classes:
        yield runner.crawl(my_spider)  # wait for each crawl to finish before starting the next
    reactor.stop()


crawl()
reactor.run()  # the script blocks here until reactor.stop() is called
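
Note that because each yield waits for the previous crawl to finish, this variant runs the spiders one after another. If you want them to run in parallel with CrawlerRunner, a minimal sketch (following the pattern in the Scrapy docs) is to start all crawls first and stop the reactor once they have all finished:

from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor

settings = project.get_project_settings()
runner = CrawlerRunner(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

# Start every spider without waiting, so they all run concurrently in the same reactor
for name in spider_loader.list():
    runner.crawl(spider_loader.load(name))

d = runner.join()                    # deferred that fires when all crawls are done
d.addBoth(lambda _: reactor.stop())  # stop the reactor once everything has finished
reactor.run()                        # blocks until reactor.stop() is called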

(Second solution): Because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution should be converted to something like:

from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)
process.start()
