
Running multiple spiders in Scrapy for 1 website in parallel?

I want to crawl a website that has two parts, and my script is not as fast as I need it to be.

Is it possible to launch two spiders, one for scraping the first part and a second one for the second part?

I tried to have two different classes and run them with:

scrapy crawl firstSpider
scrapy crawl secondSpider

but I don't think this is a smart approach.

I read the documentation of scrapyd, but I don't know if it is a good fit for my case.
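
For reference, scrapyd exposes an HTTP API for scheduling spiders, so the two spiders could also be queued as separate jobs that scrapyd runs in parallel processes. A minimal sketch, assuming a scrapyd server on the default port 6800 and a deployed project named myproject (both placeholders):

import requests

# Schedule both spiders as separate scrapyd jobs; scrapyd runs them as parallel processes.
for spider in ("firstSpider", "secondSpider"):
    response = requests.post(
        "http://localhost:6800/schedule.json",            # default scrapyd endpoint
        data={"project": "myproject", "spider": spider},  # "myproject" is a placeholder project name
    )
    print(response.json())  # e.g. {"status": "ok", "jobid": "..."}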

I think what you are looking for is something like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
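
With this approach you run the script directly with Python (e.g. python run_spiders.py, a hypothetical file name) instead of scrapy crawl; CrawlerProcess starts the Twisted reactor for you and runs both spiders concurrently in the same process.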

You can read more at: running-multiple-spiders-in-the-same-process.

Or you can run it like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)

# process.spiders is the project's SpiderLoader in Scrapy 1.3.x (deprecated in 1.4)
for spider_name in process.spiders.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument passed to the spider

process.start()
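
The keyword arguments given to process.crawl() are forwarded to each spider's __init__, so with the default Spider implementation the value would be available inside the spider as self.query.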

A better solution (if you have multiple spiders) is to get the spiders dynamically and run them:

from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks


settings = project.get_project_settings()
runner = CrawlerRunner(settings)


@inlineCallbacks
def crawl():
    # Load every spider registered in the project and crawl them one after another
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for my_spider in classes:
        yield runner.crawl(my_spider)  # wait for each crawl to finish before starting the next
    reactor.stop()


crawl()
reactor.run()  # the script blocks here until reactor.stop() is called
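
Note that because each yield waits for the previous crawl to finish, this variant runs the spiders one after another. If you want them to run in parallel with CrawlerRunner, a minimal sketch (following the pattern in the Scrapy docs) is to start all crawls first and stop the reactor once they have all finished:

from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor

settings = project.get_project_settings()
runner = CrawlerRunner(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

# Start every spider without waiting, so they all run concurrently in the same reactor
for name in spider_loader.list():
    runner.crawl(spider_loader.load(name))

d = runner.join()                    # deferred that fires when all crawls are done
d.addBoth(lambda _: reactor.stop())  # stop the reactor once everything has finished
reactor.run()                        # blocks until reactor.stop() is called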

(Second solution): Because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution should be converted to something like:

from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)
process.start()
