How to run Scrapy CrawlerProcess in parallel in separate processes? (Multiprocessing)

I am trying to run my spider with multiprocessing. I know CrawlerProcess runs the spider in a single process.

I want to run the same spider multiple times with different arguments.

I tried this, but it doesn't work.

How do I do multiprocessing?

Please help. Thanks.

from scrapy.utils.project import get_project_settings
import multiprocessing
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings=get_project_settings())
process.crawl(Spider, data=all_batches[0])

process1 = CrawlerProcess(settings=get_project_settings())
process1.crawl(Spider, data=all_batches[1])

p1 = multiprocessing.Process(target=process.start())
p2 = multiprocessing.Process(target=process1.start())

p1.start()
p2.start()

You need to run each Scrapy crawler instance inside a separate process. This is because Scrapy uses Twisted, and the Twisted reactor cannot be started more than once in the same process.

Also, you need to disable the telnet console extension, because otherwise each process will try to bind to the same port.
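
The test code below disables it per crawler, but the same setting can also be applied project-wide; a minimal sketch, assuming a standard Scrapy project layout with a settings.py:

# settings.py
# Disable the telnet console so multiple crawler processes
# don't compete for the same port.
TELNETCONSOLE_ENABLED = False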

Test code:

import scrapy
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            print('my_data -> ', self.settings['my_data'])
            yield {'title': title.css('a ::text').get()}

def start_spider(spider, settings: dict = None, data: dict = None):
    # Merge the caller's settings with per-run data and disable the telnet console.
    all_settings = {**(settings or {}), 'my_data': data or {}, 'TELNETCONSOLE_ENABLED': False}

    def crawler_func():
        # Each child process gets its own CrawlerProcess (and its own Twisted reactor).
        crawler_process = CrawlerProcess(all_settings)
        crawler_process.crawl(spider)
        crawler_process.start()

    process = Process(target=crawler_func)
    process.start()
    return process

if __name__ == '__main__':
    # Start both crawls, then wait for them to finish.
    # (A bare map() would be lazy in Python 3 and never actually call join().)
    processes = [
        start_spider(TestSpider, data={'data': 'test_1'}),
        start_spider(TestSpider, data={'data': 'test_2'}),
    ]
    for p in processes:
        p.join()
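
If you prefer to pass the batch data as spider arguments (as in the question) rather than through the settings, keyword arguments given to crawl() are forwarded to the spider's constructor and become attributes on it, so the spider reads self.data instead of self.settings['my_data']. A minimal sketch along the same lines, where start_spider_with_args and the all_batches placeholder are illustrative names, not part of the original answer:

def start_spider_with_args(spider, data=None):
    def crawler_func():
        crawler_process = CrawlerProcess({'TELNETCONSOLE_ENABLED': False})
        # Keyword arguments to crawl() are passed on to the spider,
        # so inside the spider this batch is available as self.data.
        crawler_process.crawl(spider, data=data)
        crawler_process.start()
    process = Process(target=crawler_func)
    process.start()
    return process

if __name__ == '__main__':
    all_batches = [{'data': 'test_1'}, {'data': 'test_2'}]  # placeholder batches
    processes = [start_spider_with_args(TestSpider, data=batch) for batch in all_batches]
    for p in processes:
        p.join()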
