How to run Scrapy CrawlerProcess in parallel in separate processes? (Multiprocessing)

I am trying to run my spider with multiprocessing. I know CrawlerProcess runs the spider in a single process.

I want to run the same spider multiple times with different arguments.

I tried this, but it doesn't work.

How do I do multiprocessing?

Please help. Thanks.

from scrapy.utils.project import get_project_settings
import multiprocessing
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings=get_project_settings())
process.crawl(Spider, data=all_batches[0])

process1 = CrawlerProcess(settings=get_project_settings())
process1.crawl(Spider, data=all_batches[1])

p1 = multiprocessing.Process(target=process.start())
p2 = multiprocessing.Process(target=process1.start())

p1.start()
p2.start()

You need to run each Scrapy crawler instance inside a separate process. This is because Scrapy uses Twisted, and the Twisted reactor cannot be started more than once in the same process.

Also, you need to disable the telnet console extension, because otherwise each process will try to bind to the same port.
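
The test code below disables it per crawler, but the same setting can also be applied project-wide; a minimal sketch, assuming a standard Scrapy project layout with a settings.py:

# settings.py
# Disable the telnet console so multiple crawler processes
# don't compete for the same port.
TELNETCONSOLE_ENABLED = False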

Test code:

import scrapy
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            print('my_data -> ', self.settings['my_data'])
            yield {'title': title.css('a ::text').get()}

def start_spider(spider, settings: dict = None, data: dict = None):
    # Merge the caller's settings with per-run data and disable the telnet console.
    all_settings = {**(settings or {}), 'my_data': data or {}, 'TELNETCONSOLE_ENABLED': False}

    def crawler_func():
        # Each child process gets its own CrawlerProcess (and its own Twisted reactor).
        crawler_process = CrawlerProcess(all_settings)
        crawler_process.crawl(spider)
        crawler_process.start()

    process = Process(target=crawler_func)
    process.start()
    return process

if __name__ == '__main__':
    # Start both crawls, then wait for them to finish.
    # (A bare map() would be lazy in Python 3 and never actually call join().)
    processes = [
        start_spider(TestSpider, data={'data': 'test_1'}),
        start_spider(TestSpider, data={'data': 'test_2'}),
    ]
    for p in processes:
        p.join()
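
If you prefer to pass the batch data as spider arguments (as in the question) rather than through the settings, keyword arguments given to crawl() are forwarded to the spider's constructor and become attributes on it, so the spider reads self.data instead of self.settings['my_data']. A minimal sketch along the same lines, where start_spider_with_args and the all_batches placeholder are illustrative names, not part of the original answer:

def start_spider_with_args(spider, data=None):
    def crawler_func():
        crawler_process = CrawlerProcess({'TELNETCONSOLE_ENABLED': False})
        # Keyword arguments to crawl() are passed on to the spider,
        # so inside the spider this batch is available as self.data.
        crawler_process.crawl(spider, data=data)
        crawler_process.start()
    process = Process(target=crawler_func)
    process.start()
    return process

if __name__ == '__main__':
    all_batches = [{'data': 'test_1'}, {'data': 'test_2'}]  # placeholder batches
    processes = [start_spider_with_args(TestSpider, data=batch) for batch in all_batches]
    for p in processes:
        p.join()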
