How to run Scrapy CrawlerProcess in parallel in separate processes? (multiprocessing)
I am trying to run my spider with multiprocessing. I know CrawlerProcess runs the spider in a single process. I want to run the same spider multiple times with different arguments. I tried the code below, but it doesn't work. How do I do multiprocessing? Please help. Thanks.
from scrapy.utils.project import get_project_settings
import multiprocessing
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess(settings=get_project_settings())
process.crawl(Spider, data=all_batches[0])
process1 = CrawlerProcess(settings=get_project_settings())
process1.crawl(Spider, data=all_batches[1])
p1 = multiprocessing.Process(target=process.start())
p2 = multiprocessing.Process(target=process1.start())
p1.start()
p2.start()
You need to run each scrapy crawler instance inside a separate process. This is because scrapy uses twisted, whose reactor cannot be restarted within the same process. Also, you need to disable the telnet console extension (TELNETCONSOLE_ENABLED), because otherwise scrapy will try to bind to the same port from multiple processes.

Test code:
import scrapy
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            print('my_data -> ', self.settings['my_data'])
            yield {'title': title.css('a ::text').get()}


def start_spider(spider, settings: dict = None, data: dict = None):
    # Merge the caller's settings with the per-batch data and disable the
    # telnet console so the processes don't fight over the same port.
    all_settings = {**(settings or {}), 'my_data': data or {}, 'TELNETCONSOLE_ENABLED': False}

    def crawler_func():
        crawler_process = CrawlerProcess(all_settings)
        crawler_process.crawl(spider)
        crawler_process.start()

    process = Process(target=crawler_func)
    process.start()
    return process


processes = [
    start_spider(TestSpider, data={'data': 'test_1'}),
    start_spider(TestSpider, data={'data': 'test_2'}),
]
# Note: map() is lazy in Python 3, so map(lambda x: x.join(), processes)
# would never actually join; iterate explicitly instead.
for p in processes:
    p.join()
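The launch-and-join pattern above works independently of Scrapy, and it can be useful to verify it in isolation. The sketch below replaces the crawler with a trivial worker that just echoes the batch it received through a Queue (in the real code, that worker body is where CrawlerProcess would be created and started); the worker and batch names here are illustrative, not part of Scrapy's API:

```python
from multiprocessing import Process, Queue


def worker(data: dict, out: Queue) -> None:
    # Stand-in for the crawler: report which batch this process handled.
    out.put(data['data'])


def run_batches(batches):
    out = Queue()
    procs = [Process(target=worker, args=(b, out)) for b in batches]
    for p in procs:
        p.start()
    for p in procs:
        # Join each process eagerly; a bare map() would build a lazy
        # iterator and never call join at all.
        p.join()
    # Drain one result per process; sort because completion order varies.
    return sorted(out.get() for _ in procs)


if __name__ == '__main__':
    print(run_batches([{'data': 'test_1'}, {'data': 'test_2'}]))
```

Each batch runs in its own OS process, so each one could safely host its own twisted reactor, which is exactly the property the Scrapy version relies on.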