How to run Scrapy Crawler Process parallel in separate processes? (Multiprocessing)
I am trying to run a spider with multiprocessing. I know that CrawlerProcess runs the spider in a single process. I want to run the same spider multiple times with different arguments. I tried the code below, but it doesn't work. How do I do multiprocessing? Please help. Thanks.
from scrapy.utils.project import get_project_settings
import multiprocessing
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess(settings=get_project_settings())
process.crawl(Spider, data=all_batches[0])
process1 = CrawlerProcess(settings=get_project_settings())
process1.crawl(Spider, data=all_batches[1])
p1 = multiprocessing.Process(target=process.start())
p2 = multiprocessing.Process(target=process1.start())
p1.start()
p2.start()
You need to run each scrapy crawler instance in a separate process. This is because scrapy uses twisted, and you can't use twisted more than once in the same process. Also, you need to disable the scrapy telnet console extension, because otherwise every process tries to bind to the same telnet port.
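To illustrate the constraint described above (this snippet is not part of the original answer, and MySpider is just a placeholder spider defined for the demonstration): reusing a CrawlerProcess inside one process typically fails because the twisted reactor cannot be restarted once it has run.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # placeholder spider, used only to demonstrate the reactor restriction
    name = 'demo'
    start_urls = ['https://blog.scrapinghub.com']

process = CrawlerProcess()
process.crawl(MySpider)
process.start()    # runs the twisted reactor and blocks until the crawl finishes

process.crawl(MySpider)
process.start()    # typically raises twisted.internet.error.ReactorNotRestartable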
Test code:
import scrapy
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            # the per-process data is exposed to the spider through a custom setting
            print('my_data -> ', self.settings['my_data'])
            yield {'title': title.css('a ::text').get()}


def start_spider(spider, settings: dict = {}, data: dict = {}):
    # pass the data through a custom setting and disable the telnet console,
    # so the processes do not all try to bind to the same telnet port
    all_settings = {**settings, **{'my_data': data, 'TELNETCONSOLE_ENABLED': False}}

    def crawler_func():
        crawler_process = CrawlerProcess(all_settings)
        crawler_process.crawl(spider)
        crawler_process.start()

    process = Process(target=crawler_func)
    process.start()
    return process


# consume the list explicitly so join() actually runs (map() is lazy in Python 3)
for p in [
    start_spider(TestSpider, data={'data': 'test_1'}),
    start_spider(TestSpider, data={'data': 'test_2'}),
]:
    p.join()
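One caveat worth adding (not part of the original answer): on platforms where multiprocessing uses the spawn start method (Windows, and macOS on recent Python versions), the main module is re-imported in every child process, so the process-spawning calls should sit behind an if __name__ == '__main__': guard. A minimal sketch, reusing TestSpider and start_spider from the test code above:

if __name__ == '__main__':
    # start both crawls, then wait for them to finish; the guard keeps the
    # spawned child processes from re-running these lines when the module
    # is re-imported
    processes = [
        start_spider(TestSpider, data={'data': 'test_1'}),
        start_spider(TestSpider, data={'data': 'test_2'}),
    ]
    for p in processes:
        p.join()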