Building Scrapy spiders into my own program (I don't want to call Scrapy from the command line)

Question

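Similar to this question: stackoverflow: run-multiple-spiders-in-scrapy

I am wondering, can I run an entire Scrapy project from within another Python program? Say I want to build a whole program that needs to scrape several different sites, and I build an entire Scrapy project for each site.

Instead of running them from the command line, I want to run these spiders from my program and collect the information they return.

I can already use MongoDB from Python, and I can already build Scrapy projects that contain spiders; now I just need to merge it all into one application.

I want to run the application once and be able to control multiple spiders from my own program.

Why do this? The application may also connect to other sites through an API, and it needs to compare the results from the API site with the results from the scraped site in real time. I don't want to call Scrapy from the command line; the program is meant to be self-contained.

(I have been asking a lot of questions about scraping recently, because I am trying to find the right solution to build on.)

Thanks :)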

Answer 1

Yes, of course you can ;)

The idea (inspired by this blog post) is to create a worker and then use it in your own Python script:

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)

Usage example:

result_queue = Queue()
crawler = CrawlerWorker(MySpider(myArgs), result_queue)
crawler.start()
for item in result_queue.get():
    yield item

Another way is to execute the scrapy crawl command with system().
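A minimal sketch of that alternative, using subprocess instead of os.system so that failures raise an exception. It assumes the scrapy executable is on the PATH and that the crawl is started from the directory containing scrapy.cfg. The helper names build_crawl_command and run_crawl, the spider name, and the project directory are hypothetical; -o (export items to a file) and -a (pass a spider argument) are options of the scrapy crawl command.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def build_crawl_command(spider_name, output_path, spider_args=None):
    """Return the argument list for `scrapy crawl <spider> -o <file>`."""
    command = ["scrapy", "crawl", spider_name, "-o", str(output_path)]
    # Each spider argument is passed as a separate `-a key=value` pair.
    for key, value in (spider_args or {}).items():
        command += ["-a", f"{key}={value}"]
    return command


def run_crawl(spider_name, project_dir, spider_args=None):
    """Run one crawl in a child process and return the exported items."""
    with tempfile.TemporaryDirectory() as tmp:
        output_path = Path(tmp) / "items.json"
        command = build_crawl_command(spider_name, output_path, spider_args)
        # check=True raises CalledProcessError if scrapy exits with an error.
        subprocess.run(command, cwd=project_dir, check=True)
        # Depending on the feed settings, a crawl that yields no items
        # may leave no file or an empty one.
        if not output_path.exists() or output_path.stat().st_size == 0:
            return []
        return json.loads(output_path.read_text(encoding="utf-8"))
```

Something like run_crawl("quotes", "/path/to/project") would then return a list of dicts. Because every call starts a new process, the same spider can be run several times in a row, at the cost of passing the results through a file.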

Answer 2

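Maxime Lorant's answer finally solved the problems I had running Scrapy spiders from my own script. It solves two problems I was having:

It allows the spider to be called twice in succession (with the simple example from the Scrapy tutorial this crashes, because you cannot start the Twisted reactor twice).
It allows variables to be returned from the spider back to the script.

There is only one thing: the example does not work with the Scrapy version I am using now (Scrapy 1.5.2) and Python 3.7.

After playing with the code for a while, I got a working example that I would like to share. I also have a question; see below the script. It is a standalone script, so I have included a spider as well.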

import logging
import multiprocessing as mp

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.signals import item_passed
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher


class CrawlerWorker(mp.Process):
    name = "crawlerworker"

    def __init__(self, spider, result_queue):
        mp.Process.__init__(self)
        self.result_queue = result_queue
        self.items = list()
        self.spider = spider
        self.logger = logging.getLogger(self.name)

        self.settings = get_project_settings()
        self.logger.setLevel(logging.DEBUG)
        self.logger.debug("Create CrawlerProcess with settings {}".format(self.settings))
        self.crawler = CrawlerProcess(self.settings)

        dispatcher.connect(self._item_passed, item_passed)

    def _item_passed(self, item):
        self.logger.debug("Adding Item {} to {}".format(item, self.items))
        self.items.append(item)

    def run(self):
        self.logger.info("Start here with {}".format(self.spider.urls))
        self.crawler.crawl(self.spider, urls=self.spider.urls)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, **kw):
        super(QuotesSpider, self).__init__(**kw)

        self.urls = kw.get("urls", [])

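    def start_requests(self):
        # A for/else clause runs whenever the loop ends without a break,
        # so the empty case is checked explicitly instead.
        if not self.urls:
            self.log('Nothing to scrape. Please pass the urls')
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)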

    def parse(self, response):
        """ Count number of The's on the page """
        the_count = len(response.xpath("//body//text()").re(r"The\s"))
        self.log("found {} time 'The'".format(the_count))
        yield {response.url: the_count}


def report_items(message, item_list):
    print(message)
    if item_list:
        for cnt, item in enumerate(item_list):
            print("item {:2d}: {}".format(cnt, item))
    else:
        print("No items found")


url_list = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
]

result_queue1 = mp.Queue()
crawler = CrawlerWorker(QuotesSpider(urls=url_list[:2]), result_queue1)
crawler.start()
# read the result before joining: a child process that has put data on a
# queue may not exit until that data has been consumed
first_result = result_queue1.get()
crawler.join()

# crawl again
result_queue2 = mp.Queue()
crawler = CrawlerWorker(QuotesSpider(urls=url_list[2:]), result_queue2)
crawler.start()
second_result = result_queue2.get()
crawler.join()

report_items("First result", first_result)
report_items("Second result", second_result)

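As you can see, the code is almost identical, except that some imports have changed because of changes in the Scrapy API.

One thing: I got a deprecation warning from the pydispatch import: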

 ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762
  module = self._system_import(name, *args, **kwargs)

I found here how to solve this, but I cannot get it to work. Does anyone know how to apply the from_crawler class method to get rid of the deprecation warning?


Answer 1
Score 8, accepted, 2012-06-28 10:28:01

Answer 2
Score 0, 2019-03-08 15:49:35
