Decoupling a single spider into different spiders in Scrapy
I'd like to decouple the parsing into different spiders.
Currently I have:
class CategoriesSpider(scrapy.Spider):
    name = 'categories'
    allowed_domains = ['example.org']
    start_urls = ['https://example.org/categories']

    def parse(self, response):
        for link in response.xpath("//div[@class='something']/@href"):
            yield scrapy.Request(response.urljoin(link.root), callback=self.parse_actual_item_i_want)

    def parse_actual_item_i_want(self, response):
        yield self.find_the_item(response)
And I now want to split it into:
class OneThingSpider(scrapy.Spider):
    name = 'one_thing'
    allowed_domains = ['example.org']
    start_urls = ['https://example.org/']

    def __init__(self, url: str):
        if url is None or url == "":
            raise ValueError("Invalid url given")
        # Exact URL and its format is not known
        self.start_urls = [url]

    def parse(self, response):
        yield self.find_the_item(response)
So that if only one thing is updated, I can use just OneThingSpider.
So where and how do I call OneThingSpider inside CategoriesSpider or in a pipeline?
Attempt #1, in CategoriesSpider:

def parse(self, response):
    for link in response.xpath("//div[@class='something']/@href"):
        yield self.crawler.crawl("one_thing", response.urljoin(link.root))
Attempt #2, in CategoriesSpider:

def parse(self, response):
    for link in response.xpath("//div[@class='something']/@href"):
        yield CategoryUrl({"url": response.urljoin(link.root)})
In the pipeline:

class MyPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, CategoryUrl):
            spider.crawler.crawl("one_thing", item["url"])
            return
        # this part works
        if isinstance(item, ItemIWant):
            with open(...)
Attempt #3: the same change in CategoriesSpider as in attempt #2. Here we are simply trying to push the URLs to OneThingSpider after CategoriesSpider has closed.
In the pipeline:

class MyPipeline(object):
    urls = []

    def process_item(self, item, spider):
        if isinstance(item, CategoryUrl):
            self.urls.append(item["url"])
            spider.crawler.crawl("one_thing", item["url"])
            return
        # this part works
        if isinstance(item, ItemIWant):
            with open(...)

    def close_spider(self, spider):
        if spider.name == "categories":
            for url in self.urls:
                spider.crawler.crawl("one_thing", url)
The error I get is that the crawler is already crawling. So what is the correct way to decouple a spider into smaller spiders?
The goal is that I can run:

scrapy crawl categories

and also:

scrapy crawl one_thing -a url="https://example.org/something/xyz.html"
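For reference, calling spider.crawler.crawl(...) from a running spider or pipeline is what raises the "already crawling" error: spider.crawler is the Crawler that is currently executing the spider, and a Crawler can only run one crawl at a time. The documented way to run several spiders from one script is scrapy.crawler.CrawlerProcess. A minimal sketch, assuming it is run from the project root so the spider names can be resolved from the project settings:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess accepts spider names when built from project settings
process = CrawlerProcess(get_project_settings())
process.crawl("categories")
process.crawl("one_thing", url="https://example.org/something/xyz.html")
process.start()  # blocks until all scheduled crawls finish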
Usually you'd use class inheritance to reuse spider code across multiple spiders:
from scrapy import Spider, Request

class BaseSpider(Spider):
    start_urls = NotImplemented
    start_body = ""

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, method='POST', body=self.start_body)

    def parse(self, response):
        raise NotImplementedError

# then children spiders
class CakeSpider(BaseSpider):
    name = 'cakes'
    start_urls = ['http://example1.com']
    start_body = '{"category": "cakes"}'

    def parse(self, response):
        ...  # custom parser for first spider here

class VegetableSpider(BaseSpider):
    name = 'vegetables'
    start_urls = ['http://example2.com']
    start_body = '{"category": "vegetables"}'

    def parse(self, response):
        ...  # custom parser for second spider here
As a result you'd run scrapy crawl cakes to start crawling the cake category, and scrapy crawl vegetables to crawl the vegetable category.
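Applied to the spiders from the question, the same pattern could look roughly like the sketch below. This is only an illustration, not the answerer's code; it assumes find_the_item is the asker's existing (unshown) extraction helper, moved into a shared base class:

import scrapy

class BaseItemSpider(scrapy.Spider):
    allowed_domains = ['example.org']

    def parse_item(self, response):
        # shared extraction logic; find_the_item is the asker's helper
        yield self.find_the_item(response)

class CategoriesSpider(BaseItemSpider):
    name = 'categories'
    start_urls = ['https://example.org/categories']

    def parse(self, response):
        # follow each category link and hand it to the shared parser
        for link in response.xpath("//div[@class='something']/@href"):
            yield scrapy.Request(response.urljoin(link.root),
                                 callback=self.parse_item)

class OneThingSpider(BaseItemSpider):
    name = 'one_thing'

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not url:
            raise ValueError("Invalid url given")
        self.start_urls = [url]

    def parse(self, response):
        # a single page is parsed directly with the shared logic
        return self.parse_item(response)

With this, scrapy crawl categories crawls everything, while scrapy crawl one_thing -a url="https://example.org/something/xyz.html" re-parses just one page, and neither spider ever needs to start the other.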