Decoupling a single spider into different spiders in Scrapy

I'd like to decouple parsing into different spiders.

Currently I have:

import scrapy


class CategoriesSpider(scrapy.Spider):
    name = 'categories'
    allowed_domains = ['example.org']
    start_urls = ['https://example.org/categories']

    def parse(self, response):
        for link in response.xpath("//div[@class='something']/@href"):
            yield scrapy.Request(response.urljoin(link.root), callback=self.parse_actual_item_i_want)

    def parse_actual_item_i_want(self, response):
        yield self.find_the_item(response)

And now split it into:

import scrapy


class OneThingSpider(scrapy.Spider):
    name = 'one_thing'
    allowed_domains = ['example.org']
    start_urls = ['https://example.org/']

    def __init__(self, url: str = "", *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not url:
            raise ValueError("Invalid url given")

        # Exact URL and its format is not known
        self.start_urls = [url]

    def parse(self, response):
        yield self.find_the_item(response)

So that if only one thing is updated, I can use just OneThingSpider.

So where and how do I call OneThingSpider from inside CategoriesSpider or a pipeline?

I've tried these:

Inside CategoriesSpider, attempt #1:

    def parse(self, response):
        for link in response.xpath("//div[@class='something']/@href"):
            yield self.crawler.crawl("one_thing", response.urljoin(link.root))

In the pipeline, attempt #1:

In CategoriesSpider:

    def parse(self, response):
        for link in response.xpath("//div[@class='something']/@href"):
            yield CategoryUrl({"url": response.urljoin(link.root)})

In the pipeline:

class MyPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, CategoryUrl):
            spider.crawler.crawl("one_thing", item["url"])
            return

        # this part works
        if isinstance(item, ItemIWant): 
            with open(...)

In the pipeline, attempt #2:

Same change in CategoriesSpider as in attempt #1. Here we are simply trying to push the URLs to OneThingSpider after CategoriesSpider has closed.

In the pipeline:

class MyPipeline(object):
    urls = []
    def process_item(self, item, spider):
        if isinstance(item, CategoryUrl):
            self.urls.append(item["url"])
            spider.crawler.crawl("one_thing", item["url"])
            return

        # this part works
        if isinstance(item, ItemIWant): 
            with open(...)

    def close_spider(self, spider):
        if spider.name == "categories":
            for url in self.urls:
                spider.crawler.crawl("one_thing", url)

The error I get is that the crawler is already crawling. So what is the correct way to decouple a spider into smaller spiders?

The goal is that I can run:

scrapy crawl categories

and also

scrapy crawl one_thing -a url="https://example.org/something/xyz.html"
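
For reference, the error above comes from calling crawl() on the crawler that is already running. The way Scrapy documents running several spiders programmatically is a standalone script using CrawlerProcess; a minimal sketch, assuming both spider classes are importable from a hypothetical myproject.spiders module:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import path; adjust to wherever the spiders actually live.
from myproject.spiders import CategoriesSpider, OneThingSpider

process = CrawlerProcess(get_project_settings())
process.crawl(CategoriesSpider)
process.crawl(OneThingSpider, url="https://example.org/something/xyz.html")
process.start()  # blocks here until both crawls finish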

Usually you'd use class inheritance for reusing spider code in multiple spiders:

from scrapy import Spider, Request


class BaseSpider(Spider):
    start_urls = NotImplemented
    start_body = ""

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, method='POST', body=self.start_body)

    def parse(self, response):
        raise NotImplementedError


# then children spiders
class CakeSpider(BaseSpider):
    name = 'cakes'
    start_urls = ['http://example1.com']
    start_body = '{"category": "cakes"}'

    def parse(self, response):
        # custom parser for first spider here
        ...


class VegetableSpider(BaseSpider):
    name = 'vegetables'
    start_urls = ['http://example2.com']
    start_body = '{"category": "vegetables"}'

    def parse(self, response):
        # custom parser for second spider here
        ...

As a result you'd run scrapy crawl cakes to start crawling the cake category and scrapy crawl vegetables to crawl the vegetable category.
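
Applied to the spiders from the question, the same inheritance pattern might look roughly like the sketch below. BaseItemSpider is just an illustrative name, and find_the_item() stands in for the shared extraction logic that the question doesn't show:

import scrapy


class BaseItemSpider(scrapy.Spider):
    allowed_domains = ['example.org']

    def find_the_item(self, response):
        # Shared extraction logic from the question goes here (not shown there).
        raise NotImplementedError


class CategoriesSpider(BaseItemSpider):
    name = 'categories'
    start_urls = ['https://example.org/categories']

    def parse(self, response):
        for link in response.xpath("//div[@class='something']/@href"):
            yield scrapy.Request(response.urljoin(link.root),
                                 callback=self.parse_actual_item_i_want)

    def parse_actual_item_i_want(self, response):
        yield self.find_the_item(response)


class OneThingSpider(BaseItemSpider):
    name = 'one_thing'

    def __init__(self, url: str = "", *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not url:
            raise ValueError("Invalid url given")
        self.start_urls = [url]

    def parse(self, response):
        yield self.find_the_item(response)

With this layout, scrapy crawl categories walks the category pages and scrapy crawl one_thing -a url="https://example.org/something/xyz.html" re-scrapes a single page, while the extraction code lives in one place.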
