
Is there any method to use a separate scrapy pipeline for each spider?

I want to fetch web pages under different domains, which means I have to use different spiders under the command "scrapy crawl myspider". However, I have to use different pipeline logic to put the data into the database, since the content of the web pages differs. But every spider has to go through all of the pipelines defined in settings.py. Is there another elegant method to use separate pipelines for each spider?

The ITEM_PIPELINES setting is defined globally for all spiders in the project when the engine starts. It cannot be changed per spider on the fly.
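
For reference, this is roughly what that global setting looks like in settings.py; the project and pipeline names here are placeholders, and note that older Scrapy versions use a plain list of pipeline paths while newer ones use a dict mapping each path to an order number:

# settings.py -- applies to every spider in the project
ITEM_PIPELINES = {
    'myproject.pipelines.Pipeline1': 300,
    'myproject.pipelines.Pipeline2': 400,
}
# older Scrapy (0.x) style was a list instead:
# ITEM_PIPELINES = ['myproject.pipelines.Pipeline1', 'myproject.pipelines.Pipeline2']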

Here are some options to consider:

  • Change the code of your pipelines. Skip or continue processing the items returned by a spider in the process_item method of your pipeline, e.g.:

     def process_item(self, item, spider):
         if spider.name not in ['spider1', 'spider2']:
             return item
         # process item
  • Change the way you start crawling. Do it from a script, and, based on the spider name passed as a parameter, override your ITEM_PIPELINES setting before calling crawler.configure() (see the sketch below).
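
A minimal sketch of that second option follows. It assumes the legacy Scrapy 0.x "run from a script" API that this answer refers to (Crawler.configure() was removed in later releases); the project, spider, and pipeline paths are placeholders:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from myproject.spiders.spider1 import Spider1  # hypothetical spider module

settings = get_project_settings()
# override ITEM_PIPELINES for this run only, before configure() is called
settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.Pipeline1']

crawler = Crawler(settings)
crawler.configure()
crawler.crawl(Spider1())
crawler.start()
reactor.run()

In recent Scrapy versions the same per-spider override is normally expressed with the spider's custom_settings class attribute, which can carry its own ITEM_PIPELINES dict.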


Hope that helps.

A slightly better version of the above is as follows. It is better because it lets you selectively turn pipelines on for different spiders more easily than hard-coding 'not in ['spider1', 'spider2']' in the pipeline above.

In your spider class, add:

# start_urls = ...
# allows you to selectively turn on pipelines within spiders
pipelines = ['pipeline1', 'pipeline2']
# ...

Then in each pipeline, you can use the getattr method as magic. Add:

class pipeline1(object):
    def process_item(self, item, spider):
        # skip spiders that did not opt in to this pipeline
        if 'pipeline1' not in getattr(spider, 'pipelines', []):
            return item
        # ...keep going as normal

A more robust solution; I can't remember where I found it, but a Scrapy dev proposed it somewhere. Using this method lets you have some pipelines run on all spiders simply by not using the wrapper. It also means you don't have to duplicate the logic of checking whether or not to use the pipeline.

Wrapper:

import functools


def check_spider_pipeline(process_item_method):
    """This wrapper makes it so pipelines can be turned on and off at a spider level."""
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # only run the real process_item if the spider listed this pipeline class
        if self.__class__ in spider.pipeline:
            return process_item_method(self, item, spider)
        else:
            return item

    return wrapper

Usage:

@check_spider_pipeline
def process_item(self, item, spider):
    # ... normal pipeline logic here ...
    return item

Spider usage:

# a set of the pipeline classes (not names) that this spider should run through
pipeline = {some.pipeline, some.other.pipeline}
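
Putting the decorator approach together, a self-contained sketch might look like the following. The import path for the wrapper, the pipeline class, and the spider are made-up names for illustration; in a real project the pipeline and spider live in separate modules, and the pipeline still has to be listed in ITEM_PIPELINES like any other pipeline. The wrapper only decides whether its body actually runs for a given spider:

import scrapy
from myproject.decorators import check_spider_pipeline  # hypothetical module holding the wrapper above


class DatabasePipeline(object):
    """Example pipeline; it must still appear in ITEM_PIPELINES."""

    @check_spider_pipeline
    def process_item(self, item, spider):
        # per-site storage logic would go here
        return item


class Spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://example.com']

    # only pipelines whose class appears in this set run their full
    # process_item body for this spider; others just pass the item through
    pipeline = {DatabasePipeline}

    def parse(self, response):
        yield {'url': response.url}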
