
How to get the pipeline object in Scrapy spider

I use MongoDB to store the crawled data.

Now I want to query the last date of the stored data, so that I can continue the crawl without restarting from the beginning of the URL list (the URLs are determined by date, e.g. /2014-03-22.html).

I want only one connection object to handle the database operations, and that connection lives in the pipeline.

So I want to know how I can get the existing pipeline object (not a new one) in the spider.

Or is there any better solution for incremental updates?

Thanks in advance.

Sorry for my poor English... Here is a sample:

# This is my Pipeline
import pymongo
from scrapy.conf import settings  # global settings object used below

class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....
    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....

And the spider:

class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # Want to get the pipeline object here
        mongo = MongoDBPipeline() # this way creates a brand new pipeline object
        mongo.get_date()          # Scrapy already created a pipeline object for this spider;
                                  # I want that existing object, not a second one.

OK, I just don't want to create a new object... I admit I have OCD about this.

A Scrapy pipeline has an open_spider method that is executed after the spider is initialized. You can pass a reference to the database connection, to the get_date() method, or to the pipeline itself, to your spider. An example of the latter with your code is:

# This is my Pipeline
import pymongo
from scrapy.conf import settings

class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....

    def open_spider(self, spider):
        # hand the spider a reference to this pipeline instance
        spider.myPipeline = self

Then, in the spider:

class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()

I don't think the __init__() method is necessary here, but I included it to show that open_spider overwrites the attribute after initialization.

According to the Scrapy Architecture Overview:

The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.

Basically that means that the Scrapy spiders run first, and the extracted items then go to the pipelines - there is no way to go backwards.

One possible solution would be to check, in the pipeline itself, whether the item you've just scraped is already in the database, for example:
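A minimal sketch of that check, assuming each item carries a unique date field and the pipeline keeps a self.collection handle to the MongoDB collection (both are assumptions, neither appears in your code), and using the same old pymongo 2.x API your code already uses:

from scrapy.exceptions import DropItem

class MongoDBPipeline(object):
    ....
    def process_item(self, item, spider):
        # skip items whose date is already stored, insert new ones
        if self.collection.find_one({'date': item['date']}):
            raise DropItem("Already in MongoDB: %s" % item['date'])
        self.collection.insert(dict(item))
        return item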

Another workaround would be to keep the list of URLs you've already crawled in the database and, in the spider, check whether you already have the data for a URL, along these lines:
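A rough sketch of that idea, reusing the spider.myPipeline reference set up in open_spider above; the 'mydb' database name, the 'crawled_urls' collection and the url_seen() helper are hypothetical names, not part of your original code:

class MongoDBPipeline(object):
    ....
    def url_seen(self, url):
        # hypothetical helper: True if this url was stored during a previous run
        return self.connection['mydb']['crawled_urls'].find_one({'url': url}) is not None

and in the spider:

class Spider(Spider):
    name = "test"

    def parse(self, response):
        if self.myPipeline.url_seen(response.url):
            return  # this page was already scraped in an earlier run, skip it
        # ... extract and yield items as usual, then record response.url ...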

Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything specific.

Hope at least this information helped.
