
How to access pipeline database pool in scrapy spider

First, here is what I'm trying to do:

I have an XMLFeedSpider that goes through a list of products (the nodes) in an XML file and creates items that are saved to my database in a pipeline. The first time I see a product I need to create requests to do some scraping on the url field of the product to get images, etc. On subsequent reads of the feed, if I see the same product I don't want to waste time/resources doing this and just want to skip making these extra requests. To see which products to skip I need to access my database to check whether the product already exists.
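For context, here is a stripped-down sketch of the kind of spider I'm describing (the feed URL, XML tag names, and item fields are made up):

from scrapy import Request
from scrapy.spiders import XMLFeedSpider


class ProductFeedSpider(XMLFeedSpider):
    name = 'product_feed'
    start_urls = ['http://example.com/feed.xml']  # made-up feed URL
    itertag = 'product'                           # one <product> node per item

    def parse_node(self, response, node):
        item = {
            'url': node.xpath('url/text()').extract_first(),
            'title': node.xpath('title/text()').extract_first(),
        }
        yield item
        # This extra request is only needed the first time a product is
        # seen; it is the request I want to skip on later runs of the feed.
        yield Request(item['url'], callback=self.parse_product_page)

    def parse_product_page(self, response):
        pass  # scrape images etc. here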

Here are various ways I could think of to do this:

  1. Just create a db request for each product within the spider. This seems like a bad idea.
  2. In my item store pipeline I'm already creating a database pool as follows: dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs), and it would seem more efficient to just reuse that so I'm not constantly creating new database connections. I don't know how to access the instantiated pipeline class from my spider, though (that is probably more of a general Python question).
     Note: this guy is basically asking the same question but didn't really get the answer he was looking for: How to get the pipeline object in Scrapy spider
  3. Maybe before starting the crawl, load all of the product URLs into memory so I can compare against them when processing the products? Where would be a good place to do this? (A sketch of this idea follows the list.)
  4. Other suggestions?
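To make option 3 concrete, something like the following is what I have in mind (the products table and url column are made up; the settings keys are the same ones my pipeline below reads). I'm just not sure whether spider_opened is the right place to do the loading:

import psycopg2
from scrapy import Request, signals
from scrapy.spiders import XMLFeedSpider


class ProductFeedSpider(XMLFeedSpider):
    name = 'product_feed'
    start_urls = ['http://example.com/feed.xml']  # made-up feed URL
    itertag = 'product'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ProductFeedSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.load_seen_urls, signal=signals.spider_opened)
        return spider

    def load_seen_urls(self, spider):
        # One blocking query before the crawl starts; afterwards every
        # "have I seen this product?" check is an in-memory set lookup.
        conn = psycopg2.connect(host=self.settings['MYSQL_HOST'],
                                database=self.settings['MYSQL_DBNAME'],
                                user=self.settings['MYSQL_USER'],
                                password=self.settings['MYSQL_PASSWD'])
        cursor = conn.cursor()
        cursor.execute("SELECT url FROM products")  # made-up table/column
        self.seen_urls = set(row[0] for row in cursor.fetchall())
        cursor.close()
        conn.close()

    def parse_node(self, response, node):
        url = node.xpath('url/text()').extract_first()
        # ... build and yield the item exactly as before ...
        if url not in self.seen_urls:
            yield Request(url, callback=self.parse_product_page)

    def parse_product_page(self, response):
        pass  # image scraping etc.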

Update: this is my pipeline with the db pool

from twisted.enterprise import adbapi
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


class PostgresStorePipeline(object):
    """A pipeline to store the item in a PostgreSQL database.
    This implementation uses Twisted's asynchronous database API.
    """

    def __init__(self, dbpool):
        print "Opening connection pool..."
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            database=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWD'],
            #charset='utf8',
            #use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)
        return cls(dbpool)

I think you mean URL rather than item; remember that for Scrapy an item is a data output, and a pipeline is a mechanism for dealing with those output items.

Of course you don't need to open many connections to do your db queries, but you will still have to run the necessary queries. Whether it is better to do a single query up front or one per URL depends on how many records your database holds; you should test which works better in your case.

I would recommend setting your own DUPEFILTER_CLASS with something like:

from scrapy.dupefilters import RFPDupeFilter

class DBDupeFilter(RFPDupeFilter):

    def __init__(self, *args, **kwargs):
        # self.cursor = .....                       # instantiate your cursor
        super(DBDupeFilter, self).__init__(*args, **kwargs)

    def request_seen(self, request):
        if self.cursor.execute("myquery"):          # if exists
            return True
        else:
            return super(DBDupeFilter, self).request_seen(request)

    def close(self, reason):
        self.cursor.close()                         # close  your cursor
        super(DBDupeFilter, self).close(reason)
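To switch Scrapy to this filter, point the DUPEFILTER_CLASS setting at it in settings.py (the module path below is just an example):

# settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.DBDupeFilter'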

UPDATE

The problem here is that the DUPEFILTER_CLASS doesn't give you access to the spider in its request_seen method, or even in its constructor, so I think your best bet is a Downloader Middleware, where you can raise an IgnoreRequest exception.

  1. Instantiate the db connection on the spider. You could do this in the spider itself (its constructor), or you could add it through a signal from a Middleware or Pipeline; here we'll add it in the Middleware:

     from scrapy import signals
     from scrapy.exceptions import IgnoreRequest
     from twisted.enterprise import adbapi

     class DBMiddleware(object):

         def __init__(self):
             pass

         @classmethod
         def from_crawler(cls, crawler):
             o = cls()
             crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
             return o

         def spider_opened(self, spider):
             # dbargs: the same connection arguments dict used by the pipeline
             spider.dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)

         def process_request(self, request, spider):
             if spider.dbpool... # check if request.url is already in the database
                 raise IgnoreRequest()
  2. Now in your Pipeline, remove the instantiation of dbpool and get it from the spider argument when necessary; remember that process_item receives the item and the spider as arguments, so you should be able to use spider.dbpool to reach your db connection (see the sketch after this list).

  3. Remember to activate your middleware.
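Putting points 2 and 3 together, a minimal sketch (the module path, table and column names are hypothetical, and 543 is just an arbitrary middleware order value):

# settings.py -- point 3: activate the middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DBMiddleware': 543,
}

# pipelines.py -- point 2: the pipeline no longer builds its own pool,
# it reuses the one the middleware attached to the spider
class PostgresStorePipeline(object):

    def process_item(self, item, spider):
        # spider.dbpool was set in DBMiddleware.spider_opened
        spider.dbpool.runOperation(
            "INSERT INTO products (url, title) VALUES (%s, %s)",  # hypothetical table/columns
            (item['url'], item['title']),
        )
        return item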

That way you should only have one instance of the db connection pool, held by the spider object.
