How to access pipeline database pool in scrapy spider
First, here is what I'm trying to do:

I have an XMLFeedSpider that goes through a list of products (the nodes) in an XML file and creates items that are saved to my database in a pipeline. The first time I see a product, I need to create requests to do some scraping on the url field of the product to get images, etc. On subsequent reads of the feed, if I see the same product I don't want to waste time/resources doing this and just want to skip making those extra requests. To see which products to skip, I need to access my database to check whether the product already exists.
Here are various ways I could think of to do this. My pipeline already creates a connection pool:

dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)

and it would seem more efficient to just reuse that so I'm not constantly creating new database connections. I don't know how to access the instantiated pipeline class from my spider, though (that is probably more of a general Python question).

Update: this is my pipeline with the db pool:
from twisted.enterprise import adbapi
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class PostgresStorePipeline(object):
    """A pipeline to store the item in a PostgreSQL database.

    This implementation uses Twisted's asynchronous database API.
    """
    def __init__(self, dbpool):
        print "Opening connection pool..."
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            database=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWD'],
            # charset='utf8',
            # use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)
        return cls(dbpool)
I think by item you mean URL; remember that in scrapy an item is a data output, and a pipeline is a mechanism for dealing with those output items.
Of course you don't need to open many connections to do your db queries, but you will have to do the necessary queries. Whether to do only one query up front or one query per URL depends on how many records you have in your database; you should test which one is better in your case.
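To make the trade-off concrete, here is a minimal, runnable sketch of both approaches. It uses Python's built-in sqlite3 in place of psycopg2 so the sketch is self-contained; the products table and the URLs are made-up examples:

```python
import sqlite3

# sqlite3 stands in for psycopg2 here; the table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (url TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO products (url) VALUES (?)",
                 [("http://example.com/a",), ("http://example.com/b",)])

# Option 1: one query up front, then O(1) in-memory membership checks.
# Good when the table fits comfortably in memory.
seen_urls = {row[0] for row in conn.execute("SELECT url FROM products")}

def seen_bulk(url):
    return url in seen_urls

# Option 2: one query per URL; slower per check but cheap on memory,
# and always sees the latest rows.
def seen_per_url(url):
    row = conn.execute(
        "SELECT 1 FROM products WHERE url = ?", (url,)).fetchone()
    return row is not None

print(seen_bulk("http://example.com/a"))     # True
print(seen_per_url("http://example.com/c"))  # False
```

Option 1 pays one round-trip at startup; Option 2 pays one per request, which is why it's worth measuring against your own table size.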
I would recommend setting your own DUPEFILTER_CLASS with something like:
from scrapy.dupefilters import RFPDupeFilter

class DBDupeFilter(RFPDupeFilter):

    def __init__(self, *args, **kwargs):
        # self.cursor = .....  # instantiate your cursor
        super(DBDupeFilter, self).__init__(*args, **kwargs)

    def request_seen(self, request):
        if self.cursor.execute("myquery"):  # if exists
            return True
        else:
            return super(DBDupeFilter, self).request_seen(request)

    def close(self, reason):
        self.cursor.close()  # close your cursor
        super(DBDupeFilter, self).close(reason)
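If you go this route, the filter also has to be enabled in your project settings; the module path below is a made-up example, adjust it to wherever you put the class:

```python
# settings.py -- 'myproject.dupefilters' is a hypothetical module path
DUPEFILTER_CLASS = 'myproject.dupefilters.DBDupeFilter'
```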
UPDATE
The problem here is that the DUPEFILTER_CLASS doesn't get the spider in its request_seen method, or even in its constructor, so I think your best shot is a Downloader Middleware, where you can raise an IgnoreRequest exception.
Instantiate the db connection on the spider. You could do this in the spider itself (in its constructor), or you could also add it through a signal in a Middleware or Pipeline; here we'll add it in the Middleware:
from twisted.enterprise import adbapi
from scrapy import signals
from scrapy.exceptions import IgnoreRequest

class DBMiddleware(object):

    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        o = cls()
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        spider.dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)

    def process_request(self, request, spider):
        if spider.dbpool...  # check if request.url is inside the database
            raise IgnoreRequest()
Now in your Pipeline, remove the instantiation of dbpool and get it from the spider argument when necessary; remember that process_item receives the item and the spider as arguments, so you should be able to use spider.dbpool to check your db connection.
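For instance, the pipeline's process_item might reuse the pool like this. This is a sketch, not the asker's actual code: _do_upsert and the INSERT query are hypothetical placeholders, while runInteraction is Twisted adbapi's standard way to run a function with a cursor in a pool thread:

```python
class PostgresStorePipeline(object):
    """Sketch: the pipeline no longer builds its own pool, it reuses spider.dbpool."""

    def process_item(self, item, spider):
        # runInteraction runs _do_upsert in a pool thread with a live cursor
        # and returns a Deferred; returning it makes Scrapy wait on the write.
        deferred = spider.dbpool.runInteraction(self._do_upsert, item)
        deferred.addCallback(lambda _: item)  # pass the item down the pipeline
        return deferred

    def _do_upsert(self, cursor, item):
        # Hypothetical query -- adapt the table and columns to your schema.
        cursor.execute("INSERT INTO products (url) VALUES (%s)", (item['url'],))
```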
Remember to activate your middleware.
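Activation is a settings.py entry; the module path and the 543 priority below are made-up examples, adjust them to your project:

```python
# settings.py -- 'myproject.middlewares' and the 543 priority are assumptions
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DBMiddleware': 543,
}
```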
That way you should only be creating one instance of the db connection inside the spider object.