item pipeline not working in scrapy
I have written the following code, and I found that the item pipeline does not work when written this way: process_item (in the item pipeline) is never executed.
import re
from urlparse import urlparse  # Python 2; use urllib.parse on Python 3

import scrapy
from scrapy import Request, log


class Spider(scrapy.Spider):
    name = "***"

    def __init__(self, url='http://example.com/', **kw):
        super(Spider, self).__init__(**kw)
        self.url = url
        self.allowed_domains = [re.sub(r'^www\.', '', urlparse(url).hostname)]

    def start_requests(self):
        # return [Request(self.url, callback=self.parse, dont_filter=False)]
        return [Request(self.url, callback=self.find_all_url, dont_filter=False)]

    def find_all_url(self, response):
        log.msg('current url: ' + response.url, level=log.DEBUG)
        if True:
            self.parse(response)

    def parse(self, response):
        dept = deptItem()
        dept['deptName'] = response.xpath('//title/text()').extract()[0].strip()
        dept['url'] = response.url
        log.msg('find an item: ' + str(response.url) + '\n going to return item', level=log.INFO)
        return dept
However, if I change the callback in start_requests from self.find_all_url to self.parse (see the commented-out line above), the item pipeline works. I have tried to find out why, but I couldn't. Can anyone help?
I have found out that if I want to write it this way, I need to add a return in front of self.parse(response) in the function find_all_url.
But I am not very clear on why this is the case. I guess the returned item has to propagate back to whatever issued the initial request?
Can you post your settings?
You must define the pipelines in settings.py:

ITEM_PIPELINES = {
    'MySpider.pipelines.SomePipeline': 300,
}
Basic example: https://github.com/scrapy/dirbot
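For completeness, a pipeline referenced that way is just a class with a process_item method; the class and field names below are assumptions matching the 'MySpider.pipelines.SomePipeline' entry, not code from the question:

```python
# Hypothetical pipelines.py sketch. process_item is called once for
# every item a spider callback returns/yields, in priority order (300).
class SomePipeline:
    def process_item(self, item, spider):
        # Example processing: normalize a field. A real pipeline could
        # also raise scrapy.exceptions.DropItem to discard the item.
        item["deptName"] = item.get("deptName", "").strip()
        return item  # must return the item for later pipelines to see it

# process_item is plain Python, so it can be exercised directly:
print(SomePipeline().process_item({"deptName": "  IT  "}, spider=None))
# {'deptName': 'IT'}
```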