How to yield nested items populated from multiple pages / multiple parsers only once in Scrapy
Python Scrapy nested pages: only need items from innermost page
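I am practicing Scrapy on a website with nested pages, and I only need to scrape the content of the innermost page. Is there a way to chain several parse functions, each opening the next level of pages, so that data is carried from the main parse function down to the innermost page and only the last parse function produces the items?
Here is what I have tried: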
try:
    import scrapy
    from urlparse import urljoin
except ImportError:
    print "\nERROR IMPORTING THE NESSASARY LIBRARIES\n"


class CanadaSpider(scrapy.Spider):
    name = 'CananaSpider'
    start_urls = ['http://www.canada411.ca']

    # PAGE 1 OF THE NESTED WEBSITE: GET EACH LINK, JOIN IT WITH THE MAIN LINK AND VISIT THE PAGE
    def parse(self, response):
        SET_SELECTOR = '.c411AlphaLinks.c411NoPrint ul li'
        for PHONE in response.css(SET_SELECTOR):
            selector = 'a ::attr(href)'
            try:
                momo = urljoin('http://www.canada411.ca', PHONE.css(selector).extract_first())
                # PASSING A DICTIONARY AS THE ITEM
                pre = {}
                post = scrapy.Request(momo, callback=self.parse_pre1, meta={'item': pre})
                yield pre
            except:
                pass

    # PAGE 2 OF THE NESTED WEBSITE
    def parse_pre1(self, response):
        # RETURNING THE SAME ITEM
        item = response.meta["item"]
        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for PHONE in response.css(SET_SELECTOR):
            selector = 'a ::attr(href)'
            momo = urljoin('http://www.canada411.ca', PHONE.css(selector).extract_first())
            pre = scrapy.Request(momo, callback=self.parse_pre1, meta={'page_2': item})
            yield pre

    def parse_info(self, response):
        # HERE I AM SCRAPING THE DATA
        item = response.meta["page_2"]
        name = '.vcard__name'
        address = '.c411Address.vcard__address'
        ph = '.vcard.label'
        item['name'] = response.css(name).extract_first()
        item['address'] = response.css(address).extract_first()
        item['phoneno'] = response.css(ph).extract_first()
        return item
I am carrying the item over between the callbacks. What am I doing wrong?
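In parse you are yielding pre (the empty item) instead of post (the Request), so the request is never scheduled and parse_pre1 never runs. You should also consider using a scrapy.Item subclass rather than a plain dict.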
def parse(self, response):
    SET_SELECTOR = '.c411AlphaLinks.c411NoPrint ul li'
    for PHONE in response.css(SET_SELECTOR):
        selector = 'a ::attr(href)'
        try:
            momo = urljoin('http://www.canada411.ca', PHONE.css(selector).extract_first())
            # PASSING A DICTIONARY AS THE ITEM
            pre = {}  # This should be an instance of a scrapy.Item subclass
            post = scrapy.Request(momo, callback=self.parse_pre1, meta={'item': pre})
            yield post  # yield the Request, not the item
        except:
            pass
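A side note on the import used in these snippets: `from urlparse import urljoin` is the Python 2 location of the function. On Python 3 it lives in `urllib.parse`. The sketch below shows how it resolves a few href values against the site root; the hrefs are made up for illustration and are not taken from the real site. Scrapy responses also offer `response.urljoin(...)`, which resolves against the URL of the page being parsed.

```python
from urllib.parse import urljoin  # Python 3 location of urljoin

BASE = "http://www.canada411.ca"

# Hypothetical href values, standing in for what extract_first() might return.
hrefs = ["/search/a.html", "browse/b.html", "http://example.com/c"]

# Relative hrefs are resolved against BASE; absolute URLs are kept as they are.
absolute = [urljoin(BASE, href) for href in hrefs]
for url in absolute:
    print(url)
```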
In parse_pre1 you set the callback to parse_pre1 again; I think you meant parse_info.
def parse_pre1(self, response):
    # RETURNING THE SAME ITEM
    item = response.meta["item"]
    SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
    for PHONE in response.css(SET_SELECTOR):
        selector = 'a ::attr(href)'
        momo = urljoin('http://www.canada411.ca', PHONE.css(selector).extract_first())
        pre = scrapy.Request(momo, callback=self.parse_info, meta={'page_2': item})
        yield pre
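To see why yielding the item instead of the request breaks the chain, here is a toy model in plain Python. It is not Scrapy and does not reflect how Scrapy's engine is implemented; `FakeRequest`, `FakeResponse` and `crawl` are hypothetical stand-ins that only mimic the idea of "follow yielded requests, collect yielded items", and the URLs and scraped value are placeholders.

```python
# Toy stand-in for a crawl loop; this is NOT Scrapy's engine, only an illustration.
class FakeRequest:
    """Hypothetical minimal request: a URL, a callback and a meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical minimal response that carries the request's meta along."""
    def __init__(self, url, meta):
        self.url = url
        self.meta = meta


def crawl(start_callback):
    """Follow every yielded FakeRequest; collect everything else as an item."""
    items = []
    queue = [FakeRequest("start", start_callback)]
    while queue:
        request = queue.pop(0)
        response = FakeResponse(request.url, request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)   # schedule the next page
            else:
                items.append(result)   # treat anything else as a scraped item
    return items


def parse_info(response):
    # Innermost page: fill the item that was carried over and emit it.
    item = response.meta["page_2"]
    item["name"] = "name from " + response.url
    yield item


def parse_pre1(response):
    # Middle page: pass the item on to the innermost callback.
    item = response.meta["item"]
    yield FakeRequest("detail", parse_info, meta={"page_2": item})


def parse_buggy(response):
    pre = {}
    post = FakeRequest("listing", parse_pre1, meta={"item": pre})
    yield pre    # bug: emits the empty dict; the request is never scheduled


def parse_fixed(response):
    pre = {}
    post = FakeRequest("listing", parse_pre1, meta={"item": pre})
    yield post   # the request is scheduled, so the chain reaches parse_info


buggy_items = crawl(parse_buggy)
fixed_items = crawl(parse_fixed)
print(buggy_items)
print(fixed_items)
```

In this model the buggy version produces a single empty item and stops, while the fixed version produces one populated item from the innermost callback, which matches the two corrections described in the answer.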