Scrapy callback function

I have a basic scrapy script that's doing the following:

  1. Visiting a website
  2. Using a rule to get all pages:

     rules = (
         Rule(
             LinkExtractor(allow=(), restrict_xpaths=('//*[@id="pagination_top"]/a',)),
             callback="parse_page",
             follow=True,
         ),
     )
  3. Within each page, getting all links to product pages:

     def parse_page(self, response):
         for href in response.css("#prod_category > ul > li > a::attr('href')"):
             url = response.urljoin(href.extract())
             yield scrapy.Request(url, callback=self.parse_dir_contents)
  4. Visiting each of the product pages to get details about the product. I then get additional details from a different link:

    def parse_dir_contents(self, response):
        # select xpath here
        print '________________________BEGIN PRODUCT________________________'
        item = detailedItem()
        item['title'] = sites.xpath('//*[@id="product-name"]/text()').extract()
        # get url_2 from this page
        request = scrapy.Request(url_2, callback=self.parse_detailed_contents)
        request.meta['item'] = item
        yield request
  5. Finally, here's the function that gets detailed information about the product:

    I think this last parse_detailed_contents is where I have an issue

    def parse_detailed_contents(self, response):
        item = response.meta['item']
        sel = Selector(response)
        sites = sel.xpath('//*[@id="prod-details"]')
        print '________________________GETTING DETAILS________________________'
        item['prod_details'] = sites.xpath('//*[@id="prod-details"]/div/text()').extract()
        return item

The problem is that my script returns item['prod_details'] for the first link but does not return any of the items for subsequent links.

Is that because the url_2 being passed is the same for all products?

Could someone please help? Thanks a lot in advance!

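Try adding dont_filter=True to the request. By default Scrapy drops requests for URLs it has already seen, so if url_2 is the same for every product, only the first request is scheduled and parse_detailed_contents runs only once.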

def parse_dir_contents(self, response):
 # select xpath here
 print '________________________BEGIN PRODUCT________________________'
 item = detailedItem()
 item['title'] = sites.xpath('//*[@id="product-name"]/text()').extract()

 # get url_2 from this page

 request = scrapy.Request(url_2, callback=self.parse_detailed_contents, dont_filter=True)
 request.meta['item'] = item
 yield request
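
To see why the first product gets its details and the others do not, here is a toy model of a scheduler with a duplicate filter. It is only a sketch and not Scrapy's actual code: the ToyScheduler class, the schedule_details helper, the product titles, and the example.com URL are all made up for illustration. It assumes, as the question suggests, that url_2 is identical for every product.

```python
# Toy model of duplicate filtering; NOT Scrapy's real implementation.
# ToyScheduler and schedule_details are hypothetical names for illustration.

class ToyScheduler:
    def __init__(self):
        self.seen = set()   # URLs that have already been scheduled
        self.queue = []     # requests that would actually be fetched

    def enqueue(self, url, dont_filter=False):
        """Return True if the request is scheduled, False if dropped as a duplicate."""
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


def schedule_details(scheduler, titles, details_url, dont_filter=False):
    """Schedule one details request per product; return the titles whose request was kept."""
    kept = []
    for title in titles:
        if scheduler.enqueue(details_url, dont_filter=dont_filter):
            kept.append(title)
    return kept


titles = ["A", "B", "C"]              # hypothetical product titles
url_2 = "http://example.com/details"  # hypothetical URL shared by all products

filtered = schedule_details(ToyScheduler(), titles, url_2)
unfiltered = schedule_details(ToyScheduler(), titles, url_2, dont_filter=True)

print(filtered)
print(unfiltered)
```

With filtering on, only the first product's request survives, which matches the behaviour described in the question. With dont_filter=True, every product gets its own request even though the URL repeats. The trade-off is that the same page is downloaded once per product.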
