
EDITED: How do I create a “Nested Loop” that returns an item to the original list in Python and Scrapy

EDIT:

Okay, so I've spent today trying to figure this out, but unfortunately I still haven't managed it. What I have now is this:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        yield scrapy.Request(response.url, callback = self.primary_parse)
        yield scrapy.Request(response.url, callback = self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

The issue is that I can't seem to get the second parse going; I only ever get one parse running.

Is there any way of having two parses going, either simultaneously or one after the other?


ORIGINAL:

I'm slowly getting the hang of this (Python and Scrapy), but I've hit a wall again. What I'm trying to do is the following:

There is a photographic retail site; it lists its products like this:

Name of Camera Body
Price

    With Such and Such Lens
    Price

    With Another Such and Such Lens
    Price

What I want to do is grab the information and organise it in a list like the one below (I have no trouble outputting to a CSV file):

product,price
camerabody1,$100
camerabody1+lens1,$200
camerabody1+lens1+lens2,$300
camerabody2,$150
camerabody2+lens1,$200
camerabody2+lens1+lens2,$250

My current spider code:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()       
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

This does not do what I want, and I have no idea what to do next. I've tried putting a for loop within the for loop, but that didn't work; it just output mixed-up results.

Also FYI, my items.py:

import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()

Any help would be appreciated. I'm trying my best to learn, but being new to Python, I feel I need some guidance.

It seems that the structure of the elements you are scraping calls for a loop within a loop, as your intuition says. Rearranging your code a little, you can yield every product together with all of its product+subproduct combinations.

I have renamed requests to product and introduced the subproduct variable for clarity. I guess the subproduct loop is the one you were trying to figure out.

def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield the primary item with its secondary items
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item

Of course, you still need to apply the uppercasing, the price clean-up, and so on to the corresponding fields.
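For instance, reusing the clean-up chain from your own spider, it might look roughly like this (clean_price is just an illustrative helper name, not part of your code):

def clean_price(raw):
    # Same clean-up chain as in the original spider
    return raw.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

# Inside the loops above you would then have something like:
#   item['product'] = product_name.upper()
#   item['price'] = clean_price(raw_price_text)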

Brief explanation:

Once the page is downloaded, the parse method is called with the Response object (the HTML page). From that Response we have to extract/scrape the data in the form of items. In this case we want to return a series of product-price items. Here is where the magic of the yield expression comes into action. You can think of it as an on-demand return that doesn't finish the execution of the function, i.e. a generator. Scrapy will keep consuming the parse generator until it has no more items to yield, and hence no more items to scrape from the Response.
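As a toy example outside Scrapy, this is roughly how a generator behaves (just an illustration of yield, not spider code):

def count_up_to(n):
    i = 1
    while i <= n:
        yield i  # hand one value back, then resume here on the next iteration
        i += 1

for number in count_up_to(3):
    print(number)  # prints 1, then 2, then 3, one value at a time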

Commented code:

def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item. That with the primary name and price
        yield item
        # Now, for each product, we need to loop through all the subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Let's prepare a new item with the subproduct appended to the previous
            # stored product_name, that is, product + subproduct.
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And set the item price field with the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item.
            yield item
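One design note: the loop above reuses the same item instance for every yield. Depending on your item pipelines, you may prefer to create a fresh ArcherItemGeorges() for each subproduct so that every yielded item is an independent object; a minimal sketch of that variant:

        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Assumption: build a new item per subproduct instead of mutating the one yielded above
            sub_item = ArcherItemGeorges()
            sub_item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            sub_item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield sub_item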

First, are you sure you are setting your items correctly?

item = ArcherItemGeorges()
item['product'] = product.strip().upper()
item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
# Should these be 'subproduct' and 'subprice' ? 
item['product'] = product + " " + subproduct.strip().upper()
item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
itemlist.append(item)
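If the intent is to fill all four fields of ArcherItemGeorges, the loop body would presumably look more like this (a sketch that keeps your original clean-up):

item = ArcherItemGeorges()
item['product'] = product.strip().upper()
item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
item['subproduct'] = subproduct.strip().upper()
item['subprice'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
itemlist.append(item)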

Second, you could think about making helper functions for tasks you do a lot; it looks a little cleaner.

def getDollars( price ): 
    return price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

# ... 
item['price'] = getDollars( price ) 
item['subprice'] = getDollars( subprice )
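If you prefer, the same clean-up can be written as a single regular-expression substitution with the standard re module (just an equivalent sketch of the helper above):

import re

def getDollars(price):
    # Remove '$', ',', '.00' and whitespace in one pass
    return re.sub(r'\$|,|\.00|\s', '', price)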
