
EDITED: How do I create a "Nested Loop" that returns an item to the original list in Python and Scrapy

EDIT:

Okay, so what I've been doing today is trying to figure this out; unfortunately, I still haven't managed it. What I have now is this:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self,response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        yield scrapy.Request(response.url, callback = self.primary_parse)
        yield scrapy.Request(response.url, callback = self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

The issue is, I can't seem to get the second parse going... I only ever get one parse to run.

Is there any way of having two parses run, either simultaneously or one after the other?
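For what it's worth, Scrapy's scheduler deduplicates requests by URL by default, so the second `scrapy.Request(response.url, ...)` above is silently dropped; passing `dont_filter=True` to the second request is the usual workaround. The behaviour can be sketched without Scrapy at all (the `seen` set below is a stand-in for Scrapy's dupefilter, not its actual implementation):

```python
# A minimal stand-in for Scrapy's duplicate-request filter: a request for
# an already-seen URL is dropped unless its dont_filter flag is set.
def schedule(requests):
    seen = set()
    accepted = []
    for url, callback, dont_filter in requests:
        if url in seen and not dont_filter:
            continue  # silently dropped, like Scrapy's default dupefilter
        seen.add(url)
        accepted.append(callback)
    return accepted

url = "http://www.georges.com.au/index.php/digital-slr-cameras/..."

# Two requests for the same URL: only the first callback survives.
assert schedule([(url, "primary_parse", False),
                 (url, "secondary_parse", False)]) == ["primary_parse"]

# Marking the second request dont_filter=True lets both callbacks run,
# which is what scrapy.Request(url, callback=..., dont_filter=True) does.
assert schedule([(url, "primary_parse", False),
                 (url, "secondary_parse", True)]) == ["primary_parse",
                                                      "secondary_parse"]
```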


ORIGINAL:

I'm slowly getting the hang of this (Python and Scrapy), but I've hit a wall. What I'm trying to do is the following:

There is a photographic retail site that lists its products like this:

Name of Camera Body
Price

    With Such and Such Lens
    Price

    With Another Such and Such Lens
    Price

What I want to do is grab the information and organise it in a list like the one below (I have no trouble outputting to a CSV file):

product,price
camerabody1,$100
camerabody1+lens1,$200
camerabody1+lens1+lens2,$300
camerabody2,$150
camerabody2+lens1,$200
camerabody2+lens1+lens2,$250
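The rows above can be produced with a plain nested loop: for each body, emit one row for the body on its own, then one row per kit option. A minimal sketch with made-up data standing in for the scraped values:

```python
# Hypothetical scraped data: each camera body has a base price and a list
# of (kit description, kit price) sub-options.
catalog = [
    ("camerabody1", "$100", [("lens1", "$200"), ("lens1+lens2", "$300")]),
    ("camerabody2", "$150", [("lens1", "$200"), ("lens1+lens2", "$250")]),
]

rows = []
for body, price, kits in catalog:        # outer loop: one pass per body
    rows.append((body, price))           # the body on its own
    for kit, kit_price in kits:          # inner loop: one pass per kit
        rows.append((body + "+" + kit, kit_price))

assert rows == [
    ("camerabody1", "$100"),
    ("camerabody1+lens1", "$200"),
    ("camerabody1+lens1+lens2", "$300"),
    ("camerabody2", "$150"),
    ("camerabody2+lens1", "$200"),
    ("camerabody2+lens1+lens2", "$250"),
]
```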

My current spider code:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()       
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

This does not do what I want, and I have no idea what to do next. I've tried a for loop within the for loop, but that didn't work and just output mixed-up results.

Also, FYI, my items.py:

import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()

Any help would be appreciated. I am trying my best to learn, but being new to Python, I feel I need some guidance.

It seems that the structure of the elements you are scraping calls for a loop within a loop, as your intuition says. Rearranging your code a little, you can get a list with the join of all the product-subproduct combinations.

I have renamed `request` to `product` and introduced the `subproduct` variable for clarity. I guess the subproduct loop is the one you were trying to figure out.

def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield the primary item combined with each of its secondary items
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Use a fresh item each time: mutating an item that has already
            # been yielded can corrupt the results emitted earlier
            item = ArcherItemGeorges()
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item

Of course, you still need to apply the uppercasing, price clean-up, etc. to the corresponding fields.

Brief explanation:

Once the page is downloaded, the parse method is called with the Response object (the HTML page). From that Response we have to extract/scrape the data in the form of items. In this case we want to return a list of product-price items. Here is where the magic of the yield expression comes into action. You can think of it as an on-demand return that doesn't finish the execution of the function, aka a generator. Scrapy will call the parse generator until it has no more items to yield, and hence no more items to scrape in the Response.
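The on-demand behaviour of yield can be seen with a toy generator that has nothing to do with Scrapy:

```python
def count_up_to(n):
    """A generator: each `yield` hands back one value on demand,
    pausing the function instead of finishing it."""
    i = 1
    while i <= n:
        yield i
        i += 1  # execution resumes here on the next request

gen = count_up_to(3)
assert next(gen) == 1    # runs until the first yield, then pauses
assert next(gen) == 2    # resumes after the yield, pauses again
assert list(gen) == [3]  # draining the rest exhausts the generator
```

Scrapy consumes your parse method the same way `list(gen)` does here: it keeps pulling items until the generator is exhausted.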

Commented code:

def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep it in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item, with the primary name and price
        yield item
        # Now, for each product, loop through all of its subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Prepare a new item with the subproduct appended to the previously
            # stored product_name, that is, product + subproduct. A fresh item
            # is used so the one already yielded above is not mutated.
            item = ArcherItemGeorges()
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And set the item price field to the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item
            yield item

First, are you sure you are setting your items correctly?

item = ArcherItemGeorges()
item['product'] = product.strip().upper()
item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
# Should these be 'subproduct' and 'subprice' ? 
item['product'] = product + " " + subproduct.strip().upper()
item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
itemlist.append(item)
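Since the second pair of assignments overwrites the first, the primary name and price are lost. If the intent was one item carrying both the primary and secondary data, the four declared fields could be filled separately; a runnable sketch with a plain dict standing in for the scrapy.Item subclass and literal strings in place of scraped values:

```python
# Stand-in for ArcherItemGeorges: a plain dict keeps this sketch runnable
# without Scrapy installed (scrapy.Item supports the same [] assignment).
item = {}
product, price = ' camerabody1 ', '$100'
subproduct, subprice = ' lens1 ', '$200'

item['product'] = product.strip().upper()
item['price'] = price.strip().replace('$', '')
# Distinct fields, so the primary values are not overwritten:
item['subproduct'] = subproduct.strip().upper()
item['subprice'] = subprice.strip().replace('$', '')

assert item == {'product': 'CAMERABODY1', 'price': '100',
                'subproduct': 'LENS1', 'subprice': '200'}
```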

Second, you could think about writing helper functions for tasks you do a lot; it looks a little cleaner:

def getDollars(price):
    return price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')

# ...
item['price'] = getDollars(price)
item['subprice'] = getDollars(subprice)
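The chain of replace calls works, but it is brittle: `.replace('.00', '')` would also mangle a price like `$1.009`. A sketch of the same whole-dollar clean-up using a regular expression instead (the `get_dollars` name is just for illustration):

```python
import re

def get_dollars(price):
    """Strip currency symbols, thousands separators and whitespace,
    then drop a trailing '.00', keeping only the dollar amount."""
    cleaned = re.sub(r'[$,\s]', '', price.strip())
    return re.sub(r'\.00$', '', cleaned)

assert get_dollars(' $1,234.00 ') == '1234'
assert get_dollars('$150') == '150'
assert get_dollars('$10.50') == '10.50'  # cents other than .00 survive
```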
