Pipeline.py to drop Value rather than Field

I'm currently working on a Scrapy script to pull product information from an Amazon page. The problem I'm running into is exception handling: I want to drop only the erroneous field, rather than the entire item/row, from my output.

Current Spider:

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["amazon.co.uk"]
    start_urls = [
        "http://www.amazon.co.uk/dp/B004YVOU9S",
        "http://www.amazon.co.uk/dp/B009NFE2QQ"
    ]

    def parse(self, response):

        sel = Selector(response)
        sites = sel.xpath('//div[contains(@class, "a-container")]')
        items = []

        for site in sites:
            item = Website()
            item['asin'] = response.url.split('/')[-1]
            item['title'] = site.xpath('div[@id="centerCol"]/div[@id="title_feature_div"]/div[@id="titleSection"]/h1[@id="title"]/span[@id="productTitle"]/text()').extract()
            item['description'] = site.xpath('//*[@id="productDescription"]/div/div[1]/text()').extract()[0].strip()
            item['price'] = site.xpath('//*[@id="priceblock_ourprice"]/text()').extract()
            item['image'] = site.xpath('//*[@id="landingImage"]/@data-a-dynamic-image').extract()
            item['brand'] = site.xpath('//*[@id="brand"]/text()').extract()
            item['bullets'] = site.xpath('//*[@id="feature-bullets"]/span/ul').extract()[0].strip()
            item['category'] = site.xpath('//*[@id="wayfinding-breadcrumbs_feature_div"]/ul').extract()[0].strip()
            item['details'] = site.xpath('//*[@id="prodDetails"]/div/div[1]/div/div/div[2]/div/div/table').extract()[0].strip()
            items.append(item)

        return items

When a scrape result is missing any of these fields, I currently get the error:

exceptions.IndexError: list index out of range
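
(For context: .extract() returns a list of matched strings, so indexing with [0] on an XPath that matched nothing is what raises this. A minimal, self-contained illustration using made-up HTML:)

    from scrapy.selector import Selector

    sel = Selector(text='<div id="productDescription">some text</div>')
    sel.xpath('//*[@id="productDescription"]/text()').extract()[0]  # 'some text'
    sel.xpath('//*[@id="no-such-id"]/text()').extract()[0]          # IndexError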

To combat this, I added some exception handling in the form of an IgnoreRequest.

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.exceptions import IgnoreRequest

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["amazon.co.uk"]
    start_urls = [
        "http://www.amazon.co.uk/dp/B004YVOU9S",
        "http://www.amazon.co.uk/dp/B009NFE2QQ"
    ]

    def parse(self, response):

        sel = Selector(response)
        sites = sel.xpath('//div[contains(@class, "a-container")]')
        items = []

        try:
            for site in sites:
                item = Website()
                item['asin'] = response.url.split('/')[-1]
                item['title'] = site.xpath('div[@id="centerCol"]/div[@id="title_feature_div"]/div[@id="titleSection"]/h1[@id="title"]/span[@id="productTitle"]/text()').extract()
                item['description'] = site.xpath('//*[@id="productDescription"]/div/div[1]/text()').extract()[0].strip()
                item['price'] = site.xpath('//*[@id="priceblock_ourprice"]/text()').extract()
                item['image'] = site.xpath('//*[@id="landingImage"]/@data-a-dynamic-image').extract()
                item['brand'] = site.xpath('//*[@id="brand"]/text()').extract()
                item['bullets'] = site.xpath('//*[@id="feature-bullets"]/span/ul').extract()[0].strip()
                item['category'] = site.xpath('//*[@id="wayfinding-breadcrumbs_feature_div"]/ul').extract()[0].strip()
                item['details'] = site.xpath('//*[@id="prodDetails"]/div/div[1]/div/div/div[2]/div/div/table').extract()[0].strip()
                items.append(item)

            return items

        except IndexError:
            raise IgnoreRequest("Data type not found.")

What I'd like to do is handle this error in a way that continues to output the rest of the spider's results, dropping only the field with no value rather than ignoring the entire item.

Any help would be greatly appreciated.

Item Loaders with input or output processors are what you need here.

Define an ItemLoader with a TakeFirst processor:

Returns the first non-null/non-empty value from the values received, so it's typically used as an output processor to single-valued fields. It doesn't receive any constructor arguments, nor accept Loader contexts.

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst

class ProductLoader(ItemLoader):

    default_output_processor = TakeFirst()

    # specific field loaders

Then, load the item with the loader:

for site in sites:
    l = ProductLoader(Website(), site)
    l.add_value('asin', response.url.split('/')[-1])
    l.add_xpath('title', 'div[@id="centerCol"]/div[@id="title_feature_div"]/div[@id="titleSection"]/h1[@id="title"]/span[@id="productTitle"]/text()')
    # ...

    yield l.load_item()
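
Putting it together, the spider's parse method could be rewritten around the loader. This is only a sketch built from the question's own XPaths (it assumes the dirbot Website item declares all of these fields). With TakeFirst, a field whose XPath matches nothing collects no values and is simply left off the loaded item, instead of raising IndexError:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst


class ProductLoader(ItemLoader):
    # first non-empty value wins; fields with no collected values are omitted
    default_output_processor = TakeFirst()


# inside the spider from the question
def parse(self, response):
    sel = Selector(response)
    for site in sel.xpath('//div[contains(@class, "a-container")]'):
        l = ProductLoader(Website(), site)
        l.add_value('asin', response.url.split('/')[-1])
        l.add_xpath('title', 'div[@id="centerCol"]/div[@id="title_feature_div"]/div[@id="titleSection"]/h1[@id="title"]/span[@id="productTitle"]/text()')
        l.add_xpath('description', '//*[@id="productDescription"]/div/div[1]/text()')
        l.add_xpath('price', '//*[@id="priceblock_ourprice"]/text()')
        l.add_xpath('image', '//*[@id="landingImage"]/@data-a-dynamic-image')
        l.add_xpath('brand', '//*[@id="brand"]/text()')
        # bullets, category and details follow the same add_xpath pattern
        yield l.load_item()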

There are different possible solutions. If you want to go with try/except and drop only a single field, you have to do this for every field:

    try:
        # extract a single field, e.g.:
        item['description'] = site.xpath('//*[@id="productDescription"]/div/div[1]/text()').extract()[0].strip()
    except IndexError:
        pass  # skip only this field; the item and its other fields survive

If you want an empty value instead of dropping the field, you have to check whether a value exists. You can define a separate extraction method:

    def get_value_from_node(self, node):
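        # .extract() returns a list; fall back to '' when nothing matched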
        value = node.extract()
        return value[0] if value else ''

and call this method for every field:

    item['title'] = self.get_value_from_node(site.xpath('div[@id="centerCol"]/div[@id="title_feature_div"]/div[@id="titleSection"]/h1[@id="title"]/span[@id="productTitle"]/text()'))

It will return either the extracted value or an empty string, so no exception handling is needed.
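
Applied to the other fields from the question that index into the extraction result (description, bullets, category, details), the same pattern looks like this sketch; calling .strip() on the empty-string fallback is harmless:

    item['description'] = self.get_value_from_node(site.xpath('//*[@id="productDescription"]/div/div[1]/text()')).strip()
    item['bullets'] = self.get_value_from_node(site.xpath('//*[@id="feature-bullets"]/span/ul')).strip()
    item['category'] = self.get_value_from_node(site.xpath('//*[@id="wayfinding-breadcrumbs_feature_div"]/ul')).strip()
    item['details'] = self.get_value_from_node(site.xpath('//*[@id="prodDetails"]/div/div[1]/div/div/div[2]/div/div/table')).strip()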
