
Gathering and storing data with scrapy

I'm pretty new to development and very new to Scrapy. I've gotten this far with the docs, but I've hit a wall I can't seem to get past. Below is the basic spider I have (URLs changed to protect the innocent).

The start URL contains a list of product categories; those link to pages with a list of subcategories, which in turn link to the product pages I want to parse.

My spider currently runs without errors and seems to fetch all the pages I want, but it never calls parse_product(). Here's the code:

# coding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector 
from scrapy.item import Item, Field

class MySpider(CrawlSpider):
    name = "MySpider"
    allowed_domains = ["www.mysite.com"]
    start_urls = ["http://www.mysite.com/start_page/",]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths='//body[@class="item"]'),       # Product page
             callback='parse_product'),
        Rule(SgmlLinkExtractor(restrict_xpaths="//ul[@id='products_ul']/li")), # First 2 pages - same xpath to follow.
    )

    def parse_product(self, response):
        print " parsing product" # Print statement added for testing purposes - test failed.
        hxs = HtmlXPathSelector(response)
        item = MySpiderItem()
        item['name']  = hxs.select('/xpath/to/name')
        item['description'] = hxs.select('/xpath/to/description' )
        item['image'] = hxs.select('/xpath/to/image')
        print item    # Print statement added for testing purposes - test failed.
        return item   


class MySpiderItem(Item):
    name = Field()
    description = Field()
    image = Field()

Questions:

1) Should this do what I want it to do?

OK, clearly it doesn't, that's why I'm here! But I'm not sure whether this is down to bad XPaths or to me calling parse_product incorrectly. E.g.: do I need that link extractor for the product pages? I'm not following links from there, but then how do I target them for parsing without it?

In theory, it should only ever get two types of page: category/subcategory pages with lists of links ("//ul[@id='products_ul']/li") to be followed, and product pages that need to be parsed (the only consistent identifier for these is <body class="mainpage version2 **item**"> vs <body class="mainpage version2 **category**">).
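
The only alternative I can picture is a single rule that follows everything, with a callback that checks the body class itself; roughly this (just a sketch of the idea, untested; parse_page is a made-up name):

    rules = (
        # follow every link; allowed_domains keeps the crawl on the site
        Rule(SgmlLinkExtractor(allow=()), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        # contains() because the body carries several classes, e.g. "mainpage version2 item"
        if hxs.select('//body[contains(@class, "item")]'):
            return self.parse_product(response)  # only product pages get parsed
        # category/subcategory pages just get their links followed by the rule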

2) How do I save the output to a csv (or any other simple text format for that matter)?

The documentation for this has me confused (though I'm sure this is down to my lack of understanding rather than poor documentation; on the whole it's excellent). It seems to send you round in circles, and it gives examples without saying which file the examples should go in.

I'm currently using this spider as a standalone file with $ scrapy runspider filename.py for ease of testing, but I'm happy to set it up as a full Scrapy project if that makes things easier.

OK, you haven't defined where it should go when a Rule matches, so by default it calls parse_product. If you don't want it to go into parse_product, you can specify any callback and it will go there instead: callback='parse_other' will send it to parse_other rather than parse_product.
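
For example (parse_other here is just a placeholder name):

    Rule(SgmlLinkExtractor(restrict_xpaths="//ul[@id='products_ul']/li"),
         callback='parse_other')   # responses for links matched by this rule go to parse_other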

Right now you are not setting up a Scrapy project, so you have to use Python's csv module.

The hint is: you can create the file and a writer object in the __init__ method and write each item to the file from parse_product.
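
Something like this, inside your MySpider class (a rough sketch; the file name output.csv is just an example):

    import csv  # put this at the top of the file

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.csv_file = open('output.csv', 'wb')      # example file name
        self.csv_writer = csv.writer(self.csv_file)
        self.csv_writer.writerow(['name', 'description', 'image'])  # header row

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        item = MySpiderItem()
        item['name'] = hxs.select('/xpath/to/name').extract()
        item['description'] = hxs.select('/xpath/to/description').extract()
        item['image'] = hxs.select('/xpath/to/image').extract()
        # write one row to the csv file for each product as it is parsed
        self.csv_writer.writerow([item['name'], item['description'], item['image']])
        return item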

If you want to set up a Scrapy project, Scrapy comes with default exporters; you just need to put these settings in settings.py:

FEED_EXPORTERS = {
    'csv': 'scrapy.contrib.exporter.CsvItemExporter',
}  # enabling the CSV exporter
FEED_FORMAT = 'csv'      # output format
FEED_URI = "output.csv"  # file name and path

Scrapy's built-in exporter will do the rest for you. Hope it helps.
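
Also, since you run the spider with scrapy runspider, you should be able to pass the same output options on the command line instead of settings.py, something like:

    scrapy runspider filename.py -o output.csv -t csv

where -o is the output file and -t is the feed format.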
