Scraping a website

Question

Hey, I've just started using Scrapy and was trying it out on a website, "diy.com", but I can't seem to get the CrawlSpider to follow links or scrape any data. I think it might be my regex, but I can't see anything wrong with it.

Any help will be appreciated.

from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from items import PartItem

class DIY_spider(CrawlSpider):
    name = 'diy_cat'
    allowed_domains = ['diy.com']

    start_urls =[
        "http://www.diy.com/nav/decor/tiles/wall-tiles"

    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/(nav)/(decor)/(\w*)/(.*)(\d*)$', ),deny=(r'(.*)/(jsp)/(.*)')), callback='parse_item',follow = True),

def parse_items(self, response):
        sel = Selector(response)
        tests =[]
        test = PartItem()

        if sel.xpath('//*[@id="fullWidthContent"]/div[2]/dl/dd[1]/ul[1]/li[3]/text()') :
            price = sel.xpath('//*[@id="fullWidthContent"]/div[2]/dl/dd[1]/ul[1]/li[3]/text()')
        else:
            price= sel.xpath('//dd[@class="item_cta"]/ul[@class="fright item_price"]/li/text()').extract()
        if not price:
           return test





return test

Answer 1

Your rule names parse_item as the callback, but the method you actually defined is called parse_items. Additionally, the indentation of the parse_items function is incorrect, but that could simply be a formatting issue from pasting the code in.
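To illustrate why the name mismatch matters: when a rule's callback is given as a string, it has to be resolved to a method on the spider by name. The sketch below mimics that lookup with plain `getattr`. `FakeSpider` and `resolve_callback` are hypothetical names made up for this example, and this is a simplified stand-in, not Scrapy's actual code.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider; not a real Scrapy class."""

    def parse_items(self, response):
        return "parsed " + response


def resolve_callback(spider, name):
    # Look the method up by its string name.
    # A name that does not exist on the spider resolves to None.
    return getattr(spider, name, None)


spider = FakeSpider()
missing = resolve_callback(spider, "parse_item")   # name used in the rule
found = resolve_callback(spider, "parse_items")    # name actually defined

print(missing)
print(found("page"))
```

With the misspelled name nothing callable is found, so the scraping code never runs. Renaming either the method or the string in the rule so that both agree removes the mismatch.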

Answer 2

Besides @Talvalin's note, you are not extracting the actual prices.

Try this version of parse_item:

def parse_item(self, response):
    sel = Selector(response)

    price_list = sel.xpath('//span[@class="onlyPrice"]/text()').extract()
    for price in price_list:
        if price:
            item = PartItem()
            item['price'] = price
            yield item
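The XPath idea above can be tried without running a crawl. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's `Selector`, since ElementTree supports the attribute predicate used here. The markup in `SAMPLE` and the `wasPrice` class are invented for illustration; the real diy.com page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; not copied from the real site.
SAMPLE = """
<div>
  <span class="onlyPrice">12.98</span>
  <span class="wasPrice">15.00</span>
  <span class="onlyPrice">8.50</span>
</div>
""".strip()


def extract_prices(markup):
    root = ET.fromstring(markup)
    # Select every <span> whose class attribute is exactly "onlyPrice",
    # skipping any that have no text.
    spans = root.findall(".//span[@class='onlyPrice']")
    return [span.text for span in spans if span.text]


prices = extract_prices(SAMPLE)
print(prices)
```

Each extracted string corresponds to one item yielded by parse_item. In the spider itself, Scrapy's selector does the same job on the downloaded response and also copes with HTML that is not well-formed, which ElementTree does not.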


Answer 1: score 0, posted 2014-04-02 15:35:47

Answer 2: score 0, accepted, posted 2014-04-02 15:37:52
