
Scraping a website with scrapy

Hey, I've just started using Scrapy and was trying it out on the website "diy.com", but I can't seem to get the CrawlSpider to follow links or scrape any data. I think it might be my regex, but I can't see anything wrong with it.

Any help will be appreciated.

from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from items import PartItem

class DIY_spider(CrawlSpider):
    name = 'diy_cat'
    allowed_domains = ['diy.com']

    start_urls =[
        "http://www.diy.com/nav/decor/tiles/wall-tiles"

    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/(nav)/(decor)/(\w*)/(.*)(\d*)$', ),deny=(r'(.*)/(jsp)/(.*)')), callback='parse_item',follow = True),

def parse_items(self, response):
        sel = Selector(response)
        tests =[]
        test = PartItem()

        if sel.xpath('//*[@id="fullWidthContent"]/div[2]/dl/dd[1]/ul[1]/li[3]/text()') :
            price = sel.xpath('//*[@id="fullWidthContent"]/div[2]/dl/dd[1]/ul[1]/li[3]/text()')
        else:
            price= sel.xpath('//dd[@class="item_cta"]/ul[@class="fright item_price"]/li/text()').extract()
        if not price:
           return test





return test

Your rule names parse_item as the callback, but the method you actually defined is called parse_items. Additionally, the indentation of the parse_items function is incorrect, though that could simply be a formatting issue from pasting the code in.
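A string callback in a Rule refers to a spider method by name, so a typo like this is easy to miss. Below is a minimal sketch of a name check that catches it. It deliberately does not use Scrapy: FakeRule, BrokenSpider, FixedSpider and missing_callbacks are hypothetical stand-ins made up for illustration, not part of the Scrapy API.

```python
class FakeRule:
    """Hypothetical stand-in for a crawl rule: it only stores the callback name."""

    def __init__(self, callback):
        self.callback = callback


class BrokenSpider:
    """Mirrors the question: the rule says parse_item, the method is parse_items."""

    rules = (FakeRule(callback="parse_item"),)

    def parse_items(self, response):
        return None


class FixedSpider:
    """Same rule, but the method name now matches the callback string."""

    rules = (FakeRule(callback="parse_item"),)

    def parse_item(self, response):
        return None


def missing_callbacks(spider_cls):
    """Return callback names used in rules that are not callable attributes of the class."""
    return [
        rule.callback
        for rule in spider_cls.rules
        if isinstance(rule.callback, str)
        and not callable(getattr(spider_cls, rule.callback, None))
    ]


print(missing_callbacks(BrokenSpider))
print(missing_callbacks(FixedSpider))
```

The same kind of check could be run against a real spider class before starting a crawl, assuming its rules expose the callback name in a similar way.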

Besides @Talvalin's note, you are not getting actual prices.

Try this version of parse_item:

def parse_item(self, response):
    sel = Selector(response)

    price_list = sel.xpath('//span[@class="onlyPrice"]/text()').extract()
    for price in price_list:
        if price:
            item = PartItem()
            item['price'] = price
            yield item
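As for the regex the question suspected, the allow and deny patterns can be checked on their own with Python's re module. This is only a sketch: it assumes the link extractor applies the patterns with search-style matching against the absolute URL, would_follow is a hypothetical helper, and the second and third URLs are invented for illustration (only the first comes from the question).

```python
import re

# Patterns copied verbatim from the rule in the question.
ALLOW = re.compile(r'/(nav)/(decor)/(\w*)/(.*)(\d*)$')
DENY = re.compile(r'(.*)/(jsp)/(.*)')


def would_follow(url):
    """True if the URL matches the allow pattern and does not match the deny pattern."""
    return bool(ALLOW.search(url)) and not DENY.search(url)


urls = [
    "http://www.diy.com/nav/decor/tiles/wall-tiles",       # start URL from the question
    "http://www.diy.com/nav/decor/tiles/jsp/product.jsp",  # made-up URL containing /jsp/
    "http://www.diy.com/nav/garden/sheds",                 # made-up URL outside /nav/decor/
]

for url in urls:
    print(url, would_follow(url))
```

Under that assumption the patterns accept the start URL, which fits the answers above: the callback name, not the regex, is the more likely cause.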
