
Python Recursive Scraping with Scrapy

I'm trying to make a scraper that will pull the links, titles, prices, and body of posts on Craigslist. I've been able to get the prices, but the spider returns every price on the page for each item, not just the price for that specific row. I'm also unable to get it to go to the next page and continue scraping.

This is the tutorial I am using - http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/

I've tried suggestions from this thread, but still can't make it work - Scrapy Python Craigslist Scraper

The page I'm trying to scrape is - http://medford.craigslist.org/cto/

In the price XPath, if I remove the // before span[@class="l2"], it returns no prices, but if I leave it in, it includes every price on the page.

For the rules, I've tried playing with the class tags, but the spider seems to stay stuck on the first page. I'm thinking I might need separate spider classes?

Here is my code:

#-------------------------------------------------------------------------------
# Name:        module1
# Purpose:
#
# Author:      CD
#
# Created:     02/03/2014
# Copyright:   (c) CD 2014
# Licence:     <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *
import sys

class PageSpider(BaseSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//span[@class="button next"]',)),
             callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"] | //span[@class="l2"]')

        for title in titles:
            item = CraigslistSampleItem()
            item['title'] = title.select("a/text()").extract()
            item['link'] = title.select("a/@href").extract()
            item['price'] = title.select('//span[@class="l2"]//span[@class="price"]/text()').extract()

            url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)


    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

The idea is simple: find all the paragraphs in the div with class="content". Then, from every paragraph, extract the link, the link text, and the price. Note that the select() method is currently deprecated, so use xpath() instead.

Here's a modified version of parse() method:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # grab each listing row individually
    rows = hxs.xpath('//div[@class="content"]/p[@class="row"]')

    for row in rows:
        item = CraigslistSampleItem()
        # the leading dot keeps the search relative to the current row
        link = row.xpath('.//span[@class="pl"]/a')
        item['title'] = link.xpath("text()").extract()
        item['link'] = link.xpath("@href").extract()
        item['price'] = row.xpath('.//span[@class="l2"]/span[@class="price"]/text()').extract()

        url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
        yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)
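
The key change is the leading dot in .//span[...]: an XPath that starts with // searches the whole document even when it is applied to a sub-selector, which is exactly why your original price expression matched every price on the page. With .// the search is restricted to the current row.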

This is a sample of what I'm getting:

{'description': [u"\n\t\tHave a nice, sturdy, compact car hauler/trailer.  May be used for other hauling like equipstment, ATV's and the like,   Very solid and in good shape.   Parice to sell at only $995.   Call Bill at 541 944 2929 top see or Roy at 541 9733421.   \n\t"],
 'link': [u'/cto/4354771900.html'],
 'price': [u'$995'],
 'title': [u'compact sturdy car trailer ']}
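
One thing the code above doesn't address is the paging problem. The rules attribute is only processed by CrawlSpider; BaseSpider ignores it entirely, which is why your spider never leaves the first page. A CrawlSpider also must not override parse(), because it uses that method internally to apply the rules, so the callback has to be renamed. Here's a minimal, untested sketch of how the whole spider could be restructured; the parse_page name and the widened index\d+\.html pattern are my own choices, not anything from your code or the tutorial:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from craigslist_sample.items import CraigslistSampleItem

class PageSpider(CrawlSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://medford.craigslist.org/cto/"]

    # Rules are only honoured by CrawlSpider; BaseSpider silently ignores them.
    # The callback must not be called "parse", because CrawlSpider uses
    # parse() internally to apply the rules.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index\d+\.html",),  # widened from index\d00\.html; adjust to the site's paging URLs
                               restrict_xpaths=('//span[@class="button next"]',)),
             callback="parse_page", follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider routes the responses for start_urls here,
        # so hand them to the same row-parsing callback.
        return self.parse_page(response)

    def parse_page(self, response):
        # same row-by-row extraction as the parse() method shown above
        hxs = HtmlXPathSelector(response)
        for row in hxs.xpath('//div[@class="content"]/p[@class="row"]'):
            item = CraigslistSampleItem()
            link = row.xpath('.//span[@class="pl"]/a')
            item['title'] = link.xpath('text()').extract()
            item['link'] = link.xpath('@href').extract()
            item['price'] = row.xpath('.//span[@class="l2"]/span[@class="price"]/text()').extract()
            url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['description'] = hxs.xpath('//section[@id="postingbody"]/text()').extract()
        return item

parse_start_url is the hook CrawlSpider calls for the responses to start_urls, so delegating it to parse_page ensures the first listing page gets scraped as well.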

Hope that helps.
