
Scrapy - Recursively scraping to third page

I am hoping my request is quite simple and straightforward for the more experienced Scrapy users out there.

In essence, the following code works well for scraping a second page based on a link found on the first page. I would like to extend it to scrape a 3rd page, using a link on the second page. With the code below, def parse_items handles the landing page (1st level), which contains 50 listings, and the code is set up to recursively scrape each of those 50 links. def parse_listing_page specifies which items to scrape from the "listing page". Within each listing page, I would like my script to follow a link through to another page and scrape an item or two before returning to the "listing page" and then back to the landing page.

The code below works well for recursively scraping at 2 levels. How could I expand this to 3 levels, building on the code below?

from scrapy import log
from scrapy.log import ScrapyFileLogObserver
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from firstproject.items import exampleItem
from scrapy.http import Request
import urlparse

logfile_info = open('example_INFOlog.txt', 'a')
logfile_error = open('example_ERRlog.txt', 'a')
log_observer_info = log.ScrapyFileLogObserver(logfile_info, level=log.INFO)
log_observer_error = log.ScrapyFileLogObserver(logfile_error, level=log.ERROR)
log_observer_info.start()
log_observer_error.start()

class MySpider(CrawlSpider):
    name = "example"

    allowed_domains = ["example.com.au"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("",), restrict_xpaths=('//li[@class="nextLink"]',)),
             callback="parse_items", follow=True),
    )

    def start_requests(self):
        start_urls = reversed([
            "http://www.example.com.au/1?new=true&list=10-to-100",
            "http://www.example.com.au/2?new=true&list=10-to-100",
            "http://www.example.com.au/2?new=true&list=100-to-200",
        ])

        return [Request(url=start_url) for start_url in start_urls]

    def parse_start_url(self, response):
        return self.parse_items(response)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select("//h2")
        for listing in listings:
            item = exampleItem()
            item["title"] = listing.select("a/text()").extract()[0]
            item["link"] = listing.select("a/@href").extract()[0]

            url = "http://example.com.au%s" % item["link"]
            yield Request(url=url, meta={'item': item}, callback=self.parse_listing_page)


    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']

        item["item_1"] = hxs.select('#censored Xpath').extract()
        item["item_2"] = hxs.select('#censored Xpath').extract()
        item["item_3"] = hxs.select('#censored Xpath').extract()
        item["item_4"] = hxs.select('#censored Xpath').extract()

        return item

Many thanks

This is how the flow of your code works.

The Rule in the MySpider class is applied first: its callback is set to parse_items, so pages matched by the rule are handled there. parse_items ends with a yield of a Request whose callback is parse_listing_page, and that is what takes the spider to the second level. If you want to go to a third level from parse_listing_page, a Request has to be yielded from parse_listing_page with a callback for that third page.
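For illustration, here is a minimal sketch of that third level, in the same style as the code above. parse_third_level, the extra field name, and the XPaths are placeholders, not names from the original code:

    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']

        # ... fill in the second-level fields here, as before ...

        # Link that leads to the third-level page (placeholder XPath).
        next_link = hxs.select('#censored Xpath').extract()[0]
        url = "http://example.com.au%s" % next_link

        # Do not return the item yet; hand it on to the third-level callback.
        yield Request(url=url, meta={'item': item}, callback=self.parse_third_level)

    def parse_third_level(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']

        item["third_level_field"] = hxs.select('#censored Xpath').extract()

        # The item is only emitted once the third level has been scraped.
        return item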

Here is my updated code. The code below pulls the counter_link in the right format (tested), but it seems the else branch is always taken, so parse_listing_counter is never called. If I remove the if/else and force the callback to parse_listing_counter, it doesn't yield any items at all (not even those from parse_items or the listing page).

What have I done wrong in my code? I've also checked the XPaths - all seem ok.

from scrapy import log
from scrapy.log import ScrapyFileLogObserver
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from firstproject.items import exampleItem
from scrapy.http import Request
import urlparse

logfile_info = open('example_INFOlog.txt', 'a')
logfile_error = open('example_ERRlog.txt', 'a')
log_observer_info = log.ScrapyFileLogObserver(logfile_info, level=log.INFO)
log_observer_error = log.ScrapyFileLogObserver(logfile_error, level=log.ERROR)
log_observer_info.start()
log_observer_error.start()

class MySpider(CrawlSpider):
    name = "example"

    allowed_domains = ["example.com.au"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("",), restrict_xpaths=('//li[@class="nextLink"]',)),
             callback="parse_items", follow=True),
    )

    def start_requests(self):
        start_urls = reversed([
            "http://www.example.com.au/1?new=true&list=10-to-100",
            "http://www.example.com.au/2?new=true&list=10-to-100",
            "http://www.example.com.au/2?new=true&list=100-to-200",
        ])

        return [Request(url=start_url) for start_url in start_urls]

    def parse_start_url(self, response):
        return self.parse_items(response)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select("//h2")
        for listing in listings:
            item = exampleItem()
            item["title"] = listing.select("a/text()").extract()[0]
            item["link"] = listing.select("a/@href").extract()[0]

            url = "http://example.com.au%s" % item["link"]
            yield Request(url=url, meta={'item': item}, callback=self.parse_listing_page)


    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']

        item["item_1"] = hxs.select('#censored Xpath').extract()
        item["item_2"] = hxs.select('#censored Xpath').extract()
        item["item_3"] = hxs.select('#censored Xpath').extract()
        item["item_4"] = hxs.select('#censored Xpath').extract()

        item["counter_link"] = hxs.selext('#censored Xpath').extract()[0]
        counter_link = response.meta.get('counter_link', None)
        if counter_link:
            url2 = "http://example.com.au%s" % item["counter_link"]
            yield Request(url=url2, meta={'item':item},callback=self.parse_listing_counter)
        else:
            yield item

    def parse_listing_counter(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']

        item["counter"] = hxs.select('#censored Xpath').extract()

        return item
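One likely cause of the else branch always being taken: the Request yielded from parse_items only puts 'item' into meta, so response.meta.get('counter_link', None) is always None inside parse_listing_page. Checking the value that was just extracted instead would look roughly like this (same placeholder XPath as above):

        item["counter_link"] = hxs.select('#censored Xpath').extract()[0]

        # Check the value just extracted, not response.meta (which has no
        # 'counter_link' key for this request).
        if item["counter_link"]:
            url2 = "http://example.com.au%s" % item["counter_link"]
            yield Request(url=url2, meta={'item': item}, callback=self.parse_listing_counter)
        else:
            yield item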
