
Scraping/crawling multiple pages

Up to now I have found how to scrape one page, or multiple pages with the same URL but a changing number. However, I could not find how to scrape pages with subcategories, then their subcategories, and finally get the content I need. I am trying to scrape this website: http://www.askislam.org/index.html I am using Scrapy, but I do not know where to start. Or you can suggest a better option; I just use Python and will take it from there.

Thanks

# -*- coding: utf-8 -*-
from scrapy.spiders                 import CrawlSpider, Rule, Spider
from scrapy.linkextractors          import LinkExtractor
from scrapy                         import Selector
from ask_islam.items                import AskIslamItem
from scrapy.http                    import Request

class AskislamSpider(Spider):
    name = "askislam"
    allowed_domains = ["askislam.org"]
    start_urls = ['http://www.askislam.org/']
    # Note: Rule/LinkExtractor rules are only honoured by CrawlSpider
    # subclasses; this class extends Spider, so the line below has no effect.
    rules = [Rule(LinkExtractor(allow = ()), callback = 'parse', follow=True)]

    def parse(self, response):
        hxs = Selector(response)
        links = hxs.css('div[id="categories"] li a::attr(href)').extract()
        for link in links:
            url = 'http://www.askislam.org' + link.replace('index.html', '')
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        hxs = Selector(response)
        # Extract the href of each sub-category link, not the whole <li>
        # markup; calling .replace() on the raw HTML string would build
        # broken URLs.
        categories = hxs.css('div[id="categories"] li a::attr(href)').extract()
        if categories:
            for categoryLink in categories:
                url = 'http://www.askislam.org/' + categoryLink.replace('index.html', '')
                yield Request(url, callback=self.parse_page)

EDIT

import logging

def start_requests(self):
    yield Request("http://www.askislam.org", callback=self.parse_page)

def parse_page(self, response):
    hxs = Selector(response)
    categories = hxs.css('#categories li')
    for cat in categories:
        item = AskIslamItem()
        link = cat.css('a::attr(href)').extract()[0]
        link = "http://www.askislam.org/" + link

        item['catLink'] = link

        logging.info("Scraping Link: %s" % (link))

        yield Request(link, callback=self.parse_page)
        # Scrapy filters duplicate requests by default, so this second
        # Request to the same URL would be dropped without dont_filter=True.
        yield Request(link, callback=self.parse_categories, dont_filter=True)

def parse_categories(self, response):
    logging.info("The Cat Url")

Read the links of those sub-categories from the http://www.askislam.org/index.html page using XPath or CSS selectors, and then issue another Request() for each one.
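The same link-extraction idea can be illustrated outside Scrapy with the standard library's html.parser. The HTML snippet below is made up for illustration; the real page's markup may differ:

```python
from html.parser import HTMLParser

class CategoryLinkParser(HTMLParser):
    """Collect href values of anchors inside the div with id="categories"."""

    def __init__(self):
        super().__init__()
        self.in_categories = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id") == "categories":
            self.in_categories = True
        elif tag == "a" and self.in_categories and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Simplification: assumes no nested <div> inside #categories.
        if tag == "div":
            self.in_categories = False

# Hypothetical markup standing in for the real category listing.
html = ('<div id="categories"><ul>'
        '<li><a href="beliefs/index.html">Beliefs</a></li>'
        '<li><a href="practices/index.html">Practices</a></li>'
        '</ul></div>')

parser = CategoryLinkParser()
parser.feed(html)
print(parser.links)  # ['beliefs/index.html', 'practices/index.html']
```

In the Scrapy spider, `response.css('#categories li a::attr(href)').extract()` collects the same list in one line; each href then becomes the URL of the next Request.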

EDIT:

import logging

from scrapy.spiders import Spider
from scrapy.http import Request

class AskislamSpider(Spider):

    name = "askislam"

    def start_requests(self):

        yield Request("http://www.askislam.org/", callback=self.parse_page)

    def parse_page(self, response):
        # Keep the SelectorList here (no .extract()); extracting first would
        # give plain strings, which have no .css() method.
        categories = response.css('#categories li')
        for cat in categories:
            link = cat.css("a::attr(href)").extract()[0]
            link = "http://www.askislam.org/" + link

            logging.info("Scraping Link: %s" % (link))

            yield Request(link, callback=self.parse_page)
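One caveat with the snippets above: prefixing `"http://www.askislam.org/"` by string concatenation breaks as soon as an href is relative to a deeper page. `urljoin` resolves a link against the page it came from (Scrapy's `response.urljoin()` wraps the same function). The paths below are hypothetical examples:

```python
from urllib.parse import urljoin

# A sub-category page (hypothetical path) linking to a deeper page with a
# relative href such as "zikr/index.html".
base = "http://www.askislam.org/subjects/index.html"

print(urljoin(base, "zikr/index.html"))
# http://www.askislam.org/subjects/zikr/index.html

# Root-relative hrefs resolve against the domain instead:
print(urljoin(base, "/faq/index.html"))
# http://www.askislam.org/faq/index.html
```

Inside a callback, `yield Request(response.urljoin(link), callback=self.parse_page)` handles both cases without manual prefixing.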
