
Scrapy needs to crawl all the "next" links on the website and move on to the next page

I need my Scrapy spider to move on to the next page. Please give me the correct code for the rule; how should I write it?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from delh.items import DelhItem

class criticspider(CrawlSpider):
    name ="delh"
    allowed_domains =["consumercomplaints.in"]
    #start_urls =["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls=["http://www.consumercomplaints.in/?search=delhivery"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
             callback="parse_gen", follow=True),
    )

    def parse_gen(self, response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []

        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items
spider = criticspider()

From my understanding you are trying to scrape two sorts of pages, so you should use two distinct rules:

  • paginated list pages, containing links to n item pages and to subsequent list pages
  • item pages, from which you scrape your items

Your rules should then look something like this:

rules = (
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
)

Explanations:

  • The first rule matches item links and uses your item parsing method (parse_gen) as its callback. The resulting responses do not go through the rules again.
  • The second rule matches the "pagelinks" and does not specify a callback; the resulting responses are therefore handled by the rules again (a complete spider sketch follows this list).
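
For reference, here is how the two rules and your callback could fit together in a complete spider. This is only a sketch: it uses the newer scrapy.spiders / scrapy.linkextractors import paths (the scrapy.contrib equivalents behave the same way), and the item-link XPath (//td[@class="complaint"]/a) is guessed from the selectors in your own parse_gen, so you may need to adjust it to the actual markup of the complaint pages:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from delh.items import DelhItem


class CriticSpider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Item rule first: complaint links are handled by parse_gen and
        # those responses are not run through the rules again.
        Rule(LinkExtractor(restrict_xpaths='//td[@class="complaint"]/a'),
             callback="parse_gen"),
        # Pagination rule: no callback, so the next list page is fed back
        # through the rules; restricting it to the "Next" link means each
        # list page is requested exactly once.
        Rule(LinkExtractor(
            restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
    )

    def parse_gen(self, response):
        # parse_gen now runs on a single complaint page, so it yields one
        # item; the XPaths are copied from the question and may need to be
        # adapted to the detail-page markup.
        item = DelhItem()
        item["title"] = response.xpath(
            '//td[@class="complaint"]/a/span/text()').extract()
        item["content"] = response.xpath(
            '//td[@class="compl-text"]/div/text()').extract()
        yield item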

Notes:

  • SgmlLinkExtractor is deprecated; you should use LxmlLinkExtractor (or its alias LinkExtractor) instead (source).
  • The order in which you send out your requests does matter, and in this sort of situation (scraping an unknown, potentially large number of pages/items) you should try to reduce the number of pages being processed at any given time. To this end I've modified your code in two ways:
    • Scrape the items from the current list page before requesting the next one; this is why the item rule comes before the "pagelinks" rule.
    • Avoid crawling a page several times over; this is why I added the [contains(text(), "Next")] condition to the "pagelinks" rule. This way each list page gets requested exactly once (you can verify the XPath beforehand, as shown below).
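
If you want to confirm that the pagination XPath matches only the "Next" link before launching the full crawl, you can test it interactively with scrapy shell against your start URL; it should return just the link to the following list page:

scrapy shell "http://www.consumercomplaints.in/?search=delhivery"
>>> response.xpath('//div[@class="pagelinks"]/a[contains(text(), "Next")]/@href').extract()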
