
Scrapy needs to crawl all the next links on website and move on to the next page

I need my Scrapy spider to move on to the next page. How should the rule be written? Please give me the correct code.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from delh.items import DelhItem

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    # start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=3",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=4",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=5",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=6",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=7",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=8",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=9",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=10",
    #               "http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
             callback="parse_gen", follow=True),
    )

    def parse_gen(self, response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []

        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items

spider = criticspider()

From my understanding you are trying to scrape two sorts of pages, hence you should use two distinct rules:

  • paginated list pages, containing links to n item pages and to subsequent list pages
  • item pages, from which you scrape your items

Your rules should then look something like:

rules = (
    # restrict_xpaths should point at the elements containing the links,
    # not at the @href attributes themselves
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
)

Explanations:

  • The first rule matches item links and uses your item parsing method (parse_gen) as its callback. The resulting responses do not go through these rules again.
  • The second rule matches the "pagelinks" and does not specify a callback, so the resulting responses will be handled by these rules again (a fuller spider sketch follows below).
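Putting the pieces together, here is a minimal sketch of the whole spider with these two rules. The item-link xpath (//td[@class="complaint"]) is an assumption derived from the selectors in your parse_gen; note also that parse_gen now runs on the complaint detail pages rather than the list pages, so its selectors may need adjusting to that markup:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor  # alias of LxmlLinkExtractor
from scrapy.selector import Selector

from delh.items import DelhItem

class CriticSpider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Item rule first: extract complaint links from the current list
        # page. The xpath below is an assumption; point it at the cells
        # that actually contain the complaint links. A rule with a
        # callback does not follow links by default.
        Rule(LinkExtractor(restrict_xpaths='//td[@class="complaint"]'),
             callback="parse_gen"),
        # Pagination rule: only the "Next" link, so each list page is
        # requested exactly once. No callback, so the responses are
        # matched against these rules again (follow defaults to True).
        Rule(LinkExtractor(
            restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
    )

    def parse_gen(self, response):
        # Runs on each complaint page; selectors are kept from your
        # original code and may need adjusting to the detail-page markup.
        sel = Selector(response)
        item = DelhItem()
        item["title"] = sel.xpath('//td[@class="complaint"]/a/span/text()').extract()
        item["content"] = sel.xpath('//td[@class="compl-text"]/div/text()').extract()
        return item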

Notice:

  • SgmlLinkExtractor is obsolete and you should use LxmlLinkExtractor (or its alias LinkExtractor) instead (see the Scrapy documentation).
  • The order in which you send out your requests does matter and, in this sort of situation (scraping an unknown, potentially large, number of pages/items), you should seek to reduce the number of pages being processed at any given time. To this end I've modified your code in two ways:
    • scrape the items from the current list page before requesting the next one, which is why the item rule comes before the "pagelinks" rule.
    • avoid crawling a page several times over, which is why I added the [contains(text(), "Next")] selector to the "pagelinks" rule. This way each list page gets requested exactly once, as the small check below illustrates.
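To see the effect of that selector, here is a small standalone check against made-up pagination markup (the real pagelinks HTML may differ):

from scrapy.selector import Selector

# Made-up pagination markup, for illustration only.
html = """
<div class="pagelinks">
  <a href="/?search=delhivery&page=1">1</a>
  <a href="/?search=delhivery&page=2">2</a>
  <a href="/?search=delhivery&page=2">Next</a>
</div>
"""

sel = Selector(text=html)

# Without the text filter, the xpath matches every pagination link,
# including numbered links to pages that "Next" already reaches:
print(sel.xpath('//div[@class="pagelinks"]/a/@href').extract())

# With the [contains(text(), "Next")] filter, only the "Next" link
# matches, so each list page is reached through a single link:
print(sel.xpath('//div[@class="pagelinks"]/a[contains(text(), "Next")]/@href').extract())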
