Scrapy needs to crawl all the "next" links on a website and move on to the next page

Question

I need my Scrapy spider to move on to the next page. Please give me the correct code for the rule. How should I write it?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from delh.items import DelhItem

class criticspider(CrawlSpider):
    name ="delh"
    allowed_domains =["consumercomplaints.in"]
    #start_urls =["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls=["http://www.consumercomplaints.in/?search=delhivery"]
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),           
              callback="parse_gen", follow= True),
    )
    def parse_gen(self,response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []

        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items
spider=criticspider()

Answer 1

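As I understand it, you are trying to scrape two kinds of pages, so you should use two distinct rules:

- paginated list pages, which contain links to n item pages and to the subsequent list pages
- item pages, from which you scrape the items

Your rules should then look something like this: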

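from scrapy.linkextractors import LinkExtractor

rules = (
    # restrict_xpaths must select elements (regions of the page), not @href attributes
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
)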

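Explanation:

- The first rule matches the item links and uses your item parsing method (parse_gen) as the callback. Because a callback is set, follow defaults to False, so the resulting responses do not go through these rules again.
- The second rule matches the "pagelinks" and does not specify a callback, so follow defaults to True and the resulting responses are then handled by these rules.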

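Notes:

- SgmlLinkExtractor is deprecated; you should use LxmlLinkExtractor (or its alias LinkExtractor) instead.
- The order in which requests are sent out does matter. In this kind of situation (scraping an unknown, potentially large, number of pages/items) you should try to reduce the number of pages being processed at any given time. To this end I modified your code in two ways:
  - Scrape the items from the current list page before requesting the next one. This is why the item rule comes before the "pagelinks" rule.
  - Avoid crawling a page several times over. This is why I added the [contains(text(), "Next")] predicate to the "pagelinks" rule. This way each list page gets requested exactly once.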

Solution 1 (2015-02-07 17:52:49)
