Making Scrapy move to the next page recursively

Question

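I am trying to scrape this page with Scrapy. I can successfully scrape the data on the page, but I also want to scrape the data from the other pages (the ones reached through the "Next" link). Here is the relevant part of my code: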

def parse(self, response):
    item = TimemagItem()
    item['title']= response.xpath('//div[@class="text"]').extract()
    links = response.xpath('//h3/a').extract()
    crawledLinks=[]
    linkPattern = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

    for link in links:
        if linkPattern.match(link) and not link in crawledLinks:
            crawledLinks.append(link)
        yield Request(link, self.parse)

    yield item

I get the right information (the titles from the linked pages), but the spider simply is not "navigating". How do I tell Scrapy to navigate?

Answer 1

Take a look at the Scrapy Link Extractors documentation. They are the correct way to tell your spider to follow the links on a page.

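Looking at the page you want to crawl, I believe you should use 2 extractor rules. Here is an example of a simple spider with rules that suit your TIME web page: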

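# Older Scrapy releases exposed these under scrapy.contrib (with SgmlLinkExtractor).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# TimemagItem is the item class from the question; import it from your own project.

class TIMESpider(CrawlSpider):
    name = "time_spider"
    allowed_domains = ["time.com"]
    start_urls = [
        'http://search.time.com/results.html?N=45&Ns=p_date_range|1&Ntt=&Nf=p_date_range%7cBTWN+19500101+19500130'
    ]

    rules = (
        # Rule 1: article links in the result list; scrape each one.
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="tout"]/h3/a',)),
             callback='parse_item'),
        # Rule 2: the "Next" link; follow it so the rules run on every result page.
        Rule(LinkExtractor(restrict_xpaths=('//a[@title="Next"]',)),
             follow=True),
    )

    # Not named parse: CrawlSpider uses parse() itself to apply the rules.
    def parse_item(self, response):
        item = TimemagItem()
        item['title'] = response.xpath('.//title/text()').extract()

        return item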

Solution 1
3 2014-10-31 19:51:51
