Q: Scrapy: the next page is not scraped, but the spider seems to be following links

Question

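I am trying to learn Python and Scrapy, but I am having trouble with CrawlSpider. The code below works for me. It gets all links on the start URL that match the XPath //div[@class="info"]/h3/a/@href and passes those links to the function parse_dir_contents.

What I need now is for the crawler to move on to the next page. I tried using rules and a LinkExtractor, but cannot get it to work. I also tried using //a/@href as the XPath in the parse function, but then the links are not passed on to parse_dir_contents. I think I am missing something simple. Any ideas?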

class ypSpider(CrawlSpider):
    name = "ypTest"
    download_delay = 2
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        Rule(LinkExtractor(allow=['restaurants?page=[1-2]']), callback="parse")
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div[@id="mip"]'):
            item = ypItem()
            item['url'] = response.url
            item['business'] = sel.xpath('//div/div/h1/text()').extract()
            # ---extra items here---
            yield item

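Edit: here is the updated code with three functions, which manages to scrape 150 items. I think the problem is in my rules, but I have tried several things I expected to work and the output is still the same.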

class ypSpider(CrawlSpider):
    name = "ypTest"
    download_delay = 2
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        Rule(LinkExtractor(allow=[r'restaurants\?page\=[1-2]']), callback='parse')
    ]

    def parse(self, response):
        for href in response.xpath('//a/@href'):
            url = response.urljoin(href.extract())
            if 'restaurants?page=' in url:
                yield scrapy.Request(url, callback=self.parse_links)

    def parse_links(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        for sel in response.xpath('//div[@id="mip"]'):
            item = ypItem()
            item['url'] = response.url
            item['business'] = sel.xpath('//div/div/h1/text()').extract()
            item['phone'] = sel.xpath('//div/div/section/div/div[2]/p[3]/text()').extract()
            item['street'] = sel.xpath('//div/div/section/div/div[2]/p[1]/text()').re(r'(.+)\,')
            item['city'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'(.+)\,')
            item['state'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'\,\s(.+)\s\d')
            item['zip'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'(\d+)')
            item['category'] = sel.xpath('//dd[@class="categories"]/span/a/text()').extract()
            yield item

Answer 1

CrawlSpider uses the parse method for its own purposes. Rename your parse() to something else, change the callback in rules[] to match, and try again.
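To make the reasoning concrete, here is a small toy model of the problem. It deliberately does not import Scrapy: ToyCrawlSpider, BrokenSpider and FixedSpider are hypothetical stand-ins, and the substring matching is a simplification, not how Scrapy's rules are actually implemented. It only illustrates the general point that overriding a method the base class depends on silently disables the rule handling.

```python
# Toy model only: these classes are NOT Scrapy's implementation.
class ToyCrawlSpider:
    """Stand-in for a base class whose parse() drives rule processing."""

    rules = ()  # sequence of (substring, callback_name) pairs

    def parse(self, links):
        # The base class's own parse(): match each link against the rules
        # and dispatch it to the callback named by the first matching rule.
        results = []
        for link in links:
            for pattern, callback_name in self.rules:
                if pattern in link:
                    results.append(getattr(self, callback_name)(link))
                    break
        return results


class BrokenSpider(ToyCrawlSpider):
    rules = (("page=", "parse"),)

    def parse(self, links):
        # Overrides the base method, so the rule loop above never runs.
        return []


class FixedSpider(ToyCrawlSpider):
    rules = (("page=", "parse_links"),)

    def parse_links(self, link):
        # Renamed callback: the base parse() is intact and still dispatches.
        return "listing:" + link


links = ["/restaurants?page=2", "/about"]
broken_result = BrokenSpider().parse(links)
fixed_result = FixedSpider().parse(links)
print(broken_result)  # []
print(fixed_result)   # ['listing:/restaurants?page=2']
```

In the real spider the equivalent change would be to rename parse to something like parse_links and set callback='parse_links' in the rule.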

Answer 2

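I know it is very late to answer this question, but I managed to solve the problem and am posting the answer because it may help others who, like me, were initially confused about how to use Scrapy's Rule and LinkExtractor.

Here is my working code: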

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ypSpider(CrawlSpider):
    name = "ypTest"
    allowed_domains = ["yellowpages.com"]
    start_urls = ['http://www.yellowpages.com/new-york-ny/restaurants']
    rules = (
        # Follow every pagination link.
        Rule(LinkExtractor(allow=[r'restaurants\?page=\d+']), follow=True),
        # Extract each restaurant detail link and pass the response to `parse_item`.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='scrollable-pane']//a[@class='business-name']"), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'url' : response.url
        }

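So this is how I came to understand how Rule and LinkExtractor work in this case.

The first Rule follows all the pagination links. The allow parameter of LinkExtractor takes regular expressions and passes on only the links that match them. Here that means only links containing the pattern restaurants\?page=\d+, where \d+ stands for one or more digits. No callback is given for this rule, so these pages are not handed to a parsing method of mine; CrawlSpider's built-in handling follows them and applies the rules to them again. (I could have used the restrict_xpaths parameter here to select only links in a specific region of the HTML instead of the allow parameter, but I used allow to see how it works with a regex.)

The second Rule collects the detail link of every restaurant and parses those pages with the parse_item method. In this Rule we use the restrict_xpaths parameter, which defines the regions of the response from which links should be extracted. Here we only look inside the div with the class scrollable-pane, and only take links with the class business-name, because if you inspect the HTML you will find several links to the same restaurant, with different query parameters, inside the same div. Finally, we pass the callback method parse_item.

Now when I run this spider it fetches the details of all the restaurants for this search (restaurants in New York, NY), 3030 in total.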
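A quick way to check which pagination URLs such a pattern accepts is to try it with Python's re module directly. This sketch assumes that the allow patterns are applied as regular-expression searches against each absolute URL, which is my understanding of LinkExtractor; the URL list is made up for illustration. It also shows why the unescaped pattern from the first version of the question cannot match: there, `?` makes the preceding `s` optional instead of matching a literal question mark.

```python
import re

# Hypothetical sample URLs in the shape used by the question.
urls = [
    "http://www.yellowpages.com/new-york-ny/restaurants?page=2",
    "http://www.yellowpages.com/new-york-ny/restaurants?page=17",
    "http://www.yellowpages.com/new-york-ny/restaurants",
]

# Unescaped '?' means "optional s", so this looks for "restaurantspage=".
unescaped = re.compile(r"restaurants?page=[1-2]")
# Escaped '\?' matches the literal '?' that starts the query string.
escaped = re.compile(r"restaurants\?page=\d+")

unescaped_hits = [u for u in urls if unescaped.search(u)]
escaped_hits = [u for u in urls if escaped.search(u)]

print(unescaped_hits)  # []
print(escaped_hits)    # the two URLs that carry ?page=<digits>
```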

2 answers

Answer 1
1 2016-02-04 16:43:12

Answer 2
0 2019-08-06 10:53:05
