Scrapy crawl spider does not crawl anything

Question

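I am trying to scrape Booking.com. The spider opens and closes without opening or crawling any URL. (Output screenshot: https://i.stack.imgur.com/9hDt6.png) I am new to Python and Scrapy. Here is the code I have written so far. Please point out what I am doing wrong.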

import scrapy
import urllib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.loader import ItemLoader
from CinemaScraper.items import CinemascraperItem


class trip(CrawlSpider):
 name="tripadvisor"

def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


def parse(self, response):
        reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href')
        url = response.urljoin(reviewsurl[0].extract())
        self.pageNumber = 1
        return scrapy.Request(url, callback=self.parse_reviews)


def parse_reviews(self, response):
     for rev in response.xpath('//li[starts-with(@class,"review_item")]'):
            item =CinemascraperItem()
            #sometimes the title is empty because of some reason, not sure when it happens but this works
            title = rev.xpath('.//*[@class="review_item_header_content"]/span[@itemprop="name"]/text()')
            if title:
                item['title'] = title[0].extract()
                positive_content = rev.xpath('.//p[@class="review_pos"]//span/text()')
                if positive_content:
                    item['positive_content'] = positive_content[0].extract()
                negative_content = rev.xpath('.//p[@class="review_neg"]/span/text()')
                if negative_content:
                    item['negative_content'] = negative_content[0].extract()
                item['score'] = rev.xpath('./*[@class="review_item_header_score_container"]/span')[0].extract()
                #tags are separated by ;
                item['tags'] = ";".join(rev.xpath('.//ul[@class="review_item_info_tags/text()').extract())
                yield item

     next_page = response.xpath('//a[@id="review_next_page_link"]/@href')
     if next_page:
      url = response.urljoin(next_page[0].extract())
      yield scrapy.Request(url, self.parse_reviews)

Answer 1

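I'd like to point out that your question mentions booking.com, but the links in your spider point to quotes.toscrape.com, the site used in the official Scrapy tutorial. I'll keep using the quotes site for the sake of explanation.

OK, here we go. In your snippet you are using a crawl spider, and it is worth mentioning that the parse function is already part of the logic behind CrawlSpider. So rename your parse method to something else, such as parse_item, which is the default name of the initial callback when you generate a crawl spider, although you can name it whatever you like. Once you do that, I believe the spider should actually crawl the site, but that depends on the rest of your code being correct.

In a nutshell, the difference between a generic spider and a crawl spider is that a crawl spider uses link extractors and rules. The rules set parameters so that, starting from the start URL, links matching a given pattern are followed to navigate through the pages, with various helpful arguments to control this. The last piece of a rule is the callback you pass in. In other words, the crawl spider builds the request logic for navigation for you.

Notice that in the rule set I entered 'page/.*'. The '.*' is a regular expression, and the rule says: "links matching the pattern page/... will be followed AND passed to the callback parse_item."

This is a super simple example. You can also write patterns that only follow links, or that only call back to your item-parsing function.

With a generic spider, you have to work out the site navigation manually to get the content you want.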

Crawl spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from quotes.items import QuotesItem

class QcrawlSpider(CrawlSpider):
    name = 'qCrawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'page/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = QuotesItem()
        item['quote'] =response.css('span.text::text').extract()
        item['author'] = response.css('small.author::text').extract()
        yield item

Generic spider

import scrapy
from quotes.items import QuotesItem

class QspiSpider(scrapy.Spider):
    name = "qSpi"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = QuotesItem()
            item['quote'] = quote.css('span.text::text').extract()
            item['author'] = quote.css('small.author::text').extract()
            item['tags'] = quote.css("div.tags > a.tag::text").extract()
            yield item

        for nextPage in response.css('li.next a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(nextPage))


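EDIT: additional information at the OP's request

"...I can't understand how to add arguments to the Rule parameters"

All right, let's look at the official documentation, just to restate the definition of a crawl spider...

So a crawl spider builds the logic for following links from its rule set. Now let's say I want to use a crawl spider to scrape craigslist for household items for sale only. I want you to notice the parts marked in red.

Number one shows where I am when I'm on the craigslist household items page.

So we gather that anything under "search/hsh..." will be a page of the household items listing, starting from the first page, which is the landing page.

The big red number "2" shows what we see when we are on an actual posted item: all items seem to have ".../hsh/..." in the URL, so any link on the previous page that follows this pattern is one I want to follow and scrape. So my spider would look something like this: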

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from craigListCrawl.items import CraiglistcrawlItem

class CcrawlexSpider(CrawlSpider):
    name = 'cCrawlEx'
    allowed_domains = ['columbia.craigslist.org']
    start_urls = ['https://columbia.craigslist.org/']

    rules = (
        Rule(LinkExtractor(allow=r'search/hsa.*'), follow=True),
        Rule(LinkExtractor(allow=r'hsh.*'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = CraiglistcrawlItem()
        item['title'] = response.css('title::text').extract()
        item['description'] = response.xpath("//meta[@property='og:description']/@content").extract()
        item['followLink'] = response.xpath("//meta[@property='og:url']/@content").extract()
        yield item

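I want you to think of it as the steps you take to get from the landing page to the page that holds the content. We land on our start_url page, and then, as you can see in the first rule:

Rule(LinkExtractor(allow=r'search/hsa.*'), follow=True)

This says: follow links that match the regular expression pattern 'search/hsa.*'. Remember that '.*' is a regular expression that, at least in this case, matches anything after "search/hsa".

The logic then continues: any link matching the pattern 'hsh.*' will be passed to the callback, my parse_item.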
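
To see how the two patterns sort links, here is a small sketch using only Python's re module. It assumes, as a simplification, that a link extractor keeps a link when its allow pattern is found anywhere in the URL, and that the first rule matching a link is the one that handles it. The sample URLs are made up for illustration, and classify is a hypothetical helper, not part of Scrapy.

```python
import re

# The allow patterns from the two rules above, in rule order.
RULES = [
    (re.compile(r'search/hsa.*'), "follow only"),
    (re.compile(r'hsh.*'), "callback parse_item"),
]


def classify(url):
    """Return the action of the first rule whose pattern is found in url."""
    for pattern, action in RULES:
        if pattern.search(url):
            return action
    return "ignored"


# Made-up URLs, for illustration only.
samples = [
    "https://columbia.craigslist.org/search/hsa?s=120",
    "https://columbia.craigslist.org/hsh/d/some-item/123.html",
    "https://columbia.craigslist.org/about/help",
]
for url in samples:
    print(classify(url), url)
```

Because the pattern only needs to be found somewhere in the URL, the trailing .* does not change which links match in this sketch; r'search/hsa' would select the same URLs.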

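If you think of it as steps, the "clicks" it takes to get from one page to another, it should help. Crawl spiders are perfectly acceptable, but a generic spider gives you the most control over your scraping project, and a well-written one should be more precise and faster.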

Answer 2

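You are overriding the parse method on a CrawlSpider subclass, which the documentation advises against:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

However, I don't see any rules in your spider, so I suggest switching from scrapy.spiders.CrawlSpider to scrapy.spiders.Spider. Just inherit from the Spider class and run it again; it should work as expected.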

2 answers

Answer 1: score 2, accepted, 2017-06-19 18:15:21
Answer 2: score 0, 2017-06-19 06:42:43
