
Scrapy: 0 pages scraped (works in scrapy shell but not with scrapy crawl spider command)

I'm having some trouble with Scrapy: it isn't returning any results. I tried copying and pasting the spider below into the scrapy shell, and there it does work. I'm really not sure what the problem is, but when I run it with "scrapy crawl rxomega" it doesn't work.

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from iherb.items import IherbItem

class RxomegaSpider(CrawlSpider):
    name = 'rxomega'
    allowed_domains = ['http://www.iherb.com/']
    start_urls = ['http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/',
            'http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/']
    #rules = (
    #    Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    #)

    def parse_item(self, response):
        print('hello')
        sel = Selector(response)
        sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
        items = []
        for site in sites:
            i = IherbItem()
            i['review'] = site.xpath('div[5]/p/text()').extract()
            items.append(i)
        return items

The messages I see are... scrapy crawl rxomega

2014-02-16 17:00:55-0800 [scrapy] INFO: Scrapy 0.22.0 started (bot: iherb)
2014-02-16 17:00:55-0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-02-16 17:00:55-0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'iherb.spiders', 'SPIDER_MODULES': ['iherb.spiders'], 'BOT_NAME': 'iherb'}
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled item pipelines:
2014-02-16 17:00:55-0800 [rxomega] INFO: Spider opened
2014-02-16 17:00:55-0800 [rxomega] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-16 17:00:55-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6026
2014-02-16 17:00:55-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6083
2014-02-16 17:00:55-0800 [rxomega] DEBUG: Crawled (200) <GET http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/> (referer: None)
2014-02-16 17:00:56-0800 [rxomega] DEBUG: Crawled (200) <GET http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/> (referer: None)
2014-02-16 17:00:56-0800 [rxomega] INFO: Closing spider (finished)
2014-02-16 17:00:56-0800 [rxomega] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 588,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 37790,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 2, 17, 1, 0, 56, 22065),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 2, 17, 1, 0, 55, 256404)}
2014-02-16 17:00:56-0800 [rxomega] INFO: Spider closed (finished)

The genspider command creates a CrawlSpider with parse_item, but the tutorial uses Spider with parse (both on version 0.22). Changing the code above to Spider and parse makes it work.
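
A minimal sketch of that fix, assuming Scrapy 0.22 and reusing the asker's item and XPath expressions (on 0.22, Spider is importable from scrapy.spider; newer releases also expose it as scrapy.Spider):

from scrapy.selector import Selector
from scrapy.spider import Spider  # scrapy.Spider on newer releases
from iherb.items import IherbItem

class RxomegaSpider(Spider):
    name = 'rxomega'
    allowed_domains = ['www.iherb.com']  # domain only, no http:// prefix
    start_urls = [
        'http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/',
        'http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/',
    ]

    # Spider feeds every start_urls response to parse(), so this callback
    # actually runs; a CrawlSpider with no rules never calls parse_item.
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
        items = []
        for site in sites:
            i = IherbItem()
            i['review'] = site.xpath('div[5]/p/text()').extract()
            items.append(i)
        return items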

If you want to use CrawlSpider to scrape item pages from this site, you have to change two things:

  1. allowed_domains = ['www.iherb.com'] — leave out the http:// prefix
  2. rules = ( Rule(SgmlLinkExtractor(allow=r'Items'), callback='parse_item', follow=True), )

That is, uncomment the rules by removing the leading # markers; a sketch with both changes applied follows.
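
A minimal sketch of the CrawlSpider variant with both changes applied, assuming Scrapy 0.22 (where SgmlLinkExtractor still lives under scrapy.contrib) and keeping this answer's allow=r'Items' pattern, which only matters if the site actually has matching links:

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from iherb.items import IherbItem

class RxomegaSpider(CrawlSpider):
    name = 'rxomega'
    allowed_domains = ['www.iherb.com']  # domain only, no http:// prefix
    start_urls = [
        'http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/',
        'http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/',
    ]

    # CrawlSpider only follows links that match an enabled Rule and only then
    # calls that Rule's callback, which is why the commented-out rules above
    # produced zero scraped items.
    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
        items = []
        for site in sites:
            i = IherbItem()
            i['review'] = site.xpath('div[5]/p/text()').extract()
            items.append(i)
        return items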

Change CrawlSpider in the RxomegaSpider class declaration to scrapy.Spider and rename the function from parse_item to parse. Hope this helps.

I think you should use Spider and parse like this:

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
    items = []
    for site in sites:
        i = IherbItem()
        i['review'] = site.xpath('div[5]/p/text()').extract()
        items.append(i)
    return items

Try the start_requests method. In cases like this it always solves the problem for me :)

# These go inside the spider class; scrapy.Request needs "import scrapy" at the top of the module.
def start_requests(self):
    urls = [
        'http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/',
        'http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

# Renamed from parse_item to parse so it matches the callback passed above.
def parse(self, response):
    print('hello')
    sel = Selector(response)
    sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
    items = []
    for site in sites:
        i = IherbItem()
        i['review'] = site.xpath('div[5]/p/text()').extract()
        items.append(i)
    return items

Remove allowed_domains and use def start_requests(self): instead.
