Scrapy: 0 pages scraped (works in scrapy shell but not with scrapy crawl spider command)
I'm having some trouble with Scrapy: it doesn't return any results. I tried copying and pasting the spider below into the scrapy shell, and there it does work. I'm really not sure what the problem is, but when I run it with "scrapy crawl rxomega" it doesn't work.
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from iherb.items import IherbItem

class RxomegaSpider(CrawlSpider):
    name = 'rxomega'
    allowed_domains = ['http://www.iherb.com/']
    start_urls = ['http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/',
                  'http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/']

    #rules = (
    #    Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    #)

    def parse_item(self, response):
        print('hello')
        sel = Selector(response)
        sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
        items = []
        for site in sites:
            i = IherbItem()
            i['review'] = site.xpath('div[5]/p/text()').extract()
            items.append(i)
        return items
The output I see from scrapy crawl rxomega is...
2014-02-16 17:00:55-0800 [scrapy] INFO: Scrapy 0.22.0 started (bot: iherb)
2014-02-16 17:00:55-0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-02-16 17:00:55-0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'iherb.spiders', 'SPIDER_MODULES': ['iherb.spiders'], 'BOT_NAME': 'iherb'}
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled item pipelines:
2014-02-16 17:00:55-0800 [rxomega] INFO: Spider opened
2014-02-16 17:00:55-0800 [rxomega] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-16 17:00:55-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6026
2014-02-16 17:00:55-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6083
2014-02-16 17:00:55-0800 [rxomega] DEBUG: Crawled (200) <GET http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/> (referer: None)
2014-02-16 17:00:56-0800 [rxomega] DEBUG: Crawled (200) <GET http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/> (referer: None)
2014-02-16 17:00:56-0800 [rxomega] INFO: Closing spider (finished)
2014-02-16 17:00:56-0800 [rxomega] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 588,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 37790,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 2, 17, 1, 0, 56, 22065),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 2, 17, 1, 0, 55, 256404)}
2014-02-16 17:00:56-0800 [rxomega] INFO: Spider closed (finished)
The genspider command created a CrawlSpider with parse_item, but the tutorial uses Spider with parse. Both are version 0.22. After changing the code above to Spider and parse, it works.
If you want to keep using CrawlSpider to crawl the item pages from that site, you have to change two things:

allowed_domains = ['www.iherb.com']

(drop the http:// prefix)

rules = (
    Rule(SgmlLinkExtractor(allow=r'Items'), callback='parse_item', follow=True),
)

(uncomment the rules and remove the trailing / from the allow pattern)
Change the parent class of RxomegaSpider from CrawlSpider to scrapy.Spider, and rename the function parse_item to parse. Hope that helps.
I think you should use Spider and parse like this:
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
    items = []
    for site in sites:
        i = IherbItem()
        i['review'] = site.xpath('div[5]/p/text()').extract()
        items.append(i)
    return items
Try using the start_requests method. It always solves this kind of problem for me :)
def start_requests(self):
    urls = [
        'http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/',
        'http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):  # renamed from parse_item so the callback above resolves
    print('hello')
    sel = Selector(response)
    sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
    items = []
    for site in sites:
        i = IherbItem()
        i['review'] = site.xpath('div[5]/p/text()').extract()
        items.append(i)
    return items
Remove allowed_domains and use def start_requests(self): instead.