
Scrapy exports an empty CSV

My problem is the following: Scrapy exports an empty CSV.

My code is structured as follows:

items.py:

import scrapy


class BomnegocioItem(scrapy.Item):
    title = scrapy.Field()
    pass

pipelines.py:

class BomnegocioPipeline(object):
    def process_item(self, item, spider):
        return item

settings.py:

BOT_NAME = 'bomnegocio'

SPIDER_MODULES = ['bomnegocio.spiders']
NEWSPIDER_MODULE = 'bomnegocio.spiders'
LOG_ENABLED = True

bomnegocioSpider.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from bomnegocio.items  import BomnegocioItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log
import csv
import urllib2

class bomnegocioSpider(CrawlSpider):

    name = 'bomnegocio'
    allowed_domains = ["http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"]
    start_urls = [
    "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    rules = (Rule (SgmlLinkExtractor(allow=r'/fogao')
    , callback="parse_bomnegocio", follow= True),
    )

    print "=====> Start data extract ...."

    def parse_bomnegocio(self,response):                                                     
        #hxs = HtmlXPathSelector(response)

        #items = [] 
        item = BomnegocioItem()     

        item['title'] = response.xpath("//*[@id='ad_title']/text()").extract()[0]                        
        #items.append(item)

        return item

    print "=====> Finish data extract."     

    #//*[@id="ad_title"]

Terminal:

$ scrapy crawl bomnegocio -o dataextract.csv -t csv

=====> Start data extract ....
=====> Finish data extract.
2014-12-12 13:38:45-0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: bomnegocio)
2014-12-12 13:38:45-0200 [scrapy] INFO: Optional features available: ssl, http11
2014-12-12 13:38:45-0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bomnegocio.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['bomnegocio.spiders'], 'FEED_URI': 'dataextract.csv', 'BOT_NAME': 'bomnegocio'}
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled item pipelines: 
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider opened
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Crawled (200) <GET http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713> (referer: None)
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?t=&u=http%3A%2F%2Fsp.bomnegocio.com%2Fregiao-de-bauru-e-marilia%2Feletrodomesticos%2Ffogao-industrial-itajobi-4-bocas-c-forno-54183713>
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Closing spider (finished)
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 308,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 8503,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 538024),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'offsite/domains': 1,
     'offsite/filtered': 1,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 119067)}
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider closed (finished)

Why do I get

===> 2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

$ nano dataextract.csv

It looks empty. =(

I made a few hypotheses:

i) Is my extraction statement using the wrong XPath? I went to the terminal and typed:

$ scrapy shell "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    >>> response.xpath("//*[@id='ad_title']/text()").extract()[0] 
u'\n\t\t\t\n\t\t\t\tFog\xe3o industrial itajobi 4 bocas c/ forno \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t- '

Answer: No, the problem is not in the XPath expression.

ii) My imports? The log output does not show any problem with the imports.

Thanks for your attention; I look forward to hearing your thoughts.

There are a few problems with this spider:

1) allowed_domains is meant to hold domains, so you want to use:

allowed_domains = ["bomnegocio.com"]
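
Roughly speaking, the offsite check compares each request's hostname against the entries in allowed_domains, so a full URL in that list can never match and follow-up requests get filtered. The snippet below is only a rough sketch of that idea, not Scrapy's actual offsite middleware code:

from urlparse import urlparse

def is_allowed(url, allowed_domains):
    # The request's hostname must equal an allowed domain
    # or be a subdomain of it.
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

url = ("http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/"
       "fogao-industrial-itajobi-4-bocas-c-forno-54183713")

print is_allowed(url, ["bomnegocio.com"])   # True
print is_allowed(url, [url])                # False: a full URL is not a hostname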

2) The use of rules is not appropriate here, because rules define how a site should be crawled, i.e. which links to follow. In this case you don't need to follow any links; you just want to scrape data directly from the URLs listed in start_urls. So I suggest you get rid of the rules attribute, make the spider extend scrapy.Spider instead, and scrape the data in the default parse callback:

from bomnegocio.items import BomnegocioItem
import scrapy

class bomnegocioSpider(scrapy.Spider):

    name = 'bomnegocio'
    allowed_domains = ["bomnegocio.com"]
    start_urls = [
    "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    def parse(self,response):
        print "=====> Start data extract ...."
        yield BomnegocioItem(
            title=response.xpath("//*[@id='ad_title']/text()").extract()[0]
        )
        print "=====> Finish data extract."

Also note how the print statements are now inside the callback, and the use of yield instead of return (which lets you generate more than one item from a single page).
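
For example, if you later scrape a (hypothetical) listing page that contains several ads, the same pattern lets one callback emit one item per ad. The selectors below are purely illustrative and not taken from the actual bomnegocio.com markup:

def parse(self, response):
    # One item per ad block on the page; the XPath expressions are made up.
    for ad in response.xpath("//div[@class='ad']"):
        yield BomnegocioItem(
            title=ad.xpath(".//h2/text()").extract()[0].strip()
        )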
