
scrapy export empty csv

My question is the following: Scrapy exports an empty CSV.

My code structure:

items.py :

import scrapy


class BomnegocioItem(scrapy.Item):
    title = scrapy.Field()

pipelines.py :

class BomnegocioPipeline(object):
    def process_item(self, item, spider):
        return item

settings.py:

BOT_NAME = 'bomnegocio'

SPIDER_MODULES = ['bomnegocio.spiders']
NEWSPIDER_MODULE = 'bomnegocio.spiders'
LOG_ENABLED = True

bomnegocioSpider.py :

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from bomnegocio.items  import BomnegocioItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log
import csv
import urllib2

class bomnegocioSpider(CrawlSpider):

    name = 'bomnegocio'
    allowed_domains = ["http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"]
    start_urls = [
    "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'/fogao'), callback="parse_bomnegocio", follow=True),
    )

    print "=====> Start data extract ...."

    def parse_bomnegocio(self,response):                                                     
        #hxs = HtmlXPathSelector(response)

        #items = [] 
        item = BomnegocioItem()     

        item['title'] = response.xpath("//*[@id='ad_title']/text()").extract()[0]                        
        #items.append(item)

        return item

    print "=====> Finish data extract."     

    #//*[@id="ad_title"]

terminal :

$ scrapy crawl bomnegocio -o dataextract.csv -t csv

=====> Start data extract ....
=====> Finish data extract.
2014-12-12 13:38:45-0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: bomnegocio)
2014-12-12 13:38:45-0200 [scrapy] INFO: Optional features available: ssl, http11
2014-12-12 13:38:45-0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bomnegocio.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['bomnegocio.spiders'], 'FEED_URI': 'dataextract.csv', 'BOT_NAME': 'bomnegocio'}
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled item pipelines: 
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider opened
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Crawled (200) <GET http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713> (referer: None)
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?t=&u=http%3A%2F%2Fsp.bomnegocio.com%2Fregiao-de-bauru-e-marilia%2Feletrodomesticos%2Ffogao-industrial-itajobi-4-bocas-c-forno-54183713>
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Closing spider (finished)
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 308,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 8503,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 538024),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'offsite/domains': 1,
     'offsite/filtered': 1,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 119067)}
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider closed (finished)

Why?

===> 2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

$ nano dataextract.csv

Looks empty. =(

I have some hypotheses:

i) Does my crawl rule use a wrong XPath? I went to the terminal and typed:

$ scrapy shell "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    >>> response.xpath("//*[@id='ad_title']/text()").extract()[0] 
u'\n\t\t\t\n\t\t\t\tFog\xe3o industrial itajobi 4 bocas c/ forno \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t- '

Answer: no, the problem is not in the XPath expression.
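
(As an aside, the extracted string is padded with tabs and newlines; a plain-Python cleanup, shown here only as a sketch using str.split and str.join, collapses the whitespace:)

    >>> u" ".join(response.xpath("//*[@id='ad_title']/text()").extract()[0].split())
u'Fog\xe3o industrial itajobi 4 bocas c/ forno -'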

ii) My imports? The log does not show any import problems.

Thank you for your attention; I look forward to hearing your views.

There are a few issues with this spider:

1) allowed_domains is meant to hold plain domain names, not full URLs; with a URL in there, the offsite check can never match a request's hostname. So you want to use:

allowed_domains = ["bomnegocio.com"]
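
To see the difference, here is a small sketch using scrapy.utils.url.url_is_from_any_domain, which performs the same kind of domain check the offsite middleware relies on (expected results are noted in comments):

from scrapy.utils.url import url_is_from_any_domain

url = "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"

# A bare domain matches the request's hostname:
print url_is_from_any_domain(url, ["bomnegocio.com"])  # True

# A full URL used as a "domain" never matches the hostname,
# so every request would be filtered as offsite:
print url_is_from_any_domain(url, [url])  # False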

2) The rules attribute is not a good fit here, because rules define how the site should be crawled -- which links to follow. In this case you don't need to follow any links; you just want to scrape the data directly from the URLs listed in start_urls. So get rid of the rules attribute, make the spider extend scrapy.Spider instead, and scrape the data in the default callback parse:

from bomnegocio.items import BomnegocioItem
import scrapy

class bomnegocioSpider(scrapy.Spider):

    name = 'bomnegocio'
    allowed_domains = ["bomnegocio.com"]
    start_urls = [
    "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    def parse(self,response):
        print "=====> Start data extract ...."
        yield BomnegocioItem(
            title=response.xpath("//*[@id='ad_title']/text()").extract()[0]
        )
        print "=====> Finish data extract."

Note also how the print statements are now inside the callback, and the use of yield instead of return, which lets you generate several items from one page.
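
For illustration, here is a minimal sketch of a callback that yields several items from one response; the //div[@class='ad']//h2/text() selector is a hypothetical example, not taken from the page above:

def parse(self, response):
    # The main ad title, as before.
    yield BomnegocioItem(
        title=response.xpath("//*[@id='ad_title']/text()").extract()[0]
    )
    # Hypothetical: if the page also listed related ads, you could
    # yield one item per match -- something a single return cannot do.
    for title in response.xpath("//div[@class='ad']//h2/text()").extract():
        yield BomnegocioItem(title=title)

With those two changes, running the same command as before ($ scrapy crawl bomnegocio -o dataextract.csv -t csv) should produce a dataextract.csv that contains the scraped title.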
