My question is the following: Scrapy exports an empty CSV.
The structure of my code:
items.py:

import scrapy

class BomnegocioItem(scrapy.Item):
    title = scrapy.Field()
pipelines.py:

class BomnegocioPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py:
BOT_NAME = 'bomnegocio'
SPIDER_MODULES = ['bomnegocio.spiders']
NEWSPIDER_MODULE = 'bomnegocio.spiders'
LOG_ENABLED = True
bomnegocioSpider.py :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from bomnegocio.items import BomnegocioItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log
import csv
import urllib2
class bomnegocioSpider(CrawlSpider):
    name = 'bomnegocio'
    allowed_domains = ["http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"]
    start_urls = [
        "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/fogao'), callback="parse_bomnegocio", follow=True),
    )

    print "=====> Start data extract ...."

    def parse_bomnegocio(self, response):
        #hxs = HtmlXPathSelector(response)
        #items = []
        item = BomnegocioItem()
        item['title'] = response.xpath("//*[@id='ad_title']/text()").extract()[0]
        #items.append(item)
        return item

    print "=====> Finish data extract."
    #//*[@id="ad_title"]
Terminal:
$ scrapy crawl bomnegocio -o dataextract.csv -t csv
=====> Start data extract ....
=====> Finish data extract.
2014-12-12 13:38:45-0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: bomnegocio)
2014-12-12 13:38:45-0200 [scrapy] INFO: Optional features available: ssl, http11
2014-12-12 13:38:45-0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bomnegocio.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['bomnegocio.spiders'], 'FEED_URI': 'dataextract.csv', 'BOT_NAME': 'bomnegocio'}
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled item pipelines:
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider opened
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Crawled (200) <GET http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713> (referer: None)
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?t=&u=http%3A%2F%2Fsp.bomnegocio.com%2Fregiao-de-bauru-e-marilia%2Feletrodomesticos%2Ffogao-industrial-itajobi-4-bocas-c-forno-54183713>
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Closing spider (finished)
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 308,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 8503,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 538024),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 119067)}
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider closed (finished)
Why? Notice this line:
===> 2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
$ nano dataextract.csv
It looks empty. =(
I tested some hypotheses:
i) Does my crawl code use a wrong XPath? I went to the terminal and typed:
$ scrapy shell "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
>>> response.xpath("//*[@id='ad_title']/text()").extract()[0]
u'\n\t\t\t\n\t\t\t\tFog\xe3o industrial itajobi 4 bocas c/ forno \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t- '
Answer: no, the problem is not in the XPath expression.
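(Side note: although the XPath matches, the value it returns is padded with tabs and newlines and ends with a stray "- ". A minimal cleanup sketch, using the exact raw value from the shell output above; the trailing "-" separator is assumed from that output:)

```python
# Raw value as returned by the XPath query in the shell session above
raw = u'\n\t\t\t\n\t\t\t\tFog\xe3o industrial itajobi 4 bocas c/ forno \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t- '

# Collapse all runs of whitespace into single spaces,
# then drop the trailing "-" separator and any leftover spaces
title = " ".join(raw.split()).rstrip("- ")
# title == u'Fog\xe3o industrial itajobi 4 bocas c/ forno'
```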
ii) My imports? The log does not show any import problems.
Thank you for your attention; I look forward to hearing your views.
There are a few issues with this spider:
1) allowed_domains is meant to hold domain names, not full URLs, so you want:

allowed_domains = ["bomnegocio.com"]
2) The usage of rules is not adequate here, because rules define how the site should be crawled -- which links to follow. In this case you don't need to follow any links; you just want to scrape the data directly from the URLs listed in start_urls. So I suggest you get rid of the rules attribute, make the spider extend scrapy.Spider instead, and scrape the data in the default parse callback:
from bomnegocio.items import BomnegocioItem
import scrapy

class bomnegocioSpider(scrapy.Spider):
    name = 'bomnegocio'
    allowed_domains = ["bomnegocio.com"]
    start_urls = [
        "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    def parse(self, response):
        print "=====> Start data extract ...."
        yield BomnegocioItem(
            title=response.xpath("//*[@id='ad_title']/text()").extract()[0]
        )
        print "=====> Finish data extract."
Note also how the print statements are now inside the callback, and how yield is used instead of return (which allows you to generate several items from one page).
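To see why yield matters, here is a minimal sketch of a generator-style callback (plain Python, no Scrapy involved; the titles argument stands in for the selector results a real callback would loop over):

```python
def parse(titles):
    # Each yield emits one item, so a single callback
    # can produce any number of items from one page.
    for title in titles:
        yield {"title": title.strip()}

items = list(parse(["Fogao industrial ", " Geladeira"]))
# items == [{'title': 'Fogao industrial'}, {'title': 'Geladeira'}]
```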