Downloading a CSV file using Scrapy - Python

I'm trying to download a CSV file using Scrapy 1.3.2 and Python 2.7.13, without any luck so far.

Here is the code of the spider:

import scrapy

class FinancialFilesItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

class FinancialsSpider(scrapy.Spider):
    name = "Financials Spider"
    allowed_domains = ["financials.morningstar.com"]

    def __init__(self, url):
        super(FinancialsSpider, self).__init__()
        self.start_urls = url

    def parse(self, response):

        result = FinancialFilesItem()

        result['file_urls'] = [response.url]
        yield result

And here is the main code that calls the spider:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scraper.spiders.financialsSpider import FinancialsSpider


def GetFinancials(url):

    settings = Settings()

    settings.set('ITEM_PIPELINES', {'scrapy.pipelines.files.FilesPipeline': 1})
    settings.set('FILES_STORE', 'D:/downloads/')

    process = CrawlerProcess(settings)

    spider = FinancialsSpider

    process.crawl(spider, url = url)
    process.start()

GetFinancials(["http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB"])

Here is the log when the main code is run:

2017-02-18 15:22:38 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot)
2017-02-18 15:22:38 [scrapy.utils.log] INFO: Overridden settings: {}
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-02-18 15:22:38 [scrapy.core.engine] INFO: Spider opened
2017-02-18 15:22:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-18 15:22:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-18 15:22:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> (referer: None)
2017-02-18 15:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> (referer: None)
2017-02-18 15:22:40 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> referred in <None>
2017-02-18 15:22:40 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> referred in <None>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 389, in file_downloaded
    self.store.persist_file(path, buf, info)
  File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 54, in persist_file
    with open(absolute_path, 'wb') as f:
IOError: [Errno 22] invalid mode ('wb') or filename: 'D:/full\\01958104292b4813abcda051da56e55e72d22fb9.html?t=FB'
2017-02-18 15:22:40 [scrapy.core.scraper] DEBUG: Scraped from <200 http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB>
{'file_urls': ['http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB'],
 'files': []}
2017-02-18 15:22:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-18 15:22:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 555,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5970,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'file_count': 1,
 'file_status_count/downloaded': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 18, 14, 22, 40, 160000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 5,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 2, 18, 14, 22, 38, 826000)}
2017-02-18 15:22:40 [scrapy.core.engine] INFO: Spider closed (finished)

Thanks for your answers.

Have you tried outputting to CSV?

scrapy crawl nameofspider -o file.csv
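
Since the question starts the spider from a script with CrawlerProcess rather than with the scrapy command, the rough equivalent of -o file.csv in Scrapy 1.3.x would be the FEED_FORMAT and FEED_URI settings. This is only a sketch: the name feed_settings is made up for illustration, and note that a feed export writes the scraped items (here the file_urls and files fields) as CSV rows, not the downloaded file itself.

```python
# Settings that mirror "-o file.csv" when running from a script.
# FEED_FORMAT and FEED_URI are the Scrapy 1.3.x setting names;
# 'feed_settings' is just an illustrative variable name.
feed_settings = {
    'FEED_FORMAT': 'csv',     # serialize scraped items as CSV
    'FEED_URI': 'file.csv',   # output location, relative to the working directory
}

# In the question's GetFinancials() these would be applied with, for example:
#     for key, value in feed_settings.items():
#         settings.set(key, value)
```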

It's in the log:

IOError: [Errno 22] invalid mode ('wb') or filename: 'D:/full\\01958104292b4813abcda051da56e55e72d22fb9.html?t=FB'

Change the path to this, as you are on Windows:

settings.set('FILES_STORE', 'D:\\downloads')
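
One more thing worth checking: the file name in the traceback ends in .html?t=FB, and ? is not a valid character in Windows file names, so the store path may not be the only problem. Below is a minimal sketch of building a name without the query string. The helper safe_file_path and its default_ext parameter are hypothetical names, not part of Scrapy; the idea is that a FilesPipeline subclass could return this value from its file_path method.

```python
import hashlib
import os

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7, as used in the question


def safe_file_path(url, default_ext=".csv"):
    """Return 'full/<sha1 of url><ext>' with no query string in the name.

    Hypothetical helper for illustration; not a Scrapy function.
    """
    # Hash the whole URL (query included) so ?t=FB and ?t=AAPL get different names.
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Take the extension from the path component only, which drops '?t=FB'.
    ext = os.path.splitext(urlparse(url).path)[1] or default_ext
    return "full/{}{}".format(media_guid, ext)


path = safe_file_path("http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB")
print(path)
```

The result keeps the .html extension from the URL path. If you would rather store the file as .csv, you could ignore the URL extension and always append ".csv" instead.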
