
How to export Scrapy's result to a special JSON format?

I use Scrapy to crawl and scrape StackOverflow.com. This is so.py:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'link': response.url,
        }

Expected result: so.json (valid JSON, a flat array of links)

[
   "http://stackoverflow.com/questions/36421917/exponential-number-in-custom-number-format-of-excel",
   "http://stackoverflow.com/questions/36421343/can-not-install-requirements-txt",
   "http://stackoverflow.com/questions/36418815/difference-between-two-approaches-to-pass-parameters-to-web-server",
   "http://stackoverflow.com/questions/36421743/sharing-an-oracle-database-connection-between-simultaneous-celery-tasks",
   "http://stackoverflow.com/questions/36421941/jquery-add-css-style"
]

Then I run:

scrapy runspider so.py -o so.json

The result does not look like the expected output above. I am stuck here.

Try the FEED_FORMAT=jsonlines setting, which writes each item as a separate JSON object on its own line:

scrapy runspider so.py -o so.json --set FEED_FORMAT=jsonlines
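
With jsonlines, every line of so.json is still an object such as {"link": "..."} rather than a bare string, so the file is not yet the flat array from the question. One possible workaround is to post-process the file. The sketch below assumes each record has a 'link' key, as in the spider above; the helper name links_from_jsonlines and the sample text are made up for illustration.

```python
import json


def links_from_jsonlines(text):
    """Collect the 'link' value from every non-empty JSON Lines record."""
    return [
        json.loads(line)["link"]
        for line in text.splitlines()
        if line.strip()
    ]


# Sample text standing in for the contents of so.json (illustrative only).
sample = (
    '{"link": "https://stackoverflow.com/questions/36421941/jquery-add-css-style"}\n'
    '{"link": "https://stackoverflow.com/questions/36421343/can-not-install-requirements-txt"}\n'
)

links = links_from_jsonlines(sample)

# Serialize the flat list as one valid JSON array.
flat_json = json.dumps(links, indent=3)
```

In practice the text would come from reading so.json after the crawl finishes, and flat_json would be written back to a new file.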

If you want to get

[
   "https://stackoverflow.com/questions/36421917/exponential-number-in-custom-number-format-of-excel",
   "https://stackoverflow.com/questions/36421343/can-not-install-requirements-txt",
   "https://stackoverflow.com/questions/36418815/difference-between-two-approaches-to-pass-parameters-to-web-server",
   "https://stackoverflow.com/questions/36421743/sharing-an-oracle-database-connection-between-simultaneous-celery-tasks",
   "https://stackoverflow.com/questions/36421941/jquery-add-css-style"
]

you need to write your own ItemExporter; see this question.
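
A minimal sketch of such an exporter follows. It is not taken from the linked question. It assumes that Scrapy's feed export constructs the exporter with the output file (opened in binary mode) plus keyword options, and then calls start_exporting, export_item and finish_exporting. To stay runnable without Scrapy installed, the class is written as a plain class with those three methods; in a real project you would more likely subclass scrapy.exporters.BaseItemExporter. The name LinkListExporter is hypothetical.

```python
import io
import json


class LinkListExporter:
    """Write only each item's 'link' value, as one flat JSON array."""

    def __init__(self, file, **kwargs):
        # Extra options (encoding, indent, ...) are accepted but ignored
        # in this sketch.
        self.file = file
        self.first_item = True

    def start_exporting(self):
        # Open the JSON array.
        self.file.write(b"[\n")

    def export_item(self, item):
        # Separate entries with a comma, but never leave a trailing one.
        if not self.first_item:
            self.file.write(b",\n")
        self.first_item = False
        self.file.write(b"   " + json.dumps(item["link"]).encode("utf-8"))

    def finish_exporting(self):
        # Close the JSON array.
        self.file.write(b"\n]")


# Standalone demonstration with an in-memory file instead of so.json.
buffer = io.BytesIO()
exporter = LinkListExporter(buffer)
exporter.start_exporting()
exporter.export_item(
    {"link": "https://stackoverflow.com/questions/36421941/jquery-add-css-style"}
)
exporter.finish_exporting()
exported = buffer.getvalue().decode("utf-8")
```

To have Scrapy use a class like this, map a format name to its import path in the FEED_EXPORTERS setting, for example 'json': 'myproject.exporters.LinkListExporter', where the module path is a placeholder for wherever the class actually lives.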

