
Scrapy: How to output items in a specific json format

I output the scraped data in JSON format. The default Scrapy exporter outputs a list of dicts in JSON format. The items look like this:

[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]

But I want to export the data in a specific format like this:

{
"Shop Name":"Shop 1",
"Location":"XXXXXXXXX",
"Contact":"XXXX-XXXXX",
"Products":
[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]
}

Please advise me of any solution. Thank you.

This is well documented on the Scrapy documentation page here.

from scrapy.exporters import JsonItemExporter


class ItemPipeline(object):

    file = None

    def open_spider(self, spider):
        # JsonItemExporter writes bytes, so open the file in binary mode
        self.file = open('item.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

This will create a json file containing your items.
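If you also need the shop-level wrapper from the question, one option is to buffer the items in the pipeline and write the nested object once the spider closes. This is only a minimal sketch: the NestedJsonPipeline name and the shop fields are placeholders, not part of the answer above.

import json


class NestedJsonPipeline(object):
    """Sketch: wrap all scraped products under shop-level metadata.

    Assumes the shop details are available as spider attributes
    (hypothetical names below); adjust to wherever they really live.
    """

    def open_spider(self, spider):
        self.products = []

    def process_item(self, item, spider):
        # Buffer each product instead of writing it immediately
        self.products.append(dict(item))
        return item

    def close_spider(self, spider):
        # Write the nested structure in one go when the spider finishes
        output = {
            "Shop Name": getattr(spider, 'shop_name', 'Shop 1'),
            "Location": getattr(spider, 'location', ''),
            "Contact": getattr(spider, 'contact', ''),
            "Products": self.products,
        }
        with open('item.json', 'w') as f:
            json.dump(output, f, indent=4)

Like any other pipeline, it would still need to be enabled in ITEM_PIPELINES in your settings.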

I was trying to export pretty-printed JSON, and this is what worked for me.

I created a pipeline that looked like this:

import json


class JsonPipeline(object):

    def open_spider(self, spider):
        # Open in text mode, since we write the str output of json.dumps
        self.file = open('your_file_name.json', 'w')
        self.file.write("[")
        self.first_item = True

    def close_spider(self, spider):
        self.file.write("]")
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(
            dict(item),
            sort_keys=True,
            indent=4,
            separators=(',', ': ')
        )
        # Write a comma before every item except the first, so the file
        # stays valid JSON (no trailing comma before the closing "]")
        if self.first_item:
            self.first_item = False
        else:
            line = ",\n" + line
        self.file.write(line)
        return item

It's similar to the example from the Scrapy docs (https://doc.scrapy.org/en/latest/topics/item-pipeline.html), except that it prints each JSON property indented and on a new line.

See the section on pretty printing here: https://docs.python.org/2/library/json.html
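As a side note, a custom pipeline is not strictly required just for indentation: if you use Scrapy's built-in feed exports (see the answers below), the FEED_EXPORT_INDENT setting should produce pretty-printed JSON directly. A minimal sketch, assuming a reasonably recent Scrapy version:

# settings.py (or a spider's custom_settings): ask the built-in JSON
# feed exporter to indent the output by 4 spaces at each level
FEED_EXPORT_INDENT = 4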

Another possible solution is to generate the spider output as JSON directly from the command line:

scrapy crawl "name_of_your_spider" -a NAME_OF_ANY_ARGUMENT=VALUE_OF_THE_ARGUMENT -o output_data.json

Another way to get a JSON export of the scraped/crawled output from a Scrapy spider is to enable feed export, one of the built-in capabilities of the Scrapy classes that can be enabled or disabled as required. You can do this by overriding the spider's custom_settings in the following manner; this overrides the overall Scrapy project settings for that specific spider.

So, for any spider named 'sample_spider':

class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    allowed_domains = []

    custom_settings = {
        'FEED_URI': 'sample_spider_crawled_data.json',
        'FEED_FORMAT': 'json',
        'FEED_EXPORTERS': {
            'json': 'scrapy.exporters.JsonItemExporter',
        },
        'FEED_EXPORT_ENCODING': 'utf-8',
    }
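As a usage note, FEED_URI and FEED_FORMAT are deprecated in Scrapy 2.1+ in favour of the combined FEEDS setting; an equivalent configuration (a sketch, assuming Scrapy 2.1 or later) would look like this:

import scrapy


class SampleSpider(scrapy.Spider):
    name = "sample_spider"

    # Scrapy 2.1+ replacement for the FEED_URI / FEED_FORMAT settings above
    custom_settings = {
        'FEEDS': {
            'sample_spider_crawled_data.json': {
                'format': 'json',
                'encoding': 'utf8',
            },
        },
    }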
