
How to store json file in scrapy?

I am currently using Scrapy to crawl domains from different websites, and I wonder how to save my data in a local JSON file, formatted either as a list, or as a dictionary with the key 'domain' and a list of domains as the value.

In the crawler file, the item looks like this:

item['domain'] = response.xpath('xxx').extract()  # 'xxx' stands in for the actual selector
yield item
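
For reference, the corresponding item class would normally declare that field. A minimal sketch (the class name ChinazItem is an assumption, inferred from the pipeline name below):

import scrapy

class ChinazItem(scrapy.Item):
    # single field that will hold the extracted domain(s)
    domain = scrapy.Field()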

import json
import codecs

class ChinazPipeline(object):

    def __init__(self):
        # open the output file once, when the pipeline is created
        self.file = codecs.open('save.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON object per line (JSON Lines style)
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes, so buffered output is flushed
        self.file.close()
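
For this pipeline to run at all, it has to be enabled in settings.py. A minimal sketch, assuming the project module is named chinaz (the module path here is an assumption):

# settings.py -- 'chinaz.pipelines' is an assumed module path
ITEM_PIPELINES = {
    'chinaz.pipelines.ChinazPipeline': 300,
}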

What I expect is:

{"domain": "['google.com', 'cnn.com', 'yahoo.com']"}

Or simply save all the domains I crawled as a list in JSON; either way works for me.

It's rather simple. JSON is a default Scrapy exporter. You can use it by turning on output to a JSON file:

scrapy runspider yourspider.py -o filename.json 

Scrapy will automatically determine the format you want from the file extension. Other options are .csv and .jsonlines.
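
For example, the same command with other extensions (the spider and file names are illustrative):

scrapy runspider yourspider.py -o filename.csv
scrapy runspider yourspider.py -o filename.jl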

That's the easy way. Otherwise you can write your own ItemExporter. Take a look at the exporters documentation.

NB:

You don't even need to open the file during spider initialization; Scrapy will manage it by itself. Just yield items and Scrapy will write them to the file automatically.


Scrapy is most suitable for a one page -> one item schema. What you want is to scrape all items first and then export them as a single list.

So you should keep some variable like self.results and append the new domains to it on every process_item() call. Then export it on the spider close event. There's a shortcut for this signal, so you can just add:

def closed(self, reason):
    # write the collected self.results list to a JSON file
    with open('save.json', 'w', encoding='utf-8') as f:
        json.dump({'domain': self.results}, f, ensure_ascii=False)

More documentation on the Spider.closed() method.
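
Putting it together, a minimal sketch of this approach (the spider name, start URL, and CSS selector are all assumptions for illustration):

import json
import scrapy

class DomainSpider(scrapy.Spider):
    name = 'domains'                     # assumed spider name
    start_urls = ['http://example.com']  # assumed start URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.results = []  # accumulates domains across all parsed pages

    def parse(self, response):
        # hypothetical selector; replace with one that matches your pages
        for domain in response.css('a.domain::text').getall():
            self.results.append(domain)

    def closed(self, reason):
        # called once when the spider finishes; dump the whole list at once
        with open('save.json', 'w', encoding='utf-8') as f:
            json.dump({'domain': self.results}, f, ensure_ascii=False)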
