Saving Pandas DataFrame and meta-data to JSON format

I need to save a Pandas DataFrame, along with some metadata, to a file in JSON format. (The JSON format is a requirement.)

Background
A) I can successfully read and write my rather large Pandas DataFrame to and from JSON using DataFrame.to_json() and pandas.read_json(). No problems.

B) I have no problems saving my metadata (a dict) to JSON using json.dump() / json.load().


My first attempt
Since Pandas does not support DataFrame metadata directly, my first thought was to build a top-level dict:

top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
json.dump(top_level_dict, fp)


Failure modes
C) I have found that even the simplified case of

df_dict = df.to_dict()
json.dump(df_dict, fp)

fails with:

TypeError: key (u'US', 112, 5, 80, 'wl') is not a string
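
The error in C can be reproduced without Pandas. The sketch below assumes a dict shaped like the df.to_dict() output of a Multi-Indexed frame, with made-up values. Joining each tuple key with "|" is only a hypothetical workaround for illustration, not something the original post uses.

```python
import json

# Shaped like df.to_dict() for a Multi-Indexed frame: the inner keys are
# index tuples. The column name and the value are made up.
df_dict = {"value": {("US", 112, 5, 80, "wl"): 1.5}}

try:
    json.dumps(df_dict)
    error = None
except TypeError as exc:
    error = exc  # the json module rejects tuple keys

# Hypothetical workaround: turn every tuple key into a string first.
safe_dict = {
    column: {"|".join(map(str, key)): value for key, value in rows.items()}
    for column, rows in df_dict.items()
}
encoded = json.dumps(safe_dict)
```

The string keys then have to be split again on reading, which is why flattening the index is usually the simpler route.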

D) Investigating, I've found that the complement also fails.

df.to_json(fp)
json.load(fp)

fails with

384             raise ValueError("No JSON object could be decoded")
ValueError: Expecting : delimiter: line 1 column 17 (char 16)

So it appears that Pandas' JSON format and Python's JSON library are not compatible.

My first thought is to chase down a way to modify the df.to_dict() output of C to make it amenable to Python's JSON library, but I keep hearing "If you're struggling to do something in Python, you're probably doing it wrong" in my head.


Question
What is the canonical/recommended method for adding metadata to a Pandas DataFrame and storing it in a JSON-formatted file?

Python 2.7.10
Pandas 0.17

Edit 1:
While trying out Evan Wright's great answer, I found the source of my problems: Pandas (as of 0.17) does not like saving Multi-Indexed DataFrames to JSON. The library I had created to save my (Multi-Indexed) DataFrames quietly performs a df.reset_index() before calling DataFrame.to_json(). My newer code did not. So it was DataFrame.to_json() burping on the MultiIndex.
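
As a minimal sketch of that fix, assuming a small two-level index with made-up level names ("country", "site") and values: flatten the index before calling to_json(), then rebuild it after reading. Wrapping the JSON text in io.StringIO is an assumption for recent Pandas versions, which expect a path or file-like object.

```python
import io

import pandas as pd

# Made-up Multi-Indexed frame for illustration.
index = pd.MultiIndex.from_tuples(
    [("US", 112), ("US", 113)], names=["country", "site"]
)
df = pd.DataFrame({"value": [1.5, 2.5]}, index=index)

# Flatten the index levels into ordinary columns before serializing.
flat = df.reset_index()
json_text = flat.to_json()

# Read the JSON back and rebuild the MultiIndex from those columns.
restored = pd.read_json(io.StringIO(json_text)).set_index(["country", "site"])
```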

Lesson: Read the documentation, kids, even when it's your own documentation.

Edit 2:

If you need to store both the DataFrame and the metadata in a single JSON object, see my answer below.

You should be able to just put the data on separate lines.

Writing:

import json

with open('test.json', 'w') as f:
    df.to_json(f)
    f.write('\n')  # newline separates the DataFrame JSON from the metadata
    json.dump(metadata, f)

Reading:

import json
import pandas as pd
with open('test.json') as f:
    df = pd.read_json(next(f))      # first line: the DataFrame
    metadata = json.loads(next(f))  # second line: the metadata
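
For a self-contained round trip on Python 3 and a recent Pandas, the same two-line layout can be sketched as below. The frame contents and the temporary file path are made up, and the io.StringIO wrapper assumes a Pandas version that expects a path or file-like object rather than a raw JSON string.

```python
import io
import json
import os
import tempfile

import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})
metadata = {"some": "stuff"}

path = os.path.join(tempfile.mkdtemp(), "test.json")

# Line 1 holds the DataFrame, line 2 holds the metadata.
with open(path, "w") as f:
    f.write(df.to_json())
    f.write("\n")
    json.dump(metadata, f)

with open(path) as f:
    df_back = pd.read_json(io.StringIO(next(f)))
    metadata_back = json.loads(next(f))
```

This relies on the default to_json() output being a single line, so the first newline in the file marks the boundary between the two documents.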

In my question, I erroneously stated that I needed the JSON in a file. In that situation, Evan Wright's answer is my preferred solution.

In my case, I actually need to store the JSON output as a single "blob" in a database, so my dictionary-wrangling approach appears to be necessary.

If you similarly need to store the data and metadata in a single JSON blob, the following code will work:

top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
with open(FILENAME, 'w') as outfile:
    json.dump(top_level_dict, outfile)

Just make sure the DataFrame is singly-indexed. If it's Multi-Indexed, reset the index (i.e. df.reset_index()) before doing the above.
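
If df.to_dict() yields values that the json module cannot encode (Timestamp objects, for example), one alternative, which is not part of the answer above, is to let Pandas do the serialization and embed its parsed output in the top-level dict. This is a sketch with made-up data, and the choice of orient="split" is an assumption.

```python
import json

import pandas as pd

# Made-up, singly-indexed frame for illustration.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Let Pandas encode the frame, then parse the result so it nests as a normal
# JSON object instead of as an escaped string.
top_level_dict = {
    "data": json.loads(df.to_json(orient="split")),
    "metadata": {"some": "stuff"},
}
blob = json.dumps(top_level_dict)

# The "split" layout has the keys "columns", "index" and "data", which match
# the keyword arguments of the DataFrame constructor.
loaded = json.loads(blob)
df_back = pd.DataFrame(**loaded["data"])
meta = loaded["metadata"]
```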

Reading the data back in:

with open(FILENAME, 'r') as infile:
    top_level_dict = json.load(infile)

df_as_dict = top_level_dict.pop('data', {})
df = pandas.DataFrame.from_dict(df_as_dict)

meta = top_level_dict['metadata']

At this point, you'll need to re-create your Multi-Index (if applicable).
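
One way to make that last step mechanical is to record the index level names in the metadata when writing. The sketch below assumes a recent Pandas, made-up data, and a hypothetical "index_names" metadata key that is not part of the original code.

```python
import json

import pandas as pd

# Made-up Multi-Indexed frame for illustration.
index = pd.MultiIndex.from_tuples(
    [("US", 112), ("US", 113)], names=["country", "site"]
)
df = pd.DataFrame({"value": [1.5, 2.5]}, index=index)

# Writing: flatten the index and remember which columns it came from.
top_level_dict = {
    "data": df.reset_index().to_dict(),
    "metadata": {"some": "stuff", "index_names": list(df.index.names)},
}
blob = json.dumps(top_level_dict)

# Reading: rebuild the flat frame, then restore the MultiIndex.
loaded = json.loads(blob)
flat = pd.DataFrame.from_dict(loaded["data"])
restored = flat.set_index(loaded["metadata"]["index_names"])
```

The row labels of the flat frame come back as strings ("0", "1") because JSON object keys are always strings. set_index() discards them here, so it does not matter in this case.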
