How do I convert a large JSON file to a pandas DataFrame or a regular CSV file?
I've tried json_normalize, and it seems to work; however, it does not print my desired output.
import requests
import json
from pandas.io.json import json_normalize
import pandas as pd
url = "https://www.qnt.io/api/results?pID=gifgif&mID=54a309ae1c61be23aba0da62&key=54a309ac1c61be23aba0da3f"
aResponse = requests.get(url)
y = json.loads(aResponse.content)
json_test = json.dumps(y, indent = 4, sort_keys=True)
print(json_test)
csv = json_normalize(y['results'])
print(csv)
Displaying the output of this code is difficult and extremely confusing; therefore, I think it's in both of our best interests that I leave it out. If that is a useful piece of information, I can add it.
The json.dumps portion simply organizes my JSON file so that it is easily viewable. Unfortunately, I can't post the entire JSON file because Stack isn't a huge fan of my formatting. Here is a small snippet:
{
"query_parameters": {
"limit": 10,
"mID": "54a309ae1c61be23aba0da62",
"skip": 0,
"sort": 1
},
"results": [
{
"cID": "5314ab42d34b6c5b402aead4",
"content": "BE9kUwvLfsAmI",
"content_data": {
"added_with_admin": false,
"dateAdded": 1393863490.072894,
"embedLink": "http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif",
"still_image": "http://media.giphy.com/media/BE9kUwvLfsAmI/200_s.gif",
"tags": [
"adam levine",
"embarassed",
"the voice",
"confession"
]
},
"content_type": "gif",
"index": 269,
"parameters": {
"mu": 35.92818823777915,
"sigma": 1.88084276812386
},
"rank": 0
},
There are about 10 more of these (ranging all the way up to 6119; however, I'm trying to get just part of this working). I want my output to be ordered as such: rank, tags, embedLink, mu, sigma, index.
Here is an example of my desired output:
0, adam levine, embarassed, the voice, confession, http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif, 35.92818823777915, 1.88084276812386, 269
I would like to have it as a CSV file; however, I think creating a DataFrame using pandas could also be quite useful.
I think my problem occurs because I have such a large, embedded JSON file, and it's hard for the computer to organize this large data set. Any advice would be appreciated!
First, you can call response.json() (a method on the Response object returned by requests.get) instead of decoding response.text or response.content yourself with json.loads to get the response content as parsed JSON.
import requests
import pandas as pd
from pprint import pprint
url = "https://www.qnt.io/api/results?pID=gifgif&mID=54a309ae1c61be23aba0da62&key=54a309ac1c61be23aba0da3f"
response = requests.get(url)
results = response.json()["results"]
# pprint(results)
[{'cID': '5314ab42d34b6c5b402aead4',
'content': 'BE9kUwvLfsAmI',
'content_data': {'added_with_admin': False,
'dateAdded': 1393863490.072894,
'embedLink': 'http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif',
'still_image': 'http://media.giphy.com/media/BE9kUwvLfsAmI/200_s.gif',
'tags': ['adam levine',
'embarassed',
'the voice',
'confession']},
'content_type': 'gif',
'index': 269,
'parameters': {'mu': 35.92818823777915, 'sigma': 1.88084276812386},
'rank': 0},
{'cID': '5314ab4dd34b6c5b402aeb97',
...
Then you can load the list of dicts with pd.DataFrame.from_dict (each dict becomes one row):
df = pd.DataFrame.from_dict(results)
# print(df.head(2))
cID content \
0 5314ab42d34b6c5b402aead4 BE9kUwvLfsAmI
1 5314ab4dd34b6c5b402aeb97 NZhO1SEuFmhj2
content_data content_type index \
0 {'embedLink': 'http://media3.giphy.com/media/B... gif 269
1 {'embedLink': 'http://media1.giphy.com/media/N... gif 464
parameters rank
0 {'mu': 35.92818823777915, 'sigma': 1.880842768... 0
1 {'mu': 35.70238333972232, 'sigma': 1.568292935... 1
And then use .apply(pd.Series) to expand the columns that still hold dicts (content_data and parameters) into separate columns:
df = pd.concat([df.drop(["content_data"], axis=1), df["content_data"].apply(pd.Series)], axis=1)
df = pd.concat([df.drop(["parameters"], axis=1), df["parameters"].apply(pd.Series)], axis=1)
# print(df.head(2))
cID content content_type index rank \
0 5314ab42d34b6c5b402aead4 BE9kUwvLfsAmI gif 269 0
1 5314ab4dd34b6c5b402aeb97 NZhO1SEuFmhj2 gif 464 1
added_with_admin dateAdded \
0 False 1.393863e+09
1 False 1.393864e+09
embedLink \
0 http://media3.giphy.com/media/BE9kUwvLfsAmI/gi...
1 http://media1.giphy.com/media/NZhO1SEuFmhj2/gi...
still_image \
0 http://media.giphy.com/media/BE9kUwvLfsAmI/200...
1 http://media.giphy.com/media/NZhO1SEuFmhj2/200...
tags mu sigma
0 [adam levine, embarassed, the voice, confession] 35.928188 1.880843
1 [ryan gosling, facepalm, embarrassed, confession] 35.702383 1.568293
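As a possible alternative to the two pd.concat / .apply(pd.Series) steps, pd.json_normalize (available as a top-level function in pandas 1.0 and later) can flatten the nested dicts in one call. The sketch below uses a single record trimmed from the response shown above instead of calling the API, and the column renaming at the end is my own addition. It assumes no two nested dicts share a key name, otherwise the short names would collide.

```python
import pandas as pd

# One record trimmed from the API response shown above (sample data only).
results = [
    {
        "content_data": {
            "embedLink": "http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif",
            "tags": ["adam levine", "embarassed", "the voice", "confession"],
        },
        "index": 269,
        "parameters": {"mu": 35.92818823777915, "sigma": 1.88084276812386},
        "rank": 0,
    },
]

# json_normalize flattens nested dicts into dotted column names such as
# "parameters.mu"; list values such as "tags" are left untouched.
flat = pd.json_normalize(results)

# Keep only the last path component, e.g. "parameters.mu" -> "mu".
# Assumption: the short names are unique across the nested dicts.
flat.columns = [name.split(".")[-1] for name in flat.columns]

print(sorted(flat.columns))
```

The tags column is still a list after this step, so the conversion to a string is still needed.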
And convert the tags from a list to a single comma-separated string:
df["tags"] = df["tags"].apply(lambda x: ", ".join(x))
# print(df.head(2)["tags"])
0 adam levine, embarassed, the voice, confession
1 ryan gosling, facepalm, embarrassed, confession
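For what it's worth, the same join can probably be written without a lambda, since Series.str.join also works on a column whose elements are lists of strings. A small sketch, using only the two tag lists shown in the output above as sample data:

```python
import pandas as pd

# The two tag lists from the output above, used as sample data.
df = pd.DataFrame({
    "tags": [
        ["adam levine", "embarassed", "the voice", "confession"],
        ["ryan gosling", "facepalm", "embarrassed", "confession"],
    ]
})

# Series.str.join concatenates the strings inside each list element.
df["tags"] = df["tags"].str.join(", ")

print(df["tags"].tolist())
```

Either form should give the same strings for this data; the lambda version is more explicit about what happens to each row.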
Finally, select the columns you want, in the order you want:
df = df[["rank", "tags", "embedLink", "mu", "sigma", "index"]]
# print(df.head(2))
rank tags \
0 0 adam levine, embarassed, the voice, confession
1 1 ryan gosling, facepalm, embarrassed, confession
embedLink mu sigma \
0 http://media3.giphy.com/media/BE9kUwvLfsAmI/gi... 35.928188 1.880843
1 http://media1.giphy.com/media/NZhO1SEuFmhj2/gi... 35.702383 1.568293
index
0 269
1 464
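Since the question also asks for a CSV file, the last step would presumably be DataFrame.to_csv. The sketch below rebuilds one row of the final frame from the values shown above and writes it to an in-memory string so that it runs without network access or disk writes. In practice you would pass a file path instead; the file name in the comment is only a placeholder.

```python
import io

import pandas as pd

# One row of the final frame, rebuilt from the values shown above.
df = pd.DataFrame(
    {
        "rank": [0],
        "tags": ["adam levine, embarassed, the voice, confession"],
        "embedLink": ["http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif"],
        "mu": [35.92818823777915],
        "sigma": [1.88084276812386],
        "index": [269],
    }
)

# index=False leaves out the DataFrame's own row labels.
# To write a file instead: df.to_csv("results.csv", index=False)
# ("results.csv" is a placeholder name).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Note that the tags field contains commas, so the CSV writer wraps it in double quotes. That differs slightly from the unquoted example in the question, but it keeps the file readable by pd.read_csv and other CSV parsers.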