How do I convert a large JSON file to a Pandas Dataframe or a regular CSV file?

I've tried json_normalize, and it seems to work; however, it does not print my desired output.

import requests
import json
from pandas.io.json import json_normalize
import pandas as pd

url = "https://www.qnt.io/api/results?pID=gifgif&mID=54a309ae1c61be23aba0da62&key=54a309ac1c61be23aba0da3f"

aResponse = requests.get(url)



y = json.loads(aResponse.content)
json_test = json.dumps(y, indent = 4, sort_keys=True)
print(json_test)
csv = json_normalize(y['results'])
print(csv)

Displaying the output of this code is difficult and extremely confusing; therefore, I think it's in both of our best interests that I leave it out. If that is a useful piece of information, I can add it.

The json.dumps portion simply organizes my JSON so that it is easily viewable. Unfortunately, I can't post the entire JSON file because Stack isn't a huge fan of my formatting. Here is a small snippet:

{
"query_parameters": {
    "limit": 10,
    "mID": "54a309ae1c61be23aba0da62",
    "skip": 0,
    "sort": 1
},
"results": [
    {
        "cID": "5314ab42d34b6c5b402aead4",
        "content": "BE9kUwvLfsAmI",
        "content_data": {
            "added_with_admin": false,
            "dateAdded": 1393863490.072894,
            "embedLink": "http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif",
            "still_image": "http://media.giphy.com/media/BE9kUwvLfsAmI/200_s.gif",
            "tags": [
                "adam levine",
                "embarassed",
                "the voice",
                "confession"
            ]
        },
        "content_type": "gif",
        "index": 269,
        "parameters": {
            "mu": 35.92818823777915,
            "sigma": 1.88084276812386
        },
        "rank": 0
    },

There are about 10 more of these (ranging all the way up to 6119; however, I'm trying to get just part of this working). I want my output to be ordered as such: rank, tags, embedLink, mu, sigma, index. Here is an example of my desired output:

0, adam levine, embarassed, the voice, confession, http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif, 35.92818823777915, 1.88084276812386, 269

I would like to have it as a CSV file; however, I think creating a dataframe using Pandas could also be quite useful. I think my problem occurs because I have such a large, nested JSON file, and it's hard for the computer to organize this large data-set. Any advice would be appreciated!

First, you can call response.json() on the response object, instead of passing response.content to json.loads, to get the response content as parsed JSON.

import requests
import pandas as pd
from pprint import pprint

url = "https://www.qnt.io/api/results?pID=gifgif&mID=54a309ae1c61be23aba0da62&key=54a309ac1c61be23aba0da3f"

response = requests.get(url)
results = response.json()["results"]

# pprint(results)

[{'cID': '5314ab42d34b6c5b402aead4',
  'content': 'BE9kUwvLfsAmI',
  'content_data': {'added_with_admin': False,
                   'dateAdded': 1393863490.072894,
                   'embedLink': 'http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif',
                   'still_image': 'http://media.giphy.com/media/BE9kUwvLfsAmI/200_s.gif',
                   'tags': ['adam levine',
                            'embarassed',
                            'the voice',
                            'confession']},
  'content_type': 'gif',
  'index': 269,
  'parameters': {'mu': 35.92818823777915, 'sigma': 1.88084276812386},
  'rank': 0},
 {'cID': '5314ab4dd34b6c5b402aeb97',
  ...

Then you can load the list of dicts with pd.DataFrame.from_dict:

df = pd.DataFrame.from_dict(results)

# print(df.head(2))

                        cID        content  \
0  5314ab42d34b6c5b402aead4  BE9kUwvLfsAmI   
1  5314ab4dd34b6c5b402aeb97  NZhO1SEuFmhj2   

                                        content_data content_type  index  \
0  {'embedLink': 'http://media3.giphy.com/media/B...          gif    269   
1  {'embedLink': 'http://media1.giphy.com/media/N...          gif    464   

                                          parameters  rank  
0  {'mu': 35.92818823777915, 'sigma': 1.880842768...     0  
1  {'mu': 35.70238333972232, 'sigma': 1.568292935...     1  

Then use .apply(pd.Series) to expand the columns that hold dicts into separate columns:

df = pd.concat([df.drop(["content_data"], axis=1), df["content_data"].apply(pd.Series)], axis=1)
df = pd.concat([df.drop(["parameters"], axis=1), df["parameters"].apply(pd.Series)], axis=1)

# print(df.head(2))
                        cID        content content_type  index  rank  \
0  5314ab42d34b6c5b402aead4  BE9kUwvLfsAmI          gif    269     0   
1  5314ab4dd34b6c5b402aeb97  NZhO1SEuFmhj2          gif    464     1   

   added_with_admin     dateAdded  \
0             False  1.393863e+09   
1             False  1.393864e+09   

                                           embedLink  \
0  http://media3.giphy.com/media/BE9kUwvLfsAmI/gi...   
1  http://media1.giphy.com/media/NZhO1SEuFmhj2/gi...   

                                         still_image  \
0  http://media.giphy.com/media/BE9kUwvLfsAmI/200...   
1  http://media.giphy.com/media/NZhO1SEuFmhj2/200...   

                                                tags         mu     sigma  
0   [adam levine, embarassed, the voice, confession]  35.928188  1.880843  
1  [ryan gosling, facepalm, embarrassed, confession]  35.702383  1.568293
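
As a possible alternative to the two concat calls, pd.json_normalize (available in pandas 1.0 and later, as far as I know) can flatten the nested dicts in one step. The sketch below is only an illustration on in-memory data: the first record is copied from the question, while the second record is a made-up placeholder, not real API output. Nested keys come back with dotted names such as content_data.embedLink, so the prefix is stripped afterwards; that renaming assumes the leaf names do not collide.

```python
import pandas as pd

# First record is taken from the question; the second is a hypothetical
# placeholder used only to give the frame more than one row.
results = [
    {
        "cID": "5314ab42d34b6c5b402aead4",
        "content_data": {
            "embedLink": "http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif",
            "tags": ["adam levine", "embarassed", "the voice", "confession"],
        },
        "index": 269,
        "parameters": {"mu": 35.92818823777915, "sigma": 1.88084276812386},
        "rank": 0,
    },
    {
        "cID": "placeholder-cid",
        "content_data": {
            "embedLink": "http://example.com/placeholder.gif",
            "tags": ["placeholder tag"],
        },
        "index": 1,
        "parameters": {"mu": 1.0, "sigma": 0.5},
        "rank": 1,
    },
]

# Flatten nested dicts; columns get dotted names like "parameters.mu".
flat = pd.json_normalize(results)

# Keep only the last part of each dotted name ("parameters.mu" -> "mu").
flat.columns = [name.split(".")[-1] for name in flat.columns]
```

Lists such as tags are left as Python lists by json_normalize, so the list-to-string step still applies afterwards.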

Next, convert the tags from a list to a string:

df["tags"] = df["tags"].apply(lambda x: ", ".join(x))

# print(df.head(2)["tags"])

0     adam levine, embarassed, the voice, confession
1    ryan gosling, facepalm, embarrassed, confession
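
If some records had no tags at all, ", ".join(x) would raise a TypeError on the missing value. The snippet from the question does not show such a record, so this is only a defensive sketch: join_tags is a hypothetical helper name, and returning an empty string for missing tags is an assumption about what you would want.

```python
import pandas as pd


def join_tags(value, sep=", "):
    """Join a list of tags into one string; return "" if value is not a list."""
    if isinstance(value, (list, tuple)):
        return sep.join(str(tag) for tag in value)
    return ""


# One real tag list from the question, plus a missing value and an empty list.
tags = pd.Series(
    [["adam levine", "embarassed", "the voice", "confession"], None, []]
)
joined = tags.apply(join_tags)
```

With the real dataframe this would be used as df["tags"] = df["tags"].apply(join_tags).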

Finally, select the columns you want, in the order you want:

df = df[["rank", "tags", "embedLink", "mu", "sigma", "index"]]

# print(df.head(2))

   rank                                             tags  \
0     0   adam levine, embarassed, the voice, confession   
1     1  ryan gosling, facepalm, embarrassed, confession   

                                           embedLink         mu     sigma  \
0  http://media3.giphy.com/media/BE9kUwvLfsAmI/gi...  35.928188  1.880843   
1  http://media1.giphy.com/media/NZhO1SEuFmhj2/gi...  35.702383  1.568293   

   index  
0    269  
1    464
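
Since the question also asks for a CSV file, the final dataframe can be written out with DataFrame.to_csv. This is a minimal sketch on a one-row frame built from the values in the question; it writes to an in-memory buffer so it runs without touching the disk, and the file name results.csv in the comment is just an example.

```python
import io

import pandas as pd

# One-row stand-in for the final dataframe, using values from the question.
df = pd.DataFrame(
    {
        "rank": [0],
        "tags": ["adam levine, embarassed, the voice, confession"],
        "embedLink": ["http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif"],
        "mu": [35.92818823777915],
        "sigma": [1.88084276812386],
        "index": [269],
    }
)

# index=False leaves out the 0, 1, 2, ... row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# To write a real file instead, pass a path:
# df.to_csv("results.csv", index=False)
```

Because the joined tags contain commas, to_csv wraps that field in double quotes so it stays a single column. If each tag should be its own CSV field, as in the desired output in the question, the tags would need to be split into separate columns first, and rows with different numbers of tags would then have different widths.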
