How do I convert a large JSON file to a Pandas Dataframe or a regular CSV file?

I've tried json_normalize, and this seems to work; however, it does not print my desired output.

import requests
import json
from pandas import json_normalize  # top-level since pandas 1.0; pandas.io.json.json_normalize is deprecated
import pandas as pd

url = "https://www.qnt.io/api/results?pID=gifgif&mID=54a309ae1c61be23aba0da62&key=54a309ac1c61be23aba0da3f"

aResponse = requests.get(url)



y = json.loads(aResponse.content)
json_test = json.dumps(y, indent = 4, sort_keys=True)
print(json_test)
csv = json_normalize(y['results'])
print(csv)

Displaying the output of this code is difficult and extremely confusing; therefore, I think it's in both of our best interests that I leave it out. If that would be a useful piece of information, I can add it.

The json.dumps portion simply organizes my JSON file so that it is easily viewable. Unfortunately, I can't post the entire JSON file because Stack isn't a huge fan of my formatting. Here is a small snippet:

{
"query_parameters": {
    "limit": 10,
    "mID": "54a309ae1c61be23aba0da62",
    "skip": 0,
    "sort": 1
},
"results": [
    {
        "cID": "5314ab42d34b6c5b402aead4",
        "content": "BE9kUwvLfsAmI",
        "content_data": {
            "added_with_admin": false,
            "dateAdded": 1393863490.072894,
            "embedLink": "http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif",
            "still_image": "http://media.giphy.com/media/BE9kUwvLfsAmI/200_s.gif",
            "tags": [
                "adam levine",
                "embarassed",
                "the voice",
                "confession"
            ]
        },
        "content_type": "gif",
        "index": 269,
        "parameters": {
            "mu": 35.92818823777915,
            "sigma": 1.88084276812386
        },
        "rank": 0
    },

There are about 10 more of these (the full set ranges all the way up to 6119; however, I'm just trying to get part of it working). I want my output to be ordered as such: rank, tags, embedLink, mu, sigma, index. Here is an example of my desired output:

0, adam levine, embarassed, the voice, confession, http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif, 35.92818823777915, 1.88084276812386, 269

I would like to have it as a csv file; however, I think creating a dataframe using Pandas could also be quite useful. I think my problem occurs because I have such a large, embedded json file, and it's hard for the computer to organize this large data-set. Any advice would be appreciated!

First, you can call response.json() on the requests response object instead of parsing response.content yourself with json.loads.

import requests
import pandas as pd
from pprint import pprint

url = "https://www.qnt.io/api/results?pID=gifgif&mID=54a309ae1c61be23aba0da62&key=54a309ac1c61be23aba0da3f"

response = requests.get(url)
results = response.json()["results"]

# pprint(results)

[{'cID': '5314ab42d34b6c5b402aead4',
  'content': 'BE9kUwvLfsAmI',
  'content_data': {'added_with_admin': False,
                   'dateAdded': 1393863490.072894,
                   'embedLink': 'http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif',
                   'still_image': 'http://media.giphy.com/media/BE9kUwvLfsAmI/200_s.gif',
                   'tags': ['adam levine',
                            'embarassed',
                            'the voice',
                            'confession']},
  'content_type': 'gif',
  'index': 269,
  'parameters': {'mu': 35.92818823777915, 'sigma': 1.88084276812386},
  'rank': 0},
 {'cID': '5314ab4dd34b6c5b402aeb97',
  ...

Then you can load the results (a list of dicts, one per record) with pd.DataFrame.from_dict:

df = pd.DataFrame.from_dict(results)

# print(df.head(2))

                        cID        content  \
0  5314ab42d34b6c5b402aead4  BE9kUwvLfsAmI   
1  5314ab4dd34b6c5b402aeb97  NZhO1SEuFmhj2   

                                        content_data content_type  index  \
0  {'embedLink': 'http://media3.giphy.com/media/B...          gif    269   
1  {'embedLink': 'http://media1.giphy.com/media/N...          gif    464   

                                          parameters  rank  
0  {'mu': 35.92818823777915, 'sigma': 1.880842768...     0  
1  {'mu': 35.70238333972232, 'sigma': 1.568292935...     1  

Then use .apply(pd.Series) to expand each column that holds dicts into separate columns:

df = pd.concat([df.drop(["content_data"], axis=1), df["content_data"].apply(pd.Series)], axis=1)
df = pd.concat([df.drop(["parameters"], axis=1), df["parameters"].apply(pd.Series)], axis=1)

# print(df.head(2))
                        cID        content content_type  index  rank  \
0  5314ab42d34b6c5b402aead4  BE9kUwvLfsAmI          gif    269     0   
1  5314ab4dd34b6c5b402aeb97  NZhO1SEuFmhj2          gif    464     1   

   added_with_admin     dateAdded  \
0             False  1.393863e+09   
1             False  1.393864e+09   

                                           embedLink  \
0  http://media3.giphy.com/media/BE9kUwvLfsAmI/gi...   
1  http://media1.giphy.com/media/NZhO1SEuFmhj2/gi...   

                                         still_image  \
0  http://media.giphy.com/media/BE9kUwvLfsAmI/200...   
1  http://media.giphy.com/media/NZhO1SEuFmhj2/200...   

                                                tags         mu     sigma  
0   [adam levine, embarassed, the voice, confession]  35.928188  1.880843  
1  [ryan gosling, facepalm, embarrassed, confession]  35.702383  1.568293
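As an alternative to the two pd.concat calls above, pd.json_normalize (the function the question started with) can flatten the nested dicts in one pass. The following is a minimal sketch, not the method used in this answer: sample_results is a hypothetical stand-in holding only the first record from the post, trimmed to a few fields, and the rename step assumes the inner key names (embedLink, tags, mu, sigma) do not collide with each other or with top-level keys.

```python
import pandas as pd

# Hypothetical stand-in for response.json()["results"]:
# one record with values copied from the snippet in the question.
sample_results = [
    {
        "cID": "5314ab42d34b6c5b402aead4",
        "content": "BE9kUwvLfsAmI",
        "content_data": {
            "embedLink": "http://media3.giphy.com/media/BE9kUwvLfsAmI/giphy.gif",
            "tags": ["adam levine", "embarassed", "the voice", "confession"],
        },
        "content_type": "gif",
        "index": 269,
        "parameters": {"mu": 35.92818823777915, "sigma": 1.88084276812386},
        "rank": 0,
    },
]

# Nested dicts become dotted column names such as "content_data.embedLink"
# and "parameters.mu"; list values (the tags) are left as lists.
flat = pd.json_normalize(sample_results)

# Keep only the part after the last dot, so "parameters.mu" becomes "mu".
# Assumption: the shortened names are unique.
flat = flat.rename(columns=lambda name: name.split(".")[-1])

print(sorted(flat.columns))
```

The tags column still holds Python lists after this step, so the list-to-string conversion is still needed before writing a CSV.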

Next, convert the tags from a list to a comma-separated string:

df["tags"] = df["tags"].apply(lambda x: ", ".join(x))

# print(df.head(2)["tags"])

0     adam levine, embarassed, the voice, confession
1    ryan gosling, facepalm, embarrassed, confession
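The lambda above raises a TypeError if any record has no tags list (for example a missing value). Whether the real data contains such records is not known from the post, so the following is only a defensive sketch; join_tags is a hypothetical helper name, and mapping non-list values to an empty string is an assumed policy.

```python
import pandas as pd

def join_tags(value, sep=", "):
    """Join a list of tags into one string; return "" for anything that is not a list."""
    if isinstance(value, list):
        return sep.join(str(tag) for tag in value)
    return ""

# First row copied from the post; the second row is an assumed "missing tags" case.
tags = pd.Series([["adam levine", "embarassed", "the voice", "confession"], None])
joined = tags.apply(join_tags)

print(joined.tolist())
```

With the DataFrame from this answer, the call would be df["tags"] = df["tags"].apply(join_tags).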

Finally, select the columns you want, in the order you want them:

df = df[["rank", "tags", "embedLink", "mu", "sigma", "index"]]

# print(df.head(2))

   rank                                             tags  \
0     0   adam levine, embarassed, the voice, confession   
1     1  ryan gosling, facepalm, embarrassed, confession   

                                           embedLink         mu     sigma  \
0  http://media3.giphy.com/media/BE9kUwvLfsAmI/gi...  35.928188  1.880843   
1  http://media1.giphy.com/media/NZhO1SEuFmhj2/gi...  35.702383  1.568293   

   index  
0    269  
1    464
