
How to convert any nested json into a pandas dataframe

I'm currently working on a project that analyzes multiple data sources. Most of them are fine, but I'm having a lot of trouble with JSON and its sometimes deeply nested structure. I tried turning the JSON into a Python dictionary, but that becomes hard to work with as the structure gets more complicated. For example, with this sample JSON file:

{
  "Employees": [
    {
      "userId": "rirani",
      "jobTitleName": "Developer",
      "firstName": "Romin",
      "lastName": "Irani",
      "preferredFullName": "Romin Irani",
      "employeeCode": "E1",
      "region": "CA",
      "phoneNumber": "408-1234567",
      "emailAddress": "romin.k.irani@gmail.com"
    },
    {
      "userId": "nirani",
      "jobTitleName": "Developer",
      "firstName": "Neil",
      "lastName": "Irani",
      "preferredFullName": "Neil Irani",
      "employeeCode": "E2",
      "region": "CA",
      "phoneNumber": "408-1111111",
      "emailAddress": "neilrirani@gmail.com"
    }
  ]
}

After converting this to a dictionary, dict.keys() returns only "Employees". I then switched to a pandas DataFrame and got what I wanted by calling json_normalize(dict['Employees'], sep="_"). My problem is that this must work for ALL JSON files, and inspecting the data beforehand is not an option, so normalizing this way will not always work. Is there a way to write a function that takes in any JSON and converts it into a nice pandas DataFrame? I have searched for about two weeks without finding an answer to this specific problem. Thanks.

I've had to do this in the past (flatten out a big nested JSON). This blog was really helpful. Would something like this work for you?

Note: as others have stated, making this work for EVERY JSON is a tall task. I'm merely offering a way to get started if you have a wider range of JSON objects. I'm assuming they will be relatively CLOSE to the example you posted, with hopefully similar structures.

jsonStr = '''{
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani@gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani@gmail.com"
}]
}'''

It flattens the entire JSON into a single flat record, which you can then put into a DataFrame. In this case that is 1 row with 18 columns. It then iterates through those column names, using the numeric index embedded in each name to reconstruct multiple rows. If you have a differently nested JSON, I think it should work in theory, but you'll have to test it.
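To make the key naming concrete, here is a small sketch of how one flattened column name can be split back into a row index and a column name. The helper name `split_key` is hypothetical and not part of the answer; it assumes the flattened keys look like `Employees_0_userId`, with the list position as a numeric segment between underscores.

```python
import re

def split_key(flat_key):
    """Split a flattened key into (row index, column name); hypothetical helper."""
    match = re.search(r'_(\d+)_', flat_key)
    if match is None:
        return None  # no list index inside the key
    row_idx = int(match.group(1))
    # remove the first numeric segment, keeping a single '_' as the joiner
    column = flat_key[:match.start()] + '_' + flat_key[match.end():]
    return row_idx, column

print(split_key("Employees_0_userId"))   # (0, 'Employees_userId')
print(split_key("Employees_1_region"))   # (1, 'Employees_region')
print(split_key("version"))              # None
```

Keys without a numeric segment return None here, so the caller has to decide what to do with values that do not belong to any list.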

import json
import pandas as pd
import re

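def flatten_json(y):
    """Flatten nested dicts/lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + '_')
        elif isinstance(x, list):
            # list positions become numeric key segments, e.g. 'Employees_0_userId'
            for i, value in enumerate(x):
                flatten(value, name + str(i) + '_')
        else:
            # leaf value: drop the trailing '_' from the accumulated name
            out[name[:-1]] = x

    flatten(y)
    return out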

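jsonObj = json.loads(jsonStr)
flat = flatten_json(jsonObj)  # one flat dict; 18 keys for the sample above

# Regroup the flat keys into rows, using the first list index found in each key.
rows = {}
for item, value in flat.items():
    match = re.search(r'_(\d+)_', item)
    if match is None:
        # no list index in this key; placing it in row 0 is an arbitrary choice
        row_idx, column = 0, item
    else:
        row_idx = int(match.group(1))
        column = item.replace(match.group(0), '_', 1)
    rows.setdefault(row_idx, {})[column] = value

# build the frame once instead of growing it cell by cell
results = pd.DataFrame.from_dict(rows, orient='index').sort_index()

print(results)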

Output:

print (results)
  Employees_userId           ...              Employees_emailAddress
0           rirani           ...             romin.k.irani@gmail.com
1           nirani           ...                neilrirani@gmail.com

[2 rows x 9 columns]
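As a possible alternative to regrouping flattened keys, the sketch below walks the parsed JSON, looks for every list whose items are all dicts, and hands each such list to `pd.json_normalize`. The helper names `find_record_lists` and `json_to_frames` are hypothetical, and the rule "a list of dicts is a table" is an assumption that will not fit every JSON document. Lists nested inside a record are not expanded; they stay as list-valued cells.

```python
import pandas as pd

def find_record_lists(node, path=()):
    """Yield (path, records) for every non-empty list of dicts in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_record_lists(value, path + (key,))
    elif isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            yield path, node
        else:
            # mixed or scalar list: keep looking inside its elements
            for i, item in enumerate(node):
                yield from find_record_lists(item, path + (str(i),))

def json_to_frames(data, sep="_"):
    """Return {joined path: DataFrame}, one frame per list of records found."""
    return {sep.join(path): pd.json_normalize(records, sep=sep)
            for path, records in find_record_lists(data)}

data = {"Employees": [{"userId": "rirani", "region": "CA"},
                      {"userId": "nirani", "region": "CA"}]}
frames = json_to_frames(data)
print(frames["Employees"])
```

Returning one frame per record list, keyed by its path, avoids forcing unrelated lists into a single table. If the top-level value is itself a list of records, its key is the empty string.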

d = {
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani@gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani@gmail.com"
}]
}
import pandas as pd

# each element of d["Employees"] is a flat dict, so the list of records
# can be passed to the DataFrame constructor directly
df = pd.DataFrame(d["Employees"])
print(df)

Output:

   userId jobTitleName firstName           ...            region  phoneNumber             emailAddress
0  rirani    Developer     Romin           ...                CA  408-1234567  romin.k.irani@gmail.com
1  nirani    Developer      Neil           ...                CA  408-1111111     neilrirani@gmail.com

[2 rows x 9 columns]
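This works because every record in the sample is flat. If the records themselves contain nested dicts, `pd.json_normalize` can expand them into separate columns. The sketch below assumes a made-up `contact` field that is not in the original data, purely for illustration.

```python
import pandas as pd

# hypothetical records: 'contact' is an invented nested field
records = [
    {"userId": "rirani",
     "contact": {"phone": "408-1234567", "email": "romin.k.irani@gmail.com"}},
    {"userId": "nirani",
     "contact": {"phone": "408-1111111"}},  # no email for this record
]

# nested dict keys are joined to their parent key with the separator
df = pd.json_normalize(records, sep="_")
print(df.columns.tolist())
print(df)
```

Keys that are missing from a record, such as the second record's email, come out as NaN rather than raising an error.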

For the particular JSON data given, my approach, which uses only the pandas package, is as follows:

import pandas as pd

# json as python's dict object
jsn = {
  "Employees" : [
    {
    "userId":"rirani",
    "jobTitleName":"Developer",
    "firstName":"Romin",
    "lastName":"Irani",
    "preferredFullName":"Romin Irani",
    "employeeCode":"E1",
    "region":"CA",
    "phoneNumber":"408-1234567",
    "emailAddress":"romin.k.irani@gmail.com"
    },
    {
    "userId":"nirani",
    "jobTitleName":"Developer",
    "firstName":"Neil",
    "lastName":"Irani",
    "preferredFullName":"Neil Irani",
    "employeeCode":"E2",
    "region":"CA",
    "phoneNumber":"408-1111111",
    "emailAddress":"neilrirani@gmail.com"
    }]
}

# get the first top-level key (here 'Employees')
emp = list(jsn.keys())[0]
# if there are several keys at this level ('Employers', for example),
# you need to handle each of them too (left as your task)

# get the sub-keys from the first record under that key
all_keys = jsn[emp][0].keys()

# build dataframe
result_df = pd.DataFrame()  # init a dataframe
for key in all_keys:
    col_vals = []
    for ea in jsn[emp]:
        col_vals.append(ea[key])
    # add a new column to the dataframe, using the sub-key as its header
    # note: the values here may themselves be nested objects
    # (dict, list, ...), which this code does not expand
    result_df[key] = col_vals

print(result_df.to_string())

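Output (the column order below appears to come from an older Python version; on Python 3.7+ the columns follow the key order of the first record):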

   userId lastName jobTitleName  phoneNumber             emailAddress employeeCode preferredFullName firstName region
0  rirani    Irani    Developer  408-1234567  romin.k.irani@gmail.com           E1       Romin Irani     Romin     CA
1  nirani    Irani    Developer  408-1111111     neilrirani@gmail.com           E2        Neil Irani      Neil     CA
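The part left as a task above, handling several top-level keys, could be sketched as follows. This is only one possible reading: the function name `frames_per_top_level_key` is hypothetical, the `Employers` and `version` entries are invented sample data, and top-level values that are not lists of dicts are simply skipped. It also uses `.get` so that a record with a missing sub-key yields None instead of raising a KeyError.

```python
import pandas as pd

def frames_per_top_level_key(jsn):
    """Build one DataFrame per top-level key whose value is a list of dicts."""
    frames = {}
    for top_key, records in jsn.items():
        if not (isinstance(records, list)
                and all(isinstance(r, dict) for r in records)):
            continue  # skip scalars, dicts, and mixed lists
        # union of sub-keys across all records, in first-seen order
        all_keys = []
        for record in records:
            for key in record:
                if key not in all_keys:
                    all_keys.append(key)
        # one column per sub-key; missing values become None
        columns = {key: [record.get(key) for record in records]
                   for key in all_keys}
        frames[top_key] = pd.DataFrame(columns)
    return frames

# invented sample with two record lists and one scalar
jsn = {
    "Employees": [{"userId": "rirani"}, {"userId": "nirani", "region": "CA"}],
    "Employers": [{"name": "Acme"}],
    "version": 1,
}
frames = frames_per_top_level_key(jsn)
for name, frame in frames.items():
    print(name)
    print(frame.to_string())
```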


 