
Nested JSON to Multiple DataFrames in Pandas

I am trying to build a tool that can take any JSON data and convert it into multiple data frames based on data types. I want to link the data frames with a relation so that we can identify which data belongs to which parent element (key).

For example:

    {
        "name": "Some name",
        "date": "12:23:2022",
        "Students": [
            {
                "id": ",some id",
                "value": "some val"
            },
            {
                "id": ",some id2",
                "value": "some val2"
            },
            {
                "id": ",some id3",
                "value": "some val3"
            }
        ],
        "Error": [
            {
                "id": ",some id",
                "code": "some code",
                "emessage": [
                    {
                        "err_trac": "Missing syntax",
                        "Err_code": ";"
                    },
                    {
                        "err_trac": "invalid syntax",
                        "Err_code": "="
                    }
                ]
            },
            {
                "id": ",some id2",
                "code": "some code 2",
                "emessage": [
                    {
                        "err_trac": "Missing syntax",
                        "Err_code": ";"
                    },
                    {
                        "err_trac": "invalid syntax",
                        "Err_code": "="
                    }
                ]
            },
            {
                "id": ",some id3",
                "code": "some code3",
                "emessage": [
                    {
                        "err_trac": "Missing syntax",
                        "Err_code": ";"
                    },
                    {
                        "err_trac": "invalid syntax",
                        "Err_code": "="
                    }
                ]
            }
        ]
    }

I want data frames such as:

Run
  name, date, id (uuid)

Error
  id, code, parent_id (id of Run), id (uuid)

Students
  id, value, parent_id (id of Run), id (uuid)

emessage
  err_trac, Err_code, parent_id (id of Error)

I also want a UUID-based relation to identify which key belongs to which parent ID. I am trying a flattening approach to solve this problem using Python and pandas, but my solution does not work for nested JSON.

Here is what I am trying:

import json
import uuid

import pandas as pd

op = {}


def format_string(string):

    return string.replace(" ", "_")


def get_prefix(prefix, key):

    if not key:
        return format_string(prefix)
    if prefix:
        return format_string(prefix + "_" + key)
    else:
        return key


def flatten(prefix, key, value, uid, result=[]):

    if isinstance(value, str):
        result.append({get_prefix(prefix, key): value})
    if isinstance(value, dict):
        for item in value.keys():
            flatten(get_prefix(prefix, key), item, value.get(item), uid, result)
    if isinstance(value, list):
        if prefix:
            for i in range(len(value)):
                flatten(
                    get_prefix(prefix, key + "[{}]".format(i)),
                    "",
                    value[i],
                    uid,
                    op[key],
                )
        else:
            for i in range(len(value)):
                flatten(
                    get_prefix(prefix, key + "[{}]".format(i)),
                    "",
                    value[i],
                    uid,
                    result,
                )
        res = {key: val for d in result for key, val in d.items()}
        df = pd.DataFrame.from_dict(res, orient="index")
        df["uuid"] = uid
        op["result"] = df
        return result


def solution() -> str:

    f = open("example-input/sample.json", "r")

    if f:
        str_val = json.load(f)
        print("j")
        for key, value in str_val.items():
            #  pd_op = pd.json_normalize(str_val)
            #  print(pd_op.columns)
            #  for x in pd_op["run.tip usage"]:
            #      print(x[0])
            #  break
            flatten("", key, str_val.get(key), uuid.uuid4())
    return op


print(solution())

Update

The reason I want to create multiple data frames is to put this data into a data lake and later access it via Athena in AWS. Once I have the data frames, I can move them into SQL tables.

The structure you are describing, a JSON containing an arbitrary number of nested JSONs, fits a tree data structure exactly. Since we want to store the ID of the parent JSON in each DataFrame, we will approach this with BFS (breadth-first search), also known as level-order traversal. This is a common graph traversal algorithm well suited to this kind of problem.

If an element has a parent id of None, it is the root or a top-level element.

import json
import pandas as pd
import uuid


def nested_json_list_df(file : str):
    dict_list = []
    def bfs(json_dict : dict, node_name : str, parent_uuid : uuid.UUID):
        """Breadth First Search a.k.a. Level order Traversal
        """
        # Create parent node
        uuid_val = uuid.uuid1()
        out_dict = {'id': node_name, 'uuid': uuid_val, 'parent id': parent_uuid}

        # Search child nodes
        for key, val in json_dict.items():
            # If a child node is a dict itself, it is a sub-nested JSON
            if isinstance(val, dict):
                bfs(val, key, uuid_val)
            # A list of single-nested dicts is simply a new entry
            # A list containing dicts within dicts is interpreted
            # as another nested JSON
            elif isinstance(val, list):
                new_val = []
                for v in val:
                    if isinstance(v, dict):
                        new_dict = dict()
                        for key2, val2 in v.items():
                            # Indicates nested JSONs
                            if isinstance(val2, dict):
                                bfs(val2, key, uuid_val)
                            else:
                                new_dict[key2] = val2
                        new_val.append(new_dict)
                    else:
                        new_val.append(v)
                uuid2 = uuid.uuid1()
                out_dict2 = {'id': key, 'uuid': uuid2, 'parent id': parent_uuid, key : new_val}
                dict_list.append({key : out_dict2})
            else:
                out_dict[key] = val
        
        dict_list.append({node_name : out_dict})
        return dict_list
    
    ## Run BFS ##
    with open(file) as f:
        json_dict = json.load(f)
    df_list = []
    for d in bfs(json_dict, file, None):
        df_list.append(pd.DataFrame(d))
    return df_list


df_list = nested_json_list_df('temp.json')
for df in df_list:
    print(df)

Output:

                                                    Students
Students   [{'id': ',some id', 'value': 'some val'}, {'id...
id                                                  Students
parent id                                               None
uuid                    2d68cce3-c7f7-11ec-81a3-b0227ae68aa0
                                                       Error
Error      [{'id': ',some id', 'code': 'some code', 'emes...
id                                                     Error
parent id                                               None
uuid                    2d68cce4-c7f7-11ec-b0da-b0227ae68aa0
                                      temp.json
date                                 12:23:2022
id                                    temp.json
name                                  Some name
parent id                                  None
uuid       2d68cce2-c7f7-11ec-bcd8-b0227ae68aa0
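
Note that the function above is recursive: it descends into each child as soon as it reaches it, so nodes are visited depth first. As a rough sketch of a strictly level-order variant, the following uses an explicit queue and collects one DataFrame per key, with each row carrying the uuid of the row that contains it. The names `json_to_tables`, `root_name`, and the `uuid` / `parent_id` column names are assumptions made for this illustration, and the sample is a trimmed copy of the question's data.

```python
import uuid
from collections import deque

import pandas as pd


def json_to_tables(data: dict, root_name: str = "Run") -> dict:
    """Split nested JSON into one DataFrame per key, linked by uuid/parent_id."""
    rows = {}  # table name -> list of row dicts
    queue = deque([(root_name, data, None)])  # (table, node, parent uuid)
    while queue:
        table, node, parent_id = queue.popleft()
        node_id = str(uuid.uuid4())
        row = {"uuid": node_id, "parent_id": parent_id}
        for key, value in node.items():
            if isinstance(value, dict):
                # A nested object becomes a row in its own table.
                queue.append((key, value, node_id))
            elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
                # A non-empty list of objects becomes several rows in a child table.
                queue.extend((key, item, node_id) for item in value)
            else:
                # Scalars and any other lists stay on the current row.
                row[key] = value
        rows.setdefault(table, []).append(row)
    return {name: pd.DataFrame(records) for name, records in rows.items()}


sample = {
    "name": "Some name",
    "date": "12:23:2022",
    "Students": [{"id": ",some id", "value": "some val"}],
    "Error": [
        {
            "id": ",some id",
            "code": "some code",
            "emessage": [
                {"err_trac": "Missing syntax", "Err_code": ";"},
                {"err_trac": "invalid syntax", "Err_code": "="},
            ],
        }
    ],
}

tables = json_to_tables(sample)
for name, df in tables.items():
    print(name, list(df.columns))
```

Because every row records the uuid of its container, the tables can later be joined on `parent_id` = `uuid`. This sketch assumes the input has no keys of its own called `uuid` or `parent_id`, and that rows sharing a key name belong in the same table.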

Level Order Traversal

Breadth First Search

I do not completely understand your intention, but I think json_normalize is the way to go. (I had to add some missing commas to your sample data before it would parse.) With pd.json_normalize you can read JSON data into a DataFrame more easily. If you have a lot of nested dicts (without lists), it flattens your data directly. I took name and Error.id as references.
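
To illustrate the point about nested dicts, here is a minimal sketch using a made-up record (the `meta`, `source`, `system`, and `version` keys are hypothetical and not part of the question's data). Nested keys end up as dotted column names such as `meta.source.system`.

```python
import pandas as pd

# Hypothetical record containing only nested dicts, no lists.
record = {
    "name": "Some name",
    "meta": {"source": {"system": "demo", "version": 2}},
}

flat = pd.json_normalize(record)
print(flat.columns.tolist())
print(flat.iloc[0].to_dict())
```

The separator can be changed with the `sep` argument if dots are inconvenient for later SQL column names.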

With record_path=[...] you can select specific elements. With meta=[...] you can add other data from the JSON to those elements.

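import json

import pandas as pd

# Load the sample data from the question (path taken from the question's code)
with open("example-input/sample.json") as f:
    json_data = json.load(f)

df_main = pd.json_normalize(json_data)[["name", "date"]]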
df_error = pd.json_normalize(json_data, record_path=["Error"], meta=[["name"]])[["id", "code", "name"]]
df_students = pd.json_normalize(json_data, record_path=["Students"], meta=["name"])
df_error_messages = pd.json_normalize(json_data, record_path=["Error", "emessage"], meta=[["Error", "id"]])

print(df_main)
print(df_error)
print(df_students)
print(df_error_messages)

The outputs:

Run

        name        date
0  Some name  12:23:2022

Error

          id         code       name
0   ,some id    some code  Some name
1  ,some id2  some code 2  Some name
2  ,some id3   some code3  Some name

Students

          id      value       name
0   ,some id   some val  Some name
1  ,some id2  some val2  Some name
2  ,some id3  some val3  Some name

Error messages

         err_trac Err_code   Error.id
0  Missing syntax        ;   ,some id
1  invalid syntax        =   ,some id
2  Missing syntax        ;  ,some id2
3  invalid syntax        =  ,some id2
4  Missing syntax        ;  ,some id3
5  invalid syntax        =  ,some id3
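
The tables above are linked through name and Error.id. If synthetic UUID keys are wanted instead, as the question asks, one possible approach is to stamp a key onto the root and onto each Error record before normalizing, then carry those keys along with meta. This is only a sketch: the key names `run_uuid` and `error_uuid` are invented here, and the data is a shortened copy of the sample.

```python
import uuid

import pandas as pd

json_data = {
    "name": "Some name",
    "date": "12:23:2022",
    "Error": [
        {"id": ",some id", "code": "some code",
         "emessage": [{"err_trac": "Missing syntax", "Err_code": ";"}]},
        {"id": ",some id2", "code": "some code 2",
         "emessage": [{"err_trac": "invalid syntax", "Err_code": "="}]},
    ],
}

# Stamp a synthetic key on the root and on every Error record.
# "run_uuid" and "error_uuid" are made-up names for this sketch.
keyed = {**json_data, "run_uuid": str(uuid.uuid4())}
keyed["Error"] = [
    {**err, "error_uuid": str(uuid.uuid4())} for err in json_data["Error"]
]

df_run = pd.json_normalize(keyed)[["name", "date", "run_uuid"]]

# Each Error row carries its own key plus the key of its parent run.
df_error = pd.json_normalize(keyed, record_path=["Error"], meta=["run_uuid"])
df_error = df_error.drop(columns=["emessage"])

# Each emessage row carries the key of the Error record it came from.
df_emessage = pd.json_normalize(
    keyed, record_path=["Error", "emessage"], meta=[["Error", "error_uuid"]]
).rename(columns={"Error.error_uuid": "error_uuid"})

print(df_run)
print(df_error)
print(df_emessage)
```

Distinct key names are used for each level because json_normalize raises an error when a meta column has the same name as a column already present in the records.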
