How to create a Pandas dataframe for each table's metadata (column name, type, format) from a database schema stored in a nested JSON file

Question

I have a JSON file that contains the metadata of the tables held in a schema.

I would like to create one dataframe for each table defined in the JSON file, i.e. Person, HomeAddress and Employment. Person and Employment are at the same level, but HomeAddress is nested inside Person.

For example, the dataframe for Person:

 Column_Name     Type     Format      Required
 Person_ID       Integer              Yes
 DateOfBirth     String   date-time   Yes
 ...........

The file contents are as follows:

{
    "$id": "12121212",
    "type": "object",
    "properties": {
        "PersonId": {
            "type": "integer"
        },
        "Person": {
            "type": ["object", "null"],
            "properties": {
                "PersonId": {
                    "type": "integer"
                },
                "DateOfBirth": {
                    "type": "string",
                    "format": "date-time"
                },
                "DateOfBirthVerified": {
                    "type": "boolean"
                },
                "Sex": {
                    "type": ["string", "null"]
                },
                "Surname": {
                    "type": ["string", "null"]
                },
                "Initials": {
                    "type": ["string", "null"]
                },
                "Forenames": {
                    "type": ["string", "null"]
                },
                "Title": {
                    "type": ["string", "null"]
                },
                "NationalIdNumber": {
                    "type": ["string", "null"]
                },
                "HomeAddress": {
                    "type": ["object", "null"],
                    "properties": {
                        "EffectiveDate": {
                            "type": "string",
                            "format": "date-time"
                        },
                        "EndDate": {
                            "type": "string",
                            "format": "date-time"
                        },
                        "Category": {
                            "type": ["string", "null"]
                        },
                        "Line1": {
                            "type": ["string", "null"]
                        },
                        "Line2": {
                            "type": ["string", "null"]
                        },
                        "Line3": {
                            "type": ["string", "null"]
                        },
                        "Line4": {
                            "type": ["string", "null"]
                        },
                        "City": {
                            "type": ["string", "null"]
                        },
                        "County": {
                            "type": ["string", "null"]
                        },
                        "Country": {
                            "type": ["string", "null"]
                        },
                        "CareOfAddressee": {
                            "type": ["string", "null"]
                        },
                        "PostCode": {
                            "type": ["string", "null"]
                        },
                        "SuspectAddress": {
                            "type": "boolean"
                        },
                        "Overseas": {
                            "type": "boolean"
                        }
                    },
                    "required": ["EffectiveDate", "EndDate", "Category", "Line1", "Line2", "Line3", "Line4", "City", "County", "Country", "CareOfAddressee", "PostCode", "SuspectAddress", "Overseas"]
                }
            },
            "required": ["PersonId", "DateOfBirth", "DateOfBirthVerified", "Sex", "Surname", "Initials", "Forenames", "Title", "NationalIdNumber", "HomeAddress"]
        },
        "Employment": {
            "type": ["object", "null"],
            "properties": {
                "EmployeeReference": {
                    "type": ["string", "null"]
                },
                "DateFirstEmployed": {
                    "type": "string",
                    "format": "date-time"
                },
                "PayrollNumber": {
                    "type": ["string", "null"]
                }
            },
            "required": ["EmployeeReference", "DateFirstEmployed", "PayrollNumber"]
        }
    },
    "required": ["PersonId", "Person", "Employment"]
}

Answer 1

Let d be the dictionary holding the file contents. You can then solve this recursively as follows:

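import numpy as np
import pandas as pd


def get_props(d, required=None):
    """Recursively collect one row per schema property found in d."""
    required = required or []  # avoid a mutable default argument
    props = []
    for k, v in d.items():
        if isinstance(v, dict):
            # A dict with a 'type' key describes a property (a column or a nested table).
            if 'type' in v:
                props.append({
                    'Column_Name': k,
                    'Format': v.get('format', np.nan),  # np.NaN was removed in NumPy 2.0
                    # Nullable types are lists such as ["string", "null"]; keep the first entry.
                    'Type': v['type'] if isinstance(v['type'], str) else v['type'][0],
                    'Required': 'Yes' if k in required else 'No',
                })
            # Descend, passing down the 'required' list that sits next to 'properties'.
            props.extend(get_props(v, required=d.get('required', [])))
    return props


# d is the schema already parsed into a dict (for example with json.load).
df = pd.DataFrame(get_props(d))
print(df)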

This prints:

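            Column_Name     Format     Type  Required
0              PersonId        NaN  integer       Yes
1                Person        NaN   object       Yes
2              PersonId        NaN  integer       Yes
3           DateOfBirth  date-time   string       Yes
4   DateOfBirthVerified        NaN  boolean       Yes
5                   Sex        NaN   string       Yes
6               Surname        NaN   string       Yes
7              Initials        NaN   string       Yes
8             Forenames        NaN   string       Yes
9                 Title        NaN   string       Yes
10     NationalIdNumber        NaN   string       Yes
11          HomeAddress        NaN   object       Yes
12        EffectiveDate  date-time   string       Yes
13              EndDate  date-time   string       Yes
14             Category        NaN   string       Yes
15                Line1        NaN   string       Yes
16                Line2        NaN   string       Yes
17                Line3        NaN   string       Yes
18                Line4        NaN   string       Yes
19                 City        NaN   string       Yes
20               County        NaN   string       Yes
21              Country        NaN   string       Yes
22      CareOfAddressee        NaN   string       Yes
23             PostCode        NaN   string       Yes
24       SuspectAddress        NaN  boolean       Yes
25             Overseas        NaN  boolean       Yes
26           Employment        NaN   object       Yes
27    EmployeeReference        NaN   string       Yes
28    DateFirstEmployed  date-time   string       Yes
29        PayrollNumber        NaN   string       Yes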
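
The output above is a single flat dataframe, whereas the question asks for one dataframe per table. A minimal sketch of one way to split it follows, assuming every dict that has a "properties" key is treated as a table. The function name tables_from_schema, the "root" label for the top level, and the trimmed-down schema are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

COLUMNS = ["Column_Name", "Type", "Format", "Required"]


def tables_from_schema(node, name, tables=None):
    """Return {table_name: DataFrame} for node and every nested object in it."""
    if tables is None:
        tables = {}
    required = set(node.get("required", []))
    rows = []
    for col, spec in node.get("properties", {}).items():
        col_type = spec.get("type")
        if isinstance(col_type, list):
            # Nullable types look like ["string", "null"]; keep the non-null entry.
            col_type = next((t for t in col_type if t != "null"), None)
        rows.append({
            "Column_Name": col,
            "Type": col_type,
            "Format": spec.get("format", ""),
            "Required": "Yes" if col in required else "No",
        })
        if "properties" in spec:
            # A nested object becomes its own table, keyed by its property name.
            tables_from_schema(spec, col, tables)
    tables[name] = pd.DataFrame(rows, columns=COLUMNS)
    return tables


# Trimmed-down version of the schema from the question, for illustration only.
schema = {
    "type": "object",
    "properties": {
        "Person": {
            "type": ["object", "null"],
            "properties": {
                "PersonId": {"type": "integer"},
                "DateOfBirth": {"type": "string", "format": "date-time"},
                "Sex": {"type": ["string", "null"]},
                "HomeAddress": {
                    "type": ["object", "null"],
                    "properties": {"City": {"type": ["string", "null"]}},
                    "required": ["City"],
                },
            },
            "required": ["PersonId", "DateOfBirth"],
        },
        "Employment": {
            "type": ["object", "null"],
            "properties": {"PayrollNumber": {"type": ["string", "null"]}},
            "required": ["PayrollNumber"],
        },
    },
    "required": ["Person", "Employment"],
}

tables = tables_from_schema(schema, "root")
print(tables["Person"])
```

In this sketch a nested object such as HomeAddress still appears as a row of type object in its parent table, so the parent records where the child table hangs. Two nested objects that share a property name would overwrite each other in the dictionary; prefixing the key with the parent name would avoid that.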

Solution 1
0 2022-08-07 04:16:30
