
How can I parse nested JSON to CSV

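I have a new project where I obtain JSON data back from a REST API. I'm trying to parse this data into a pipe-delimited CSV to import into our legacy software, but I can't seem to get all the value pairs parsed properly. This is my first exposure to JSON; I have tried a lot of things but only get a little bit right at a time.

I have used Python and can get some of the items I need, but not the whole JSON tree. The data comes through as a list that also contains dictionaries and further lists. I know my code is incomplete; I'm just looking for someone to point me in the right direction as to which tools in Python can get the job done.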

import json
import csv

with open('tenants.json') as access_json:
    read_content = json.load(access_json)


for rm_access in read_content:
    rm_data = rm_access

print(rm_data)
contacts_data = rm_data['Contacts']
leases_data = rm_data['Leases']
udfs_data = rm_data['UserDefinedValues']

for contacts_access in contacts_data:
    rm_contacts = contacts_access

Update:

import pandas as pd

with open('tenants.json') as access_json:
    read_content = json.load(access_json)

for rm_access in read_content:
    rm_data = rm_access

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 150)

TenantID = []
TenantDisplayID = []
Name = []
FirstName = []
LastName = []
WebMessage = []
Comment = []
RentDueDay = []
RentPeriod = []
FirstContact = []
PropertyID = []
PostingStartDate = []
CreateDate = []
CreateUserID = []
UpdateDate = []
UpdateUserID = []
Contacts = []
for rm_access in read_content:
    rm_data = rm_access

    TenantID.append(rm_data["TenantID"])
    TenantDisplayID.append(rm_data["TenantDisplayID"])
    Name.append(rm_data["Name"])
    FirstName.append(rm_data["FirstName"])
    LastName.append(rm_data["LastName"])
    WebMessage.append(rm_data["WebMessage"])
    Comment.append(rm_data["Comment"])
    RentDueDay.append(rm_data["RentDueDay"])
    RentPeriod.append(rm_data["RentPeriod"])
#    FirstContact.append(rm_data["FirstContact"])
    PropertyID.append(rm_data["PropertyID"])
    PostingStartDate.append(rm_data["PostingStartDate"])
    CreateDate.append(rm_data["CreateDate"])
    CreateUserID.append(rm_data["CreateUserID"])
    UpdateUserID.append(rm_data["UpdateUserID"])
    Contacts.append(rm_data["Contacts"])


df = pd.DataFrame({"TenantID": TenantID, "TenantDisplayID": TenantDisplayID, "Name": Name,
                   "FirstName": FirstName, "LastName": LastName, "WebMessage": WebMessage,
                   "Comment": Comment, "RentDueDay": RentDueDay, "RentPeriod": RentPeriod,
                   "PropertyID": PropertyID, "PostingStartDate": PostingStartDate,
                   "CreateDate": CreateDate, "CreateUserID": CreateUserID, "UpdateUserID": UpdateUserID,
                   "Contacts": Contacts})

print(df)

Here is a sample of the file:

[
  {
    "TenantID": 115,
    "TenantDisplayID": 115,
    "Name": "Jane Doe",
    "FirstName": "Jane",
    "LastName": "Doe",
    "WebMessage": "",
    "Comment": "",
    "RentDueDay": 1,
    "RentPeriod": "Monthly",
    "FirstContact": "2015-11-01T15:30:00",
    "PropertyID": 17,
    "PostingStartDate": "2010-10-01T00:00:00",
    "CreateDate": "2014-04-16T13:35:37",
    "CreateUserID": 1,
    "UpdateDate": "2017-03-22T11:31:48",
    "UpdateUserID": 1,
    "Contacts": [
      {
        "ContactID": 128,
        "FirstName": "Jane",
        "LastName": "Doe",
        "MiddleName": "",
        "IsPrimary": true,
        "DateOfBirth": "1975-02-27T00:00:00",
        "FederalTaxID": "111-11-1111",
        "Comment": "",
        "Email": "jane.doe@mail.com",
        "License": "ZZT4532",
        "Vehicle": "BMW 3 Series",
        "IsShowOnBill": true,
        "Employer": "REW",
        "ApplicantType": "Applicant",
        "CreateDate": "2014-04-16T13:35:37",
        "CreateUserID": 1,
        "UpdateDate": "2017-03-22T11:31:48",
        "AnnualIncome": 0.0,
        "UpdateUserID": 1,
        "ParentID": 115,
        "ParentType": "Tenant",
        "PhoneNumbers": [
          {
            "PhoneNumberID": 286,
            "PhoneNumberTypeID": 2,
            "PhoneNumber": "703-555-5610",
            "Extension": "",
            "StrippedPhoneNumber": "7035555610",
            "IsPrimary": true,
            "ParentID": 128,
            "ParentType": "Contact"
          }
        ]
      }
    ],
    "UserDefinedValues": [
      {
        "UserDefinedValueID": 1,
        "UserDefinedFieldID": 4,
        "ParentID": 115,
        "Name": "Emerg Contact Name",
        "Value": "Terry Harper",
        "UpdateDate": "2016-01-22T15:41:53",
        "FieldType": "Text",
        "UpdateUserID": 2,
        "CreateUserID": 2
      },
      {
        "UserDefinedValueID": 174,
        "UserDefinedFieldID": 5,
        "ParentID": 115,
        "Name": "Emerg Contact Phone",
        "Value": "703-555-3568",
        "UpdateDate": "2016-01-22T15:42:03",
        "FieldType": "Text",
        "UpdateUserID": 2,
        "CreateUserID": 2
      }
    ],
    "Leases": [
      {
        "LeaseID": 115,
        "TenantID": 115,
        "UnitID": 181,
        "PropertyID": 17,
        "MoveInDate": "2010-10-01T00:00:00",
        "SortOrder": 1,
        "CreateDate": "2014-04-16T13:35:37",
        "UpdateDate": "2017-03-22T11:31:48",
        "CreateUserID": 1,
        "UpdateUserID": 1
      }
    ],
    "Addresses": [
      {
        "AddressID": 286,
        "AddressTypeID": 1,
        "Address": "14393 Montgomery Road Lot #102\r\nCincinnati, OH 45122",
        "Street": "14393 Montgomery Road Lot #102",
        "City": "Cincinnati",
        "State": "OH",
        "PostalCode": "45122",
        "IsPrimary": true,
        "ParentID": 115,
        "ParentType": "Tenant"
      }
    ],
    "OpenReceivables": [],
    "Status": "Current"
  },

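Not all tenants will have all of the elements, which is also tricky.

I need the data with TenantID, TenantDisplayID, etc. at the top level. I also need the data from the values under Contacts, PhoneNumbers, Leases, etc. Each row should be static, so if a tenant doesn't have certain tags I want a Null or None instead. The output would look like TenantID|TenantDisplayID|FirstName….etc so that each row has the same number of fields.

Something like this should work: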

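import json
import pandas as pd

# Load the tenants file from the question
with open('tenants.json') as access_json:
    read_content = json.load(access_json)

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 100000)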
TenantID = []
TenantDisplayID = []
Name = []
FirstName = []
LastName = []
WebMessage = []
Comment = []
RentDueDay = []
RentPeriod = []
FirstContact = []
PropertyID = []
PostingStartDate = []
CreateDate = []
CreateUserID = []
UpdateDate = []
UpdateUserID = []
Contacts = []
for rm_access in read_content:
    rm_data = rm_access

    print(rm_data)
    # .get() returns None when a tenant lacks a key, instead of raising KeyError
    TenantID.append(rm_data.get("TenantID"))
    TenantDisplayID.append(rm_data.get("TenantDisplayID"))
    Name.append(rm_data.get("Name"))
    FirstName.append(rm_data.get("FirstName"))
    LastName.append(rm_data.get("LastName"))
    WebMessage.append(rm_data.get("WebMessage"))
    Comment.append(rm_data.get("Comment"))
    RentDueDay.append(rm_data.get("RentDueDay"))
    RentPeriod.append(rm_data.get("RentPeriod"))
    FirstContact.append(rm_data.get("FirstContact"))
    PropertyID.append(rm_data.get("PropertyID"))
    PostingStartDate.append(rm_data.get("PostingStartDate"))
    CreateDate.append(rm_data.get("CreateDate"))
    CreateUserID.append(rm_data.get("CreateUserID"))
    UpdateUserID.append(rm_data.get("UpdateUserID"))
    Contacts.append(rm_data.get("Contacts"))


df = pd.DataFrame({"TenantID":TenantID,"TenantDisplayID":TenantDisplayID, "Name": Name,
                   "FirstName":FirstName, "LastName": LastName,"WebMessage": WebMessage,
                   "Comment": Comment, "RentDueDay": RentDueDay, "RentPeriod": RentPeriod,
                   "FirstContact": FirstContact, "PropertyID": PropertyID, "PostingStartDate": PostingStartDate,
                   "CreateDate": CreateDate, "CreateUserID": CreateUserID,"UpdateUserID": UpdateUserID,
                   "Contacts": Contacts})

print(df)
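
For the pipe-delimited output the question asks for, a minimal sketch using only the standard library might look like the following. It assumes one output row per tenant/contact pair. The column lists TENANT_FIELDS and CONTACT_FIELDS, the "Contact_" prefix, and the helper names are illustrative choices, not part of the original code. dict.get returns None for absent keys and the csv module writes None as an empty field, so every row ends up with the same number of columns.

```python
import csv
import io

# Hypothetical column choices; extend them to match the real export.
TENANT_FIELDS = ["TenantID", "TenantDisplayID", "FirstName", "LastName", "FirstContact"]
CONTACT_FIELDS = ["ContactID", "Email"]


def tenant_rows(tenants):
    """Yield one flat dict per (tenant, contact) pair."""
    for tenant in tenants:
        base = {name: tenant.get(name) for name in TENANT_FIELDS}
        # A tenant without contacts still produces exactly one row.
        for contact in tenant.get("Contacts") or [{}]:
            row = dict(base)
            for name in CONTACT_FIELDS:
                row["Contact_" + name] = contact.get(name)
            yield row


def write_pipe_delimited(tenants, out):
    """Write a header plus one pipe-delimited line per row to a file-like object."""
    columns = TENANT_FIELDS + ["Contact_" + name for name in CONTACT_FIELDS]
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="|", restval="")
    writer.writeheader()
    writer.writerows(tenant_rows(tenants))


# Made-up sample: the second tenant has no FirstContact and no Contacts.
sample = [
    {"TenantID": 115, "TenantDisplayID": 115, "FirstName": "Jane", "LastName": "Doe",
     "Contacts": [{"ContactID": 128, "Email": "jane.doe@mail.com"}]},
    {"TenantID": 116, "TenantDisplayID": 116, "FirstName": "John", "LastName": "Roe"},
]
buffer = io.StringIO()
write_pipe_delimited(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with newline="" in place of the StringIO buffer. Values that themselves contain a pipe or a line break (such as the Address field in the sample) are quoted by the csv module, which the importing software would need to tolerate.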

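The general problem

The problem with this task (and others like it) is not just how to come up with an algorithm. I'm sure you could, in theory, solve it with a (not so) large number of nested for loops. The problem is organizing the code in a way that doesn't give you a headache: in a way that lets you fix bugs easily, write unit tests, understand the code just by reading it (six months from now), and change it easily when you need to. I don't know anyone who doesn't make mistakes when working their way into a deeply nested structure. And because the code mirrors the nesting of the data, hunting for bugs in heavily nested code can be quite frustrating.

The quick (and most probably: best) solution

Rely on packages made for your exact use case, such as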

https://github.com/cwacek/python-jsonschema-objects

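If you have a formal definition of the API schema, you can use packages for that. If, for instance, your API has a Swagger schema definition, you can use swagger-py ( https://github.com/digium/swagger-py ) to get your JSON response into Python objects.

The principle solution: object-oriented programming and recursion

Even though there may be libraries for your concrete use case, I would like to explain the principle of how to deal with "that kind" of task:

A good way to organize code for this kind of problem is object-oriented programming. Using the principle of recursion, the nesting hassle can be laid out much more clearly. This also makes it easier to change the code if the JSON schema of your API response changes for any reason (an update of the API, for instance). In your case, I would suggest you create something like the following: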

class JsonObject:
    """Parent Class for any Object that will be retrieved from the JSON
    and potentially has nested JsonObjects inside.

    This class takes care of parsing the json into python Objects and deals
    with the recursion into the nested structures."""

    primitives = []
    json_objects = {
        # For each class, this dict defines all the "embedded" classes which
        # live directly "under" that class in the nested JSON. It will have the
        # following structure:

        # attribute_name : class

        # In your case the JSON schema does not have any "single" objects
        # in the nesting structure, but only lists of nested objects. I
        # still include this dict to demonstrate how you would handle a
        # single "embedded" object.
    }
    json_object_lists = {
        # For each class, this dict defines all the "embedded" subclasses which
        # are provided in a list "under" that class in the nested JSON.
        # It will have the following structure:

        # attribute_name : class
    }

    @classmethod
    def from_dict(cls, d: dict) -> "JsonObject":
        instance = cls()

        for attribute in cls.primitives:
            # Here we just parse all the primitives; missing keys become None
            setattr(instance, attribute, d.get(attribute))

        for attribute, klass in cls.json_object_lists.items():
            # Here we parse all lists of embedded JSON Objects
            nested_objects = []
            for nested_dict in d.get(attribute) or []:
                nested_objects.append(klass.from_dict(nested_dict))

            setattr(instance, attribute, nested_objects)

        for attribute, klass in cls.json_objects.items():
            # Here we parse all "single" embedded JSON Objects
            nested_dict = d.get(attribute)
            setattr(
                instance,
                attribute,
                klass.from_dict(nested_dict) if nested_dict is not None else None,
            )

        return instance

    def to_csv(self) -> str:
        pass

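Since you did not explain how exactly you want to create a CSV from the JSON, I did not implement that method and left it to you. It is also not needed to explain the overall approach.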
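
As one possible starting point for the CSV step, here is a hedged sketch of a recursive flattening helper. The name flatten and the dotted/indexed key format are assumptions made for illustration. It works on the plain dicts returned by json.load rather than on JsonObject instances.

```python
def flatten(value, prefix=""):
    """Recursively flatten nested dicts and lists into a single flat dict.

    Dict keys are joined with "." and list positions appear as "[i]".
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(child, child_prefix))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        # A primitive (str, int, float, bool, None) ends the recursion.
        flat[prefix] = value
    return flat


# Reduced, made-up version of the sample tenant from the question.
tenant = {
    "TenantID": 115,
    "Contacts": [
        {"ContactID": 128, "PhoneNumbers": [{"PhoneNumber": "703-555-5610"}]}
    ],
    "OpenReceivables": [],
}
row = flatten(tenant)
print(row)
```

Note that an empty list such as OpenReceivables produces no keys at all. To get a fixed header, the union of keys over all tenants would have to be collected first, and missing keys filled with an empty value when writing each row.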

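Now we have the general parent class that all of our specific classes inherit from, so that we can apply recursion to our problem. All that remains is to define these concrete structures according to the JSON schema we want to parse. I derived the following from your sample, but you can easily change whatever you need: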

class Address(JsonObject):
    primitives = [
        "AddressID",
        "AddressTypeID",
        "Address",
        "Street",
        "City",
        "State",
        "PostalCode",
        "IsPrimary",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {}


class Lease(JsonObject):
    primitives = [
        "LeaseID",
        "TenantID",
        "UnitID",
        "PropertyID",
        "MoveInDate",
        "SortOrder",
        "CreateDate",
        "UpdateDate",
        "CreateUserID",
        "UpdateUserID",
    ]

    json_objects = {}
    json_object_lists = {}


class UserDefinedValue(JsonObject):
    primitives = [
        "UserDefinedValueID",
        "UserDefinedFieldID",
        "ParentID",
        "Name",
        "Value",
        "UpdateDate",
        "FieldType",
        "UpdateUserID",
        "CreateUserID",
    ]

    json_objects = {}
    json_object_lists = {}


class PhoneNumber(JsonObject):
    primitives = [
        "PhoneNumberID",
        "PhoneNumberTypeID",
        "PhoneNumber",
        "Extension",
        "StrippedPhoneNumber",
        "IsPrimary",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {}

class Contact(JsonObject):
    primitives = [
        "ContactID",
        "FirstName",
        "LastName",
        "MiddleName",
        "IsPrimary",
        "DateOfBirth",
        "FederalTaxID",
        "Comment",
        "Email",
        "License",
        "Vehicle",
        "IsShowOnBill",
        "Employer",
        "ApplicantType",
        "CreateDate",
        "CreateUserID",
        "UpdateDate",
        "AnnualIncome",
        "UpdateUserID",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {
        "PhoneNumbers": PhoneNumber,
    }


class Tenant(JsonObject):
    primitives = [
        "TenantID",
        "TenantDisplayID",
        "Name",
        "FirstName",
        "LastName",
        "WebMessage",
        "Comment",
        "RentDueDay",
        "RentPeriod",
        "FirstContact",
        "PropertyID",
        "PostingStartDate",
        "CreateDate",
        "CreateUserID",
        "UpdateDate",
        "UpdateUserID",
        "OpenReceivables",  # Maybe this is also a nested Object? Not clear from your sample.
        "Status",
    ]

    json_object_lists = {
        "Contacts": Contact,
        "UserDefinedValues": UserDefinedValue,
        "Leases": Lease,
        "Addresses": Address,
    }

    json_objects = {}

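You might imagine the "beauty" (or at least: the order) of this approach, which lies in the following: with this structure, we can tackle any level of nesting in the JSON response of the API without additional headache. Our code does not deepen its indentation level, because we have separated the nasty nesting into the recursive definition of JsonObject's from_dict method. This is why it is now much easier to identify bugs or apply changes to our code.

To finally parse the JSON into our objects, you would do something like this: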

import typing
import json


def tenants_from_json(json_string: str) -> typing.Iterable["Tenant"]:
    tenants = [
        Tenant.from_dict(tenant_dict)
        for tenant_dict in json.loads(json_string)
    ]
    return tenants

Important last side note: this is just the basic principle

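My code example is just a very brief introduction to the idea of using objects and recursion to deal with the overwhelming (and nasty) nesting of a structure. The code has some flaws. For example, one should avoid defining mutable class variables. And of course the whole code should validate the data it gets from the API. You might also want to add the type of each attribute and represent it correctly in the Python objects (your sample has, for instance, integers, datetimes and strings).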
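
To make those caveats a bit more concrete, here is a minimal sketch, assuming Python 3.7+ dataclasses are acceptable. The snake_case attribute names, the reduced field set, and the very basic validation are illustrative simplifications, not the full schema. It shows typed attributes, datetime parsing, and a per-instance list default instead of a shared mutable class variable.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Lease:
    lease_id: int
    move_in_date: Optional[datetime] = None

    @classmethod
    def from_dict(cls, d: dict) -> "Lease":
        # Very basic validation; a real client would check much more.
        if not isinstance(d.get("LeaseID"), int):
            raise ValueError("LeaseID must be an integer")
        raw_date = d.get("MoveInDate")
        return cls(
            lease_id=d["LeaseID"],
            move_in_date=datetime.fromisoformat(raw_date) if raw_date else None,
        )


@dataclass
class Tenant:
    tenant_id: int
    name: str = ""
    # default_factory gives every instance its own list.
    leases: List[Lease] = field(default_factory=list)

    @classmethod
    def from_dict(cls, d: dict) -> "Tenant":
        if not isinstance(d.get("TenantID"), int):
            raise ValueError("TenantID must be an integer")
        return cls(
            tenant_id=d["TenantID"],
            name=d.get("Name", ""),
            leases=[Lease.from_dict(item) for item in d.get("Leases", [])],
        )


tenant = Tenant.from_dict({
    "TenantID": 115,
    "Name": "Jane Doe",
    "Leases": [{"LeaseID": 115, "MoveInDate": "2010-10-01T00:00:00"}],
})
print(tenant)
```

The recursion is the same as in JsonObject.from_dict; the difference is that each class spells out its own fields and types explicitly, which costs more typing but lets type checkers and readers see the schema directly.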

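I really only wanted to show you the principle of object-oriented programming here.

I didn't take the time to test my code, so there might be bugs left. Again, I just wanted to demonstrate the principle.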

