Parsing nested JSON into a DataFrame

Question

I am trying to parse a nested JSON.

I've stored the dataset here so that you can see exactly what I'm seeing, if you want: https://mega.nz/file/YWNSRBjK#V9DpoY5LSp-VL8Mnu7NEfNf3FhDOCj9FHBiTQ4KHEa8

I am attempting to parse this using the pandas json_normalize function. My code in its entirety is below.

import gzip   
import shutil
import json
import pandas as pd

with gzip.open('testjson.json.gz', 'rb') as f_in:
    with open('unzipped_json.json', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Use a context manager so the file handle is closed after loading
with open('unzipped_json.json') as f:
    data = json.load(f)
keys = data.keys()
keys_string = list(keys)
 
### In Network
in_network_df = pd.json_normalize(data['in_network'])

### Negotiated Rates
negotiated_rates_df = pd.json_normalize(data=data['in_network'],
                                        record_path=("negotiated_rates"))
negotiated_rates_df = negotiated_rates_df.explode('provider_references')
negotiated_rates_df = negotiated_rates_df.explode('negotiated_prices')

### Negotiated Prices
negotiated_prices_df = pd.json_normalize(data=data['in_network'],
                                         meta=[
                                             #['negotiated_rates','provider_references'],
                                            # ['negotiation_arrangement', 'name','billing_code_type','billing_code','description']
                                             ],
                                        record_path=['negotiated_rates','negotiated_prices'],
                                        errors='ignore')
negotiated_prices_df = negotiated_prices_df.explode('service_code')

### Provider References
provider_references_df = pd.json_normalize(data['provider_references'])
provider_references_test = provider_references_df.explode('provider_groups')

### Provider Groups
provider_groups = pd.json_normalize(data=data['provider_references'],
                                    meta=['provider_group_id'],
                                        record_path=("provider_groups"))
provider_groups = provider_groups.explode('npi')

I am specifically having trouble with the negotiated prices part of this JSON object. I am trying to add some data from parent objects, but it gives me an error. Specifically, this is what I would like to do:

negotiated_prices_df = pd.json_normalize(data=data['in_network'],
                                         meta=['provider_references'],
                                        record_path=['negotiated_rates','negotiated_prices'],
                                        errors='ignore')

When I try this, I get ValueError: operands could not be broadcast together with shape (74607,) (24869,)

Can anyone help me understand what is going on here?

Edit: Some more context in case someone does not want to open my file. Here is one spot in the JSON showing the problematic portion I'm dealing with. I can't seem to get provider_references to attach to any of the child objects.

"provider_references":[261, 398, 799],"negotiated_prices":[{"negotiated_type": "fee schedule","negotiated_rate": 296.00,"expiration_date": "2023-06-30","service_code": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13",

Answer 1

I think the code that you want looks like this:

import json
import pandas as pd

with open('unzipped_json.json') as f:
    data = json.load(f)

negotiated_rates_and_prices_df = pd.json_normalize(
    data["in_network"],
    record_path=["negotiated_rates", ["negotiated_prices"]],
    meta=[
        "negotiation_arrangement",
        "name",
        "billing_code_type",
        "billing_code_type_version",
        "billing_code",
        "description",
        ["negotiated_rates", "provider_references"],
    ],
)

That takes care of the in_network part of the JSON. The trick is in the meta argument: list fields that sit at the top level of each record as plain strings, and give each nested field as a list that spells out its path in nesting order (i.e. ["negotiated_rates", "provider_references"]). There is a similar example in the pandas documentation for json_normalize.
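To make the meta path rule concrete, here is a minimal sketch on a tiny, made-up record shaped like the in_network entries. The data values and the variable names in_network and prices are illustrative assumptions, not taken from the real file.

```python
import pandas as pd

# Hypothetical miniature of the "in_network" structure (illustrative values only).
in_network = [
    {
        "name": "item A",
        "billing_code": "31240",
        "negotiated_rates": [
            {
                "provider_references": [261, 398],
                "negotiated_prices": [
                    {"negotiated_type": "fee schedule", "negotiated_rate": 296.0},
                    {"negotiated_type": "negotiated", "negotiated_rate": 310.5},
                ],
            },
            {
                "provider_references": [799],
                "negotiated_prices": [
                    {"negotiated_type": "fee schedule", "negotiated_rate": 150.0},
                ],
            },
        ],
    }
]

prices = pd.json_normalize(
    in_network,
    # Walk down to the innermost list of records.
    record_path=["negotiated_rates", "negotiated_prices"],
    meta=[
        "name",          # top-level field: plain string
        "billing_code",  # top-level field: plain string
        # Field that lives one level down: give the full path as a list.
        ["negotiated_rates", "provider_references"],
    ],
)
print(prices)
```

Each price row carries the top-level fields plus the provider_references list of the negotiated_rates entry it came from, in a column named negotiated_rates.provider_references.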

Then for the other nested part of the JSON you can do this:

provider_references_df = pd.json_normalize(
    data["provider_references"], "provider_groups", "provider_group_id"
)

And that takes care of the whole thing.
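If the goal is to go one step further and attach provider group details to each price row, one possible approach is to explode the reference list and merge. This is only a sketch on made-up data, and it assumes that the integers in provider_references correspond to provider_group_id values, which the question suggests but does not state outright.

```python
import pandas as pd

# Hypothetical stand-in for data["provider_references"] (illustrative values only).
provider_references = [
    {"provider_group_id": 261, "provider_groups": [{"npi": [111, 222]}]},
    {"provider_group_id": 799, "provider_groups": [{"npi": [333]}]},
]
groups = pd.json_normalize(provider_references, "provider_groups", "provider_group_id")
groups = groups.explode("npi", ignore_index=True)
groups["provider_group_id"] = groups["provider_group_id"].astype("int64")

# Hypothetical stand-in for the normalized prices table.
prices = pd.DataFrame(
    {
        "negotiated_rate": [296.0, 150.0],
        "negotiated_rates.provider_references": [[261, 799], [799]],
    }
)

# One row per (price, referenced group id), then a key column to join on.
exploded = prices.explode("negotiated_rates.provider_references", ignore_index=True)
exploded = exploded.rename(
    columns={"negotiated_rates.provider_references": "provider_group_id"}
)
exploded["provider_group_id"] = exploded["provider_group_id"].astype("int64")

# Assumption: reference ids match provider_group_id values.
linked = exploded.merge(groups, on="provider_group_id", how="left")
print(linked)
```

On the full dataset the exploded table can grow large, so it may be worth filtering the prices before merging.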

Answer 2

Is this what you are trying to achieve?

import pandas as pd
import json

with open("testjson.json", "r") as f:
    data = json.load(f)
for k, v in data.items():
    print(k)
negotiated_prices_df = pd.json_normalize(
    data['in_network'],
    record_path=['negotiated_rates', ['negotiated_prices']],
    meta=[
        'negotiation_arrangement',
        'name',
        'billing_code_type',
        'billing_code_type_version',
        'billing_code',
        'description',
        ['negotiated_rates', 'provider_references'],
    ],
    errors='ignore',
).explode('service_code', ignore_index=True)

print(negotiated_prices_df)

Result printed in terminal:

    negotiated_type negotiated_rate expiration_date service_code    billing_class   negotiation_arrangement name    billing_code_type   billing_code_type_version   billing_code    description negotiated_rates.provider_references
0   fee schedule    296.00  2023-06-30  01  professional    ffs nasal/s CPT 2022    31240   Nasal/sinus endoscopy surg  [261, 398, 799]
1   fee schedule    296.00  2023-06-30  02  professional    ffs nasal/s CPT 2022    31240   Nasal/sinus endoscopy surg  [261, 398, 799]
2   fee schedule    296.00  2023-06-30  03  professional    ffs nasal/s CPT 2022    31240   Nasal/sinus endoscopy surg  [261, 398, 799]
3   fee schedule    296.00  2023-06-30  04  professional    ffs nasal/s CPT 2022    31240   Nasal/sinus endoscopy surg  [261, 398, 799]
4   fee schedule    296.00  2023-06-30  05  professional    ffs nasal/s CPT 2022    31240   Nasal/sinus endoscopy surg  [261, 398, 799]
... ... ... ... ... ... ... ... ... ... ... ... ...
687789  negotiated  15461.36    2023-06-30  NaN institutional   ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   APR-DRG 39.1    192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   [191]
687790  negotiated  11953.00    2023-06-30  NaN institutional   ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   APR-DRG 39.1    192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   [521, 688]
687791  negotiated  12622.15    2023-06-30  NaN institutional   ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   APR-DRG 39.1    192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   [1003, 1045, 11, 1174, 133, 149, 177, 186, 251, 27, 564, 649, 683, 697, 705, 764, 827, 836, 837, 87, 937, 938]
687792  negotiated  11864.00    2023-06-30  NaN institutional   ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   APR-DRG 39.1    192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   [1176, 319, 974]
687793  negotiated  11229.02    2023-06-30  NaN institutional   ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   APR-DRG 39.1    192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS   [371, 523]
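The NaN entries under service_code in the last rows are consistent with how explode treats rows where the column holds no list. A minimal sketch on made-up data (the frame df and its values are illustrative, not taken from the real file):

```python
import pandas as pd

# Hypothetical rows: one with a list of service codes, one with none.
df = pd.DataFrame(
    {
        "negotiated_rate": [296.0, 15461.36],
        "service_code": [["01", "02"], float("nan")],
    }
)

# Each list element becomes its own row; the scalar NaN row passes through unchanged.
out = df.explode("service_code", ignore_index=True)
print(out)
```

With ignore_index=True the result gets a fresh 0..n-1 index instead of repeating the original row labels.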

Answer 1: score 2, accepted, 2022-09-12 12:19:21
Answer 2: score 0, 2022-09-12 08:21:11