Parsing Nested JSON to one data file
I am trying to parse a nested JSON file.
I've stored the dataset here so that you can see exactly what I'm seeing if you want: https://mega.nz/file/YWNSRBjK#V9DpoY5LSp-VL8Mnu7NEfNf3FhDOCj9FHBiTQ4KHEa8
I am attempting to parse this using the pandas json_normalize function. Below is my code in its entirety.
import gzip
import shutil
import json
import pandas as pd
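# Decompress the gzipped file to plain JSON on disk
with gzip.open('testjson.json.gz', 'rb') as f_in:
    with open('unzipped_json.json', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
# Load the decompressed JSON; the with block closes the file afterwards
with open('unzipped_json.json') as f:
    data = json.load(f)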
keys = data.keys()
keys_string = list(keys)
### In Network
in_network_df = pd.json_normalize(data['in_network'])
### Negotiated Rates
negotiated_rates_df = pd.json_normalize(data=data['in_network'],
                                        record_path=("negotiated_rates"))
negotiated_rates_df = negotiated_rates_df.explode('provider_references')
negotiated_rates_df = negotiated_rates_df.explode('negotiated_prices')
### Negotiated Prices
negotiated_prices_df = pd.json_normalize(data=data['in_network'],
                                         meta=[
                                             # ['negotiated_rates', 'provider_references'],
                                             # ['negotiation_arrangement', 'name', 'billing_code_type', 'billing_code', 'description']
                                         ],
                                         record_path=['negotiated_rates', 'negotiated_prices'],
                                         errors='ignore')
negotiated_prices_df = negotiated_prices_df.explode('service_code')
### Provider References
provider_references_df = pd.json_normalize(data['provider_references'])
provider_references_test = provider_references_df.explode('provider_groups')
### Provider Groups
provider_groups = pd.json_normalize(data=data['provider_references'],
                                    meta=['provider_group_id'],
                                    record_path=("provider_groups"))
provider_groups = provider_groups.explode('npi')
I am specifically having trouble with the negotiated prices part of this JSON object. I am trying to add in some data from the parent objects, but it gives me an error. Specifically, this is what I would like to do:
negotiated_prices_df = pd.json_normalize(data=data['in_network'],
                                         meta=['provider_references'],
                                         record_path=['negotiated_rates', 'negotiated_prices'],
                                         errors='ignore')
When I try to do this I get ValueError: operands could not be broadcast together with shape (74607,) (24869,)
Can anyone help me understand what is going on here?
Edit: Some more context in case someone does not want to open my file. Here is one spot in the JSON showing the problematic portion I'm dealing with. I can't seem to get provider_references to attach to any of the child objects.
"provider_references":[261, 398, 799],"negotiated_prices":[{"negotiated_type": "fee schedule","negotiated_rate": 296.00,"expiration_date": "2023-06-30","service_code": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13",
I think the code that you want looks like this:
with open('unzipped_json.json') as f:
    data = json.load(f)

negotiated_rates_and_prices_df = pd.json_normalize(
    data["in_network"],
    record_path=["negotiated_rates", ["negotiated_prices"]],
    meta=[
        "negotiation_arrangement",
        "name",
        "billing_code_type",
        "billing_code_type_version",
        "billing_code",
        "description",
        ["negotiated_rates", "provider_references"],
    ],
)
That takes care of the in_network part of the JSON. The trick is in the meta argument: fields that sit at the top level of each in_network record go in as plain strings, while nested fields are given as a list of keys in order of nesting (i.e. ["negotiated_rates", "provider_references"]). There's a similar example in the pandas docs for json_normalize.
Then for the other nested part of the JSON you can do this:
provider_references_df = pd.json_normalize(
    data["provider_references"], "provider_groups", "provider_group_id"
)
And that takes care of the whole thing.
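As a rough illustration of that second call, the sketch below runs the same pattern on a made-up `sample_refs` list (hypothetical values, not taken from the real file), passing `record_path` and `meta` by keyword and then exploding the `npi` lists as in the question's code.

```python
import pandas as pd

# Hypothetical sample shaped like data["provider_references"].
sample_refs = [
    {
        "provider_group_id": 261,
        "provider_groups": [{"npi": [1111111111, 2222222222]}],
    },
    {
        "provider_group_id": 398,
        "provider_groups": [{"npi": [3333333333]}],
    },
]

provider_groups = pd.json_normalize(
    sample_refs,
    record_path="provider_groups",  # one row per provider group
    meta="provider_group_id",       # carried down from the parent record
)

# Each npi cell holds a list; explode gives one row per NPI.
provider_groups = provider_groups.explode("npi", ignore_index=True)

print(provider_groups)
```

The resulting `provider_group_id` column holds the same IDs that appear in the `provider_references` lists of the prices table.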
Is this what you are trying to achieve?
import pandas as pd
import json
with open("testjson.json", "r") as f:
    data = json.load(f)

# List the top-level keys
for k, v in data.items():
    print(k)

negotiated_prices_df = pd.json_normalize(
    data['in_network'],
    record_path=['negotiated_rates', ['negotiated_prices']],
    meta=['negotiation_arrangement', 'name', 'billing_code_type',
          'billing_code_type_version', 'billing_code', 'description',
          ['negotiated_rates', 'provider_references']],
    errors='ignore',
).explode('service_code', ignore_index=True)
print(negotiated_prices_df)
Result printed in terminal:
negotiated_type negotiated_rate expiration_date service_code billing_class negotiation_arrangement name billing_code_type billing_code_type_version billing_code description negotiated_rates.provider_references
0 fee schedule 296.00 2023-06-30 01 professional ffs nasal/s CPT 2022 31240 Nasal/sinus endoscopy surg [261, 398, 799]
1 fee schedule 296.00 2023-06-30 02 professional ffs nasal/s CPT 2022 31240 Nasal/sinus endoscopy surg [261, 398, 799]
2 fee schedule 296.00 2023-06-30 03 professional ffs nasal/s CPT 2022 31240 Nasal/sinus endoscopy surg [261, 398, 799]
3 fee schedule 296.00 2023-06-30 04 professional ffs nasal/s CPT 2022 31240 Nasal/sinus endoscopy surg [261, 398, 799]
4 fee schedule 296.00 2023-06-30 05 professional ffs nasal/s CPT 2022 31240 Nasal/sinus endoscopy surg [261, 398, 799]
... ... ... ... ... ... ... ... ... ... ... ... ...
687789 negotiated 15461.36 2023-06-30 NaN institutional ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS APR-DRG 39.1 192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS [191]
687790 negotiated 11953.00 2023-06-30 NaN institutional ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS APR-DRG 39.1 192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS [521, 688]
687791 negotiated 12622.15 2023-06-30 NaN institutional ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS APR-DRG 39.1 192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS [1003, 1045, 11, 1174, 133, 149, 177, 186, 251, 27, 564, 649, 683, 697, 705, 764, 827, 836, 837, 87, 937, 938]
687792 negotiated 11864.00 2023-06-30 NaN institutional ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS APR-DRG 39.1 192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS [1176, 319, 974]
687793 negotiated 11229.02 2023-06-30 NaN institutional ffs CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS APR-DRG 39.1 192 CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS [371, 523]