Convert multiple nested JSON to CSV in python
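I have a JSON document that I want to convert to CSV, but the JSON is nested several levels deep and the inner fields do not always contain the same number of objects.
For example, kit 1 has 5 products and kit 2 has 3 products (each product also carries a quantity in both cases).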
Kit 1:
"kit":{
    "products":[
        {
            "product":"PP001",
            "quantity":1
        },
        {
            "product":"PS001",
            "quantity":1
        },
        {
            "product":"PL001",
            "quantity":1
        },
        {
            "product":"FIN1187",
            "quantity":3
        },
        {
            "product":"FSS001",
            "quantity":4
        }
    ],
    "kit_client":"Lumax Mannoh Allied Technologies Limited",
    "kit_name":"KIT1187",
    "kit_info":"Gear Lever TACO_FLC",
    "components_per_kit":66
},
Kit 2:
"kit":{
    "products":[
        {
            "product":"CRT6423",
            "quantity":1
        },
        {
            "product":"CIN1198A",
            "quantity":2
        },
        {
            "product":"CSS001",
            "quantity":3
        }
    ],
    "kit_client":"Lumax Mannoh Allied Technologies Limited",
    "kit_name":"KIT1198B",
    "kit_info":"Floor Sealing Assy_Crate",
    "components_per_kit":72
},
"flow":"LMXMNH_Manesar_Nashik_Floor Sealing Assy W501",
"asked_quantity":3,
"alloted_quantity":3
I tried json_normalize, but it only flattens the outer dictionaries. I want the output to look like this:
transaction_no dispatch_date send_from_warehouse sales_order flow_name kit_name asked_quantity alloted_quantity product1 product1 quantity product2 product2 quantity...( to the maximum product in all JSON)
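To make the problem concrete, here is a minimal sketch of what a plain json_normalize call does with input of this shape. The two records are a hypothetical, trimmed-down sample modelled on the kits above, not the full JSON.

```python
import pandas as pd

# Hypothetical sample shaped like the kits in the question (trimmed for brevity).
records = [
    {"kit": {"kit_name": "KIT1187",
             "products": [{"product": "PP001", "quantity": 1},
                          {"product": "PS001", "quantity": 1}]}},
    {"kit": {"kit_name": "KIT1198B",
             "products": [{"product": "CRT6423", "quantity": 1}]}},
]

flat = pd.json_normalize(records)

# Nested dicts become dotted column names, but the list under "products"
# is kept as a single cell holding a Python list.
print(flat.columns.tolist())
print(type(flat.loc[0, "kit.products"]))
```

The products stay packed in one column, which is why a second step (recursion, exploding, or pivoting) is needed to reach one column per product.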
Full JSON:
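json_normalize is a good tool for simple things. When you have deeply nested JSON, it is better to process it by hand with a custom recursive function.
Here, you want to keep every key that holds immediate data, except for the products, which should be numbered.
One possible way is to build a set to hold the field names, and to recursively build a list of dicts for the data.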
import csv
import io
import json

import pandas as pd

data = json.loads(js)   # js is the full JSON text from the question

def find_keys(data, keys=None, lst=None, cur=None):
    if keys is None:
        keys = set()    # will contain the field names
        lst = []        # list of dict for the data
        cur = {}        # current data row
    if isinstance(data, list):
        for sub in data:
            cur = cur.copy()    # create a new row for each item of a list
            lst.append(cur)
            find_keys(sub, keys, lst, cur)
    elif isinstance(data, dict):
        for k, v in data.items():
            if k == 'products':     # special processing for products
                for i, p in enumerate(v, 1):
                    for (k1, v1) in p.items():
                        keys.add(k1 + str(i))
                        cur[k1 + str(i)] = v1
            elif isinstance(v, (list, dict)):
                cur = cur.copy()    # a new row for each nested json
                lst.append(cur)
                find_keys(v, keys, lst, cur)
            else:
                keys.add(k)     # a plain data (number or string): feed the row
                cur[k] = v
    return lst, keys

lst, keys = find_keys(data)
# sort the products to come after the other keys:
# productN gets rank 2N and quantityN gets rank 2N + 1
fieldnames = sorted(keys, key=lambda k: 1 + 2*int(k[8:])
                    if k.startswith('quantity')
                    else 2*int(k[7:]) if k.startswith('product') else 0)
# and use the csv module here (written to an in-memory buffer so it can be printed)
with io.StringIO(newline='') as fd:
    wr = csv.DictWriter(fd, fieldnames)
    _ = wr.writeheader()
    wr.writerows(lst)
    print(fd.getvalue())
# or build a dataframe
df = pd.DataFrame(lst, columns=fieldnames)
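The column ordering can be checked in isolation. The sketch below is only an illustration: column_rank is a hypothetical helper name, the key set is made up, and breaking ties by name is an addition here so that the result is deterministic. It assumes that every key starting with product or quantity ends in a number.

```python
def column_rank(name):
    # Plain fields rank 0, productN ranks 2N, quantityN ranks 2N + 1,
    # so each product column is followed by its own quantity column.
    if name.startswith("quantity"):
        return 1 + 2 * int(name[len("quantity"):])
    if name.startswith("product"):
        return 2 * int(name[len("product"):])
    return 0

# Hypothetical set of field names, as the recursive function would collect them.
keys = {"quantity2", "kit_name", "product1", "quantity1", "product2", "flow"}

# Ties (the plain fields) are broken alphabetically; this is an assumption
# of the sketch, since set iteration order is otherwise arbitrary.
fieldnames = sorted(keys, key=lambda k: (column_rank(k), k))
print(fieldnames)
```

A key such as product_code would make int() raise a ValueError, so real data may need a stricter test than startswith.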
If you only want a subset of the columns, you can use reindex:
columns = ['asked_quantity', 'freight_charges', 'driver_name', 'sales_order',
'id', 'transport_by', 'alloted_quantity', 'is_delivered', 'kit_info',
'dispatch_date', 'expected_delivery', 'vehicle_number', 'vehicle_type',
'remarks', 'kit_name', 'lr_number', 'owner', 'transaction_no', 'kit_client',
'driver_number', 'send_from_warehouse', 'flow', 'model', 'components_per_kit',
'product1', 'quantity1', 'quantity2', 'product2',
'quantity3', 'product3', 'quantity4', 'product4', 'quantity5', 'product5',
'product6', 'quantity6'
]
df = df.reindex(columns=columns)
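As a small illustration of what reindex does with a column list, the sketch below uses a hypothetical one-row frame. Columns that are requested but absent are created and filled with NaN, and columns that are not requested are dropped.

```python
import pandas as pd

# Hypothetical one-row frame with a single numbered product.
df = pd.DataFrame([{"kit_name": "KIT1187", "product1": "PP001", "quantity1": 1}])

# Ask for a column that exists, one that does not, and leave quantity1 out.
subset = df.reindex(columns=["kit_name", "product1", "product2"])
print(subset)
```

This is convenient here because kits with fewer products simply get empty cells in the higher-numbered product columns.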
The "classic" way to extract data from JSON like this is as follows:
import json
import pandas as pd

with open("my_file.json") as f:
    d = json.load(f)
df = pd.json_normalize(d, record_path=["flows", "kit", "products"],
                       meta=["transaction_no", "dispatch_date", "send_from_warehouse", "sales_order",
                             ["flows", "flow"],
                             ["flows", "kit", "kit_name"],
                             ["flows", "asked_quantity"],
                             ["flows", "alloted_quantity"]
                       ])
The output is as follows:
product quantity transaction_no dispatch_date send_from_warehouse sales_order flows.flow flows.kit.kit_name flows.asked_quantity flows.alloted_quantity
0 PP001 1 2324 2020-08-11T04:40:34.876000Z Yantraksh Logistics Private limited_GGNPC1 105 LMXMNH_Manesar_Nashik_Transmission Gear Leaver... KIT1162A 3 3
1 PS001 1 2324 2020-08-11T04:40:34.876000Z Yantraksh Logistics Private limited_GGNPC1 105 LMXMNH_Manesar_Nashik_Transmission Gear Leaver... KIT1162A 3 3
2 PL001 1 2324 2020-08-11T04:40:34.876000Z Yantraksh Logistics Private limited_GGNPC1 105 LMXMNH_Manesar_Nashik_Transmission Gear Leaver... KIT1162A 3 3
Does this answer your question? To create one column for the first product, one for the second product, and so on, you can do some pivoting.
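One possible way to do that pivoting is sketched below. It is not necessarily what the answer had in mind: the long-format frame is a hypothetical, shortened sample with one row per product, and the helper column n is introduced only for numbering.

```python
import pandas as pd

# Hypothetical long-format frame: one row per (kit, product).
long_df = pd.DataFrame({
    "kit_name": ["KIT1187", "KIT1187", "KIT1187", "KIT1198B", "KIT1198B"],
    "product": ["PP001", "PS001", "PL001", "CRT6423", "CIN1198A"],
    "quantity": [1, 1, 1, 1, 2],
})

# Number the products inside each kit: 1, 2, 3, ...
long_df["n"] = long_df.groupby("kit_name").cumcount() + 1

# One row per kit, one (field, n) column pair per numbered product.
wide = long_df.pivot(index="kit_name", columns="n", values=["product", "quantity"])

# Flatten the two-level column index into names such as product1, quantity1.
wide.columns = [f"{field}{n}" for field, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Kits with fewer products than the maximum end up with NaN in the unused columns, which to_csv writes as empty fields by default.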
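There is a very simple approach for your JSON:
1. json_normalize() to get a first pass of records (one per kit)
2. explode() the products
3. to_dict(orient="records")
4. json_normalize() again to expand the dictionaries inside products
import pandas as pd

kit = [{'kit': {'products': [{'product': 'PP001', 'quantity': 1},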
{'product': 'PS001', 'quantity': 1},
{'product': 'PL001', 'quantity': 1},
{'product': 'FIN1187', 'quantity': 3},
{'product': 'FSS001', 'quantity': 4}],
'kit_client': 'Lumax Mannoh Allied Technologies Limited',
'kit_name': 'KIT1187',
'kit_info': 'Gear Lever TACO_FLC',
'components_per_kit': 66}},
{'kit': {'products': [{'product': 'CRT6423', 'quantity': 1},
{'product': 'CIN1198A', 'quantity': 2},
{'product': 'CSS001', 'quantity': 3}],
'kit_client': 'Lumax Mannoh Allied Technologies Limited',
'kit_name': 'KIT1198B',
'kit_info': 'Floor Sealing Assy_Crate',
'components_per_kit': 72},
'flow': 'LMXMNH_Manesar_Nashik_Floor Sealing Assy W501',
'asked_quantity': 3,
'alloted_quantity': 3}]
df = pd.json_normalize(pd.json_normalize(kit)\
.explode("kit.products").to_dict(orient="records"))
print(df.loc[[0,1,6,7]].to_string(index=False))
Sample output
kit.kit_client kit.kit_name kit.kit_info kit.components_per_kit flow asked_quantity alloted_quantity kit.products.product kit.products.quantity
Lumax Mannoh Allied Technologies Limited KIT1187 Gear Lever TACO_FLC 66 NaN NaN NaN PP001 1
Lumax Mannoh Allied Technologies Limited KIT1187 Gear Lever TACO_FLC 66 NaN NaN NaN PS001 1
Lumax Mannoh Allied Technologies Limited KIT1198B Floor Sealing Assy_Crate 72 LMXMNH_Manesar_Nashik_Floor Sealing Assy W501 3.0 3.0 CIN1198A 2
Lumax Mannoh Allied Technologies Limited KIT1198B Floor Sealing Assy_Crate 72 LMXMNH_Manesar_Nashik_Floor Sealing Assy W501 3.0 3.0 CSS001 3
The JSON at the external link is three levels deep. Exactly the same pattern applies, and you have a dataframe.
(pd.json_normalize(pd.json_normalize(pd.json_normalize(kit)
.explode("flows")
.to_dict(orient="records"))
.explode("flows.kit.products")
.to_dict(orient="records"))
)
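The same three-level pattern can be tried on a small input. The document below is a hypothetical, heavily trimmed stand-in for the linked JSON (the flow names are made up), and the last line writes the result as CSV text with to_csv.

```python
import pandas as pd

# Hypothetical three-level document: transaction -> flows -> kit -> products.
doc = [{
    "transaction_no": 2324,
    "flows": [
        {"flow": "flow A",
         "kit": {"kit_name": "KIT1162A",
                 "products": [{"product": "PP001", "quantity": 1},
                              {"product": "PS001", "quantity": 1}]}},
        {"flow": "flow B",
         "kit": {"kit_name": "KIT1198B",
                 "products": [{"product": "CRT6423", "quantity": 1}]}},
    ],
}]

# Level 1: one row per flow.
level1 = pd.json_normalize(doc).explode("flows")
# Level 2: expand each flow dict, then one row per product.
level2 = (pd.json_normalize(level1.to_dict(orient="records"))
          .explode("flows.kit.products"))
# Level 3: expand each product dict into its own columns.
df = pd.json_normalize(level2.to_dict(orient="records"))

csv_text = df.to_csv(index=False)
print(csv_text)
```

This gives one row per product (long format). To get one row per kit with numbered product columns, a pivot step would still be needed afterwards.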