How to convert CSV to JSON with multi-level nesting using pandas
I've tried to follow a bunch of answers I've seen on SO, but I'm really stuck here. I'm trying to convert a CSV to JSON.
The JSON schema has multiple levels of nesting, and some of the values in the CSV will be shared.
Here's a link to one record in the CSV.
Think of this sample as two different parties attached to one document.
The fields on the document (document_source_id, document_amount, record_date, source_url, document_file_url, document_type__title, apn, situs_county_id, state_code) should not be duplicated, while the fields of each entity are unique.
I've tried to nest these using a complex groupby statement, but I'm stuck getting the data into my schema.
Here's what I've tried. It doesn't contain all fields because I'm having a difficult time understanding what it all means.
j = (df.groupby(['state_code',
                 'record_date',
                 'situs_county_id',
                 'document_type__title',
                 'document_file_url',
                 'document_amount',
                 'source_url'], as_index=False)
       .apply(lambda x: x[['source_url']].to_dict('records'))
       .reset_index()
       .rename(columns={0: 'metadata', 1: 'parcels'})
       .to_json(orient='records'))
Here's how the sample CSV should be output:
{
  "metadata": {
    "source_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentDetail?doc_id=2019012901225004",
    "document_file_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id=2019012901225004"
  },
  "state_code": "NY",
  "nested_data": {
    "parcels": [
      {
        "apn": "3972-61",
        "situs_county_id": "36005"
      }
    ],
    "participants": [
      {
        "entity": {
          "name": "5 AIF WILLOW, LLC",
          "situs_street": "19800 MACARTHUR BLVD",
          "situs_city": "IRVINE",
          "situs_unit": "SUITE 1150",
          "state_code": "CA",
          "situs_zip": "92612"
        },
        "participation_type": "Grantee"
      },
      {
        "entity": {
          "name": "5 ARCH INCOME FUND 2, LLC",
          "situs_street": "19800 MACARTHUR BLVD",
          "situs_city": "IRVINE",
          "situs_unit": "SUITE 1150",
          "state_code": "CA",
          "situs_zip": "92612"
        },
        "participation_type": "Grantor"
      }
    ]
  },
  "record_date": "01/31/2019",
  "situs_county_id": "36005",
  "document_source_id": "2019012901225004",
  "document_type__title": "ASSIGNMENT, MORTGAGE"
}
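You might need to use the json_normalize function from pandas (available as pandas.json_normalize since pandas 1.0; older versions import it from pandas.io.json).
import csv
from pandas import json_normalize

li = []
# newline='' is what the csv module recommends when opening files
with open('filename.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)  # pass the open file handle, not an undefined name
    for row in reader:
        li.append(row)
df = json_normalize(li)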
Here, we create a list of dictionaries from the CSV file and build a DataFrame from that list with json_normalize.
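For flat rows like these, json_normalize and pd.read_csv should give the same table, so the choice is mostly a matter of taste. The sketch below compares the two on a small in-memory CSV; the three column names are only an assumed subset of the real file, and dtype=str is used so that identifiers stay strings in both cases.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row CSV; the columns are an assumed subset of the real file.
csv_text = (
    "document_source_id,name,participation_type\n"
    '2019012901225004,"5 AIF WILLOW, LLC",Grantee\n'
    '2019012901225004,"5 ARCH INCOME FUND 2, LLC",Grantor\n'
)

# csv.DictReader yields one dict per row, with every value read as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
df_from_dicts = pd.json_normalize(rows)

# pd.read_csv parses the same text directly; dtype=str prevents the
# document id from being converted to an integer.
df_from_csv = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(df_from_dicts.columns.tolist())
print(df_from_dicts.values.tolist() == df_from_csv.values.tolist())
```

Neither call nests anything yet; both only produce the flat frame that the grouping step works on.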
Below is one way to export your data:
# all columns used in groupby()
grouped_cols = ['state_code', 'record_date', 'situs_county_id', 'document_source_id'
              , 'document_type__title', 'source_url', 'document_file_url']

# adjust some column names to map to those in the 'entity' node in the desired JSON
situs_mapping = {
    'street_number_street_name': 'situs_street'
  , 'city_name': 'situs_city'
  , 'unit': 'situs_unit'
  , 'state_code': 'state_code'
  , 'zipcode_full': 'situs_zip'
}

# define columns used for the 'entity' node (the * unpacking needs Python 3.5+)
entity_cols = ['name', *situs_mapping.values()]
# on Python 2, use this instead:
#entity_cols = ['name'] + list(situs_mapping.values())

# specify output fields
output_cols = ['metadata', 'state_code', 'nested_data', 'record_date'
             , 'situs_county_id', 'document_source_id', 'document_type__title']

# define a function to get nested_data for one group of rows
# ('records' is the full name of the orient formerly abbreviated as 'r')
def get_nested_data(d):
    return {
        'parcels': d[['apn', 'situs_county_id']].drop_duplicates().to_dict('records')
      , 'participants': d[['entity', 'participation_type']].to_dict('records')
    }

j = (df.rename(columns=situs_mapping)
       .assign(entity=lambda x: x[entity_cols].to_dict('records'))
       .groupby(grouped_cols)
       .apply(get_nested_data)
       .reset_index()
       .rename(columns={0: 'nested_data'})
       .assign(metadata=lambda x: x[['source_url', 'document_file_url']].to_dict('records'))[output_cols]
       .to_json(orient="records")
)

print(j)
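For readers without the original CSV, here is a self-contained sketch of the same idea on two rows rebuilt from the expected output. It is not the code above: it iterates over the groups and builds plain dicts instead of using groupby().apply(), which may be easier to step through. The flat column names, the reduced entity column list, and the build_documents helper are all assumptions made for illustration.

```python
import json

import pandas as pd

# One document shared by two parties; values are taken from the expected output.
doc = {
    "state_code": "NY",
    "record_date": "01/31/2019",
    "situs_county_id": "36005",
    "document_source_id": "2019012901225004",
    "document_type__title": "ASSIGNMENT, MORTGAGE",
    "source_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentDetail?doc_id=2019012901225004",
    "document_file_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id=2019012901225004",
    "apn": "3972-61",
}
df = pd.DataFrame([
    {**doc, "name": "5 AIF WILLOW, LLC", "situs_city": "IRVINE", "participation_type": "Grantee"},
    {**doc, "name": "5 ARCH INCOME FUND 2, LLC", "situs_city": "IRVINE", "participation_type": "Grantor"},
])

DOC_COLS = ["state_code", "record_date", "situs_county_id", "document_source_id",
            "document_type__title", "source_url", "document_file_url"]
ENTITY_COLS = ["name", "situs_city"]  # assumed subset of the entity fields


def build_documents(frame):
    """Hypothetical helper: return one nested dict per distinct document."""
    documents = []
    # Iterating a groupby yields (key tuple, sub-frame) pairs.
    for key, group in frame.groupby(DOC_COLS, sort=False):
        fields = dict(zip(DOC_COLS, key))
        parcels = group[["apn", "situs_county_id"]].drop_duplicates().to_dict("records")
        entities = group[ENTITY_COLS].to_dict("records")
        participants = [
            {"entity": entity, "participation_type": ptype}
            for entity, ptype in zip(entities, group["participation_type"])
        ]
        documents.append({
            "metadata": {
                "source_url": fields["source_url"],
                "document_file_url": fields["document_file_url"],
            },
            "state_code": fields["state_code"],
            "nested_data": {"parcels": parcels, "participants": participants},
            "record_date": fields["record_date"],
            "situs_county_id": fields["situs_county_id"],
            "document_source_id": fields["document_source_id"],
            "document_type__title": fields["document_type__title"],
        })
    return documents


documents = build_documents(df)
print(json.dumps(documents, indent=2))
```

Every value here is a string, so json.dumps can serialize the result directly; numeric columns would probably need converting to plain Python types first.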
Note: If participants contains duplicates and you must run drop_duplicates() on it as we do on parcels, then the assign(entity=...) step can be moved into the definition of participants in the get_nested_data() function:
      , 'participants': d[['participation_type', *entity_cols]] \
                          .drop_duplicates() \
                          .assign(entity=lambda x: x[entity_cols].to_dict('records')) \
                          .loc[:, ['entity', 'participation_type']] \
                          .to_dict('records')
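A likely reason for this ordering is that dict values are not hashable, so the rows need to be deduplicated while the entity fields are still ordinary flat columns, before they are packed into dicts. The sketch below applies the same chain to three made-up rows, one of which is an exact repeat; the shortened entity_cols list and the dedup_participants name are assumptions for illustration.

```python
import pandas as pd

entity_cols = ["name", "situs_city"]  # assumed subset of the entity fields

# Three rows for one document; the second row repeats the first exactly.
group = pd.DataFrame([
    {"name": "5 AIF WILLOW, LLC", "situs_city": "IRVINE", "participation_type": "Grantee"},
    {"name": "5 AIF WILLOW, LLC", "situs_city": "IRVINE", "participation_type": "Grantee"},
    {"name": "5 ARCH INCOME FUND 2, LLC", "situs_city": "IRVINE", "participation_type": "Grantor"},
])


def dedup_participants(d):
    """Hypothetical helper: deduplicate on flat columns, then nest."""
    return (d[["participation_type", *entity_cols]]
            .drop_duplicates()  # runs while every cell is still a plain string
            .assign(entity=lambda x: x[entity_cols].to_dict("records"))
            .loc[:, ["entity", "participation_type"]]
            .to_dict("records"))


participants = dedup_participants(group)
print(len(participants))
```

The repeated Grantee row is collapsed before nesting, so each party appears once in the result.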