Convert multiple level list of nested dictionaries into single list of dictionaries
I want to convert a multi-level list of nested dictionaries into a single list of dictionaries.
Input:
list_ = [
{'Name': 'Paras Jain',
'Student': [{'Exam': 90,
'Grade': 'a',
'class': [{'age': 10, 'subject': 'hindi'},
{'age': 11, 'subject': 'maths'}]},
{'Exam': 99,
'Grade': 'b',
'class': [{'age': 14, 'subject': 'evs'},
{'age': 15, 'subject': 'science'}]},
{'Exam': 97,
'Grade': 'c',
'class': [{'age': 10, 'subject': 'history'}]}]},
{'Name': 'Chunky Pandey',
'Student': [{'Exam': 89,
'Grade': 'a',
'class': [{'age': 9, 'subject': 'no'}]},
{'Exam': 80, 'Grade': 'b', 'class': []}]},
{'Name': 'abc', 'Student':[]}
]
Desired output:
[{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 10, 'subject': 'hindi'},
{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 11, 'subject': 'maths'},
{'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 14, 'subject': 'evs'},
{'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 15, 'subject': 'science'},
{'Exam': 97, 'Grade': 'c', 'Name': 'Paras Jain', 'age': 10, 'subject': 'history'},
{'Exam': 89, 'Grade': 'a', 'Name': 'Chunky Pandey', 'age': 9, 'subject': 'no'},
{'Exam': 80, 'Grade': 'b', 'Name': 'Chunky Pandey', 'age': 'NA', 'subject': 'NA'},
{'Exam': 'NA', 'Grade': 'NA', 'Name': 'abc', 'age': 'NA', 'subject': 'NA'}]
Let's try:
import pandas as pd

# explode() repeats the original index for every exploded row, so reset it;
# otherwise the index-based join below would duplicate rows.
df = (pd.json_normalize(list_, record_path='Student', meta='Name')
        .explode('class')
        .reset_index(drop=True))
print(df)
Exam Grade class Name
0 90 a {'age': 10, 'subject': 'hindi'} Paras Jain
1 90 a {'age': 11, 'subject': 'maths'} Paras Jain
2 99 b {'age': 14, 'subject': 'evs'} Paras Jain
3 99 b {'age': 15, 'subject': 'science'} Paras Jain
4 97 c {'age': 10, 'subject': 'history'} Paras Jain
5 89 a {'age': 9, 'subject': 'no'} Chunky Pandey
6 80 b NaN Chunky Pandey
# Expand each class dict into its own columns; the NaN row produces a stray column 0, which is dropped.
out = df.join(df.pop('class').apply(pd.Series)).drop(columns=0).to_dict(orient='records')
print(out)
[{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'hindi'},
{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 11.0, 'subject': 'maths'},
{'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 14.0, 'subject': 'evs'},
{'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 15.0, 'subject': 'science'},
{'Exam': 97, 'Grade': 'c', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'history'},
{'Exam': 89, 'Grade': 'a', 'Name': 'Chunky Pandey', 'age': 9.0, 'subject': 'no'},
{'Exam': 80, 'Grade': 'b', 'Name': 'Chunky Pandey', 'age': nan, 'subject': nan}]
Note that the 'abc' record, whose Student list is empty, produces no rows with this approach.
First, let's fix up your empty class lists so that pd.json_normalize doesn't ignore them:
for i, x in enumerate(list_):
    for j, y in enumerate(x['Student']):
        if not y['class']:
            list_[i]['Student'][j]['class'] = [{}]
Then we can use pd.json_normalize:
# Mark the deepest level (Student.class),
# and all the meta levels (Student.Exam, Student.Grade, and Name):
df = pd.json_normalize(list_, ['Student', 'class'], [['Student', 'Exam'], ['Student', 'Grade'], 'Name'])
# Fix up the column names, we don't need the `Student.` prefix here.
df.columns = df.columns.str.replace('Student.', '', regex=False)
# Convert to dictionary, `records` is what your format is known as.
out = df.to_dict('records')
print(out) # pprint(out, width=150)
Output:
[{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'hindi'},
{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 11.0, 'subject': 'maths'},
{'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 14.0, 'subject': 'evs'},
{'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 15.0, 'subject': 'science'},
{'Exam': 97, 'Grade': 'c', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'history'},
{'Exam': 89, 'Grade': 'a', 'Name': 'Chunky Pandey', 'age': 9.0, 'subject': 'no'},
{'Exam': 80, 'Grade': 'b', 'Name': 'Chunky Pandey', 'age': nan, 'subject': nan}]
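The records above still differ from the desired output in two ways: missing values show up as nan rather than 'NA', and age comes back as a float. Below is a minimal post-processing sketch in plain Python. The helper name clean_record and the decision to cast whole-number floats back to int are assumptions made for illustration, not part of the original answer; the two sample records are just shaped like the output above.

```python
import math

def clean_record(record, missing='NA'):
    """Return a copy of `record` with NaN replaced by `missing` and
    whole-number floats (e.g. 10.0) converted back to int."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, float):
            if math.isnan(value):
                value = missing          # nan -> 'NA'
            elif value.is_integer():
                value = int(value)       # 10.0 -> 10
        cleaned[key] = value
    return cleaned

# Two sample records shaped like the pandas output above.
sample = [
    {'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'hindi'},
    {'Exam': 80, 'Grade': 'b', 'Name': 'Chunky Pandey',
     'age': float('nan'), 'subject': float('nan')},
]
cleaned = [clean_record(r) for r in sample]
print(cleaned)
```

With the real data this would be applied as [clean_record(r) for r in out]. It builds new dictionaries and leaves the originals untouched.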
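I tried the following two approaches, but I'm looking for a more optimized one. To test with 2,700,000 records, I simply multiplied the list: [{}] * 900000 (presumably the 3-record list above repeated 900,000 times).
First approach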
# tested with 2700000 records
import time
start_time = time.time()
rows = []
for data in list_:
    if data['Student']:
        for row in data['Student']:
            if row["class"]:
                for in_row in row['class']:
                    in_row['Exam'] = row['Exam']
                    in_row['Grade'] = row['Grade']
                    in_row['age'] = in_row['age']
                    in_row['subject'] = in_row['subject']
                    in_row['Name'] = data['Name']
                    rows.append(in_row)
            else:
                rows.append({
                    'Exam': row['Exam'],
                    'Grade': row['Grade'],
                    'age': '',
                    'subject': '',
                    'Name': data['Name']
                })
    else:
        rows.append({
            'Exam': '',
            'Grade': '',
            'age': '',
            'subject': '',
            'Name': data['Name']
        })
end_time = time.time()
print(end_time-start_time)
# 5.117190837860107
Second approach
# tested with 2700000 records
import time
start_time = time.time()
rows = []
for data in list_:
    if data['Student']:
        for row in data['Student']:
            row['Exam'] = row['Exam']
            row['Grade'] = row['Grade']
            row['class'] = row['class']
            row['Name'] = data['Name']
            rows.append(row)
    else:
        rows.append({
            'Exam': '',
            'Grade': '',
            'class': [],
            'Name': data['Name']
        })
res = []
for data in rows:
    if data["class"]:
        for in_row in data['class']:
            in_row['Exam'] = data['Exam']
            in_row['Grade'] = data['Grade']
            in_row['age'] = in_row['age']
            in_row['subject'] = in_row['subject']
            in_row['Name'] = data['Name']
            res.append(in_row)
    else:
        res.append({
            'Exam': data['Exam'],
            'Grade': data['Grade'],
            'age': '',
            'subject': '',
            'Name': data['Name']
        })
end_time = time.time()
print(end_time-start_time)
# 15.956519842147827
Can anyone help me? How can I convert the two approaches above into list comprehensions?
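One possible way to express the nested loops as a single list comprehension is sketched below. It is an illustration rather than a benchmarked answer: the function name flatten, the missing parameter, and the two placeholder dictionaries are assumptions introduced here. The trick is to substitute a one-element placeholder list whenever Student or class is empty, so that every level still yields exactly one row.

```python
def flatten(records, missing='NA'):
    """Flatten Name -> Student -> class into one dict per class entry.

    Empty 'Student' or 'class' lists are replaced by a single placeholder
    so the record still appears once, filled with `missing`.
    """
    empty_class = {'age': missing, 'subject': missing}
    empty_student = {'Exam': missing, 'Grade': missing, 'class': []}
    return [
        {
            'Exam': student['Exam'],
            'Grade': student['Grade'],
            'Name': person['Name'],
            'age': cls['age'],
            'subject': cls['subject'],
        }
        for person in records
        for student in (person['Student'] or [empty_student])
        for cls in (student['class'] or [empty_class])
    ]

# A small sample in the same shape as the input above.
sample = [
    {'Name': 'Chunky Pandey',
     'Student': [{'Exam': 89, 'Grade': 'a',
                  'class': [{'age': 9, 'subject': 'no'}]},
                 {'Exam': 80, 'Grade': 'b', 'class': []}]},
    {'Name': 'abc', 'Student': []},
]
flat = flatten(sample)
for row in flat:
    print(row)
```

Unlike the two loop versions, this builds a new dictionary for every row instead of mutating the input, which matters when the test data is made by repeating the same list objects. Whether it is actually faster should be measured on the real data.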