Convert multiple level list of nested dictionaries into single list of dictionaries

I want to convert a multi-level list of nested dictionaries into a single, flat list of dictionaries.

Input:

list_ = [
 {'Name': 'Paras Jain',
  'Student': [{'Exam': 90,
               'Grade': 'a',
               'class': [{'age': 10, 'subject': 'hindi'},
                         {'age': 11, 'subject': 'maths'}]},
              {'Exam': 99,
               'Grade': 'b',
               'class': [{'age': 14, 'subject': 'evs'},
                         {'age': 15, 'subject': 'science'}]},
              {'Exam': 97,
               'Grade': 'c',
               'class': [{'age': 10, 'subject': 'history'}]}]},
 {'Name': 'Chunky Pandey',
  'Student': [{'Exam': 89,
               'Grade': 'a',
               'class': [{'age': 9, 'subject': 'no'}]},
              {'Exam': 80, 'Grade': 'b', 'class': []}]},
 {'Name': 'abc', 'Student':[]}
]

Desired output:

[{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 10, 'subject': 'hindi'},
 {'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 11, 'subject': 'maths'},
 {'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 14, 'subject': 'evs'},
 {'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 15, 'subject': 'science'},
 {'Exam': 97, 'Grade': 'c', 'Name': 'Paras Jain', 'age': 10, 'subject': 'history'},
 {'Exam': 89, 'Grade': 'a', 'Name': 'Chunky Pandey', 'age': 9, 'subject': 'no'},
 {'Exam': 80, 'Grade': 'b', 'Name': 'Chunky Pandey', 'age': 'NA', 'subject': 'NA'},
 {'Exam': 'NA', 'Grade': 'NA', 'Name': 'abc', 'age': 'NA', 'subject': 'NA'}]

Let's try:

import pandas as pd

# ignore_index=True gives each exploded row a unique index, so the join below does not duplicate rows.
df = pd.json_normalize(list_, record_path='Student', meta='Name').explode('class', ignore_index=True)
print(df)

   Exam Grade                              class           Name
0    90     a    {'age': 10, 'subject': 'hindi'}     Paras Jain
1    90     a    {'age': 11, 'subject': 'maths'}     Paras Jain
2    99     b      {'age': 14, 'subject': 'evs'}     Paras Jain
3    99     b  {'age': 15, 'subject': 'science'}     Paras Jain
4    97     c  {'age': 10, 'subject': 'history'}     Paras Jain
5    89     a        {'age': 9, 'subject': 'no'}  Chunky Pandey
6    80     b                                NaN  Chunky Pandey

# Expand each class dict into columns; the NaN row yields a stray column 0, which is dropped.
out = df.join(df.pop('class').apply(pd.Series)).drop(columns=0).to_dict(orient='records')
print(out)

[{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'hindi'},
 {'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 11.0, 'subject': 'maths'},
 {'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 14.0, 'subject': 'evs'},
 {'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 15.0, 'subject': 'science'},
 {'Exam': 97, 'Grade': 'c', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'history'},
 {'Exam': 89, 'Grade': 'a', 'Name': 'Chunky Pandey', 'age': 9.0, 'subject': 'no'},
 {'Exam': 80, 'Grade': 'b', 'Name': 'Chunky Pandey', 'age': nan, 'subject': nan}]

First, let's patch your empty class lists so that pd.json_normalize doesn't skip them:

for x in list_:
    for y in x['Student']:
        if not y['class']:
            # One empty dict keeps this Student entry as a row of missing values.
            y['class'] = [{}]

Then we can use pd.json_normalize:

import pandas as pd

# Mark the deepest level (Student.class),
# and all the meta levels (Student.Exam, Student.Grade, and Name):
df = pd.json_normalize(list_, ['Student', 'class'], [['Student', 'Exam'], ['Student', 'Grade'], 'Name'])

# Fix up the column names, we don't need the `Student.` prefix here.
df.columns = df.columns.str.replace('Student.', '', regex=False)

# Convert to dictionary, `records` is what your format is known as.
out = df.to_dict('records')
print(out) # pprint(out, width=150)

Output:

[{'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'hindi'},
 {'Exam': 90, 'Grade': 'a', 'Name': 'Paras Jain', 'age': 11.0, 'subject': 'maths'},
 {'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 14.0, 'subject': 'evs'},
 {'Exam': 99, 'Grade': 'b', 'Name': 'Paras Jain', 'age': 15.0, 'subject': 'science'},
 {'Exam': 97, 'Grade': 'c', 'Name': 'Paras Jain', 'age': 10.0, 'subject': 'history'},
 {'Exam': 89, 'Grade': 'a', 'Name': 'Chunky Pandey', 'age': 9.0, 'subject': 'no'},
 {'Exam': 80, 'Grade': 'b', 'Name': 'Chunky Pandey', 'age': nan, 'subject': nan}]
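
The records above carry nan and float ages (10.0), while the desired output in the question uses 'NA' and integer ages. A minimal post-processing sketch in plain Python follows; the helper name clean_record and its placeholder parameter are made up for illustration, and it assumes the values are Python floats (NumPy's float64 is a subclass of float). Note that it would also turn a genuinely fractional-free float such as an Exam score of 90.0 into 90.

```python
import math


def clean_record(record, placeholder='NA'):
    """Return a copy with NaN replaced by `placeholder` and whole floats cast to int."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, float):
            if math.isnan(value):
                value = placeholder      # missing value -> 'NA'
            elif value.is_integer():
                value = int(value)       # 9.0 -> 9
        cleaned[key] = value
    return cleaned


# Two rows shaped like the pandas output above (nan written as float('nan')).
raw = [
    {'Exam': 89, 'Grade': 'a', 'Name': 'Chunky Pandey', 'age': 9.0, 'subject': 'no'},
    {'Exam': 80, 'Grade': 'b', 'Name': 'Chunky Pandey',
     'age': float('nan'), 'subject': float('nan')},
]

cleaned = [clean_record(r) for r in raw]
for row in cleaned:
    print(row)
```

With the DataFrame-based answers this would be applied as [clean_record(r) for r in out].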

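I have tried the two approaches below, but I am looking for a more optimized one. I tested with 2,700,000 records, produced simply by multiplying the three-record list above (in shorthand, [{}] * 900000).

First approach: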

# tested with 2700000 records
import time
start_time = time.time()
rows = []
  
for data in list_:
    if data['Student']:
        for row in data['Student']:
            if row["class"]: 
                for in_row in row['class']:
                    # Copy the parent fields into the innermost record;
                    # age and subject are already there. This mutates the input dicts.
                    in_row['Exam'] = row['Exam']
                    in_row['Grade'] = row['Grade']
                    in_row['Name'] = data['Name']
                    rows.append(in_row)
            else:
                rows.append({
                    'Exam': row['Exam'],
                    'Grade': row['Grade'],
                    'age': '',
                    'subject': '',
                    'Name': data['Name']
                })
    else:
        rows.append({
            'Exam': '',
            'Grade': '',
            'age': '',
            'subject': '',
            'Name': data['Name']
        })


end_time = time.time()
print(end_time-start_time)
# 5.117190837860107

Second approach:

# tested with 2700000 records
import time
start_time = time.time()
rows = []
for data in list_:
    if data['Student']:
        for row in data['Student']:
            # Exam, Grade and class are already in row; only Name has to be added.
            row['Name'] = data['Name']
            rows.append(row)
    else:
        rows.append({
            'Exam': '',
            'Grade': '',
            'class': [],
            'Name': data['Name']
        })



res = []

for data in rows:
    if data["class"]:
        for in_row in data['class']:
            # age and subject are already in in_row; copy the parent fields down.
            in_row['Exam'] = data['Exam']
            in_row['Grade'] = data['Grade']
            in_row['Name'] = data['Name']
            res.append(in_row)
    else:
        res.append({
            'Exam': data['Exam'],
            'Grade': data['Grade'],
            'age': '',
            'subject': '',
            'Name': data['Name']
        })


end_time = time.time()
print(end_time-start_time)
# 15.956519842147827
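
One caveat about the measurements above: multiplying a list repeats references to the same dict objects, and both loops write into those dicts, so every repetition touches the same few objects. A rough sketch of a fairer harness is below. The names make_data, time_call and count_rows are hypothetical helpers, count_rows is only a stand-in for the code under test, and the repeat count is kept small because deep-copying millions of records takes a while.

```python
import copy
import time


def make_data(base, repeat):
    """Build `repeat` independent deep copies of `base`, concatenated into one list."""
    data = []
    for _ in range(repeat):
        data.extend(copy.deepcopy(base))
    return data


def time_call(func, data):
    """Return (elapsed_seconds, result) for a single call of func(data)."""
    start = time.perf_counter()  # monotonic clock intended for short timings
    result = func(data)
    return time.perf_counter() - start, result


def count_rows(records):
    """Hypothetical stand-in for the code under test: counts the flat rows."""
    total = 0
    for person in records:
        students = person['Student']
        if not students:
            total += 1  # an empty Student list still yields one row
            continue
        for student in students:
            total += max(len(student['class']), 1)  # empty class yields one row
    return total


base = [
    {'Name': 'x',
     'Student': [{'Exam': 1, 'Grade': 'a',
                  'class': [{'age': 1, 'subject': 's'}, {'age': 2, 'subject': 't'}]},
                 {'Exam': 2, 'Grade': 'b', 'class': []}]},
    {'Name': 'y', 'Student': []},
]

data = make_data(base, 1000)
elapsed, rows = time_call(count_rows, data)
print(f'{rows} rows in {elapsed:.4f} s')
```

Because each approach mutates its input, a fresh make_data(...) result should be built for every function being timed.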

Can anyone help me? How can I convert the two approaches above into a list comprehension?
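
One possible way to write the flattening as a single list comprehension, offered as a sketch rather than a benchmarked solution. The function name flatten and the constant NA are made up here; 'NA' is taken from the desired output in the question (the loops above use '' instead). The `or [{}]` trick substitutes one empty placeholder whenever a level is empty, so the parent still produces a row. Unlike the loops above, it builds new dicts and leaves the input untouched.

```python
NA = 'NA'  # placeholder for missing values, as in the desired output


def flatten(records):
    """Flatten Name -> Student -> class into one list of flat dicts."""
    return [
        {
            'Exam': student.get('Exam', NA),
            'Grade': student.get('Grade', NA),
            'Name': person['Name'],
            'age': cls.get('age', NA),
            'subject': cls.get('subject', NA),
        }
        for person in records
        # An empty Student list becomes one empty student, giving an all-NA row.
        for student in (person['Student'] or [{}])
        # An empty or missing class list becomes one empty class entry.
        for cls in (student.get('class') or [{}])
    ]


# A shortened version of the input from the question.
sample = [
    {'Name': 'Paras Jain',
     'Student': [{'Exam': 90, 'Grade': 'a',
                  'class': [{'age': 10, 'subject': 'hindi'},
                            {'age': 11, 'subject': 'maths'}]}]},
    {'Name': 'Chunky Pandey',
     'Student': [{'Exam': 80, 'Grade': 'b', 'class': []}]},
    {'Name': 'abc', 'Student': []},
]

flat = flatten(sample)
for row in flat:
    print(row)
```

Whether this is faster than the explicit loops on 2,700,000 records would need to be measured; the comprehension mainly removes the nested if/else branches.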
