Python - Creating a DataFrame from nested dictionaries of unequal length

Question

I searched for similarly worded questions but didn't find a solution, so I am asking my own.

I am using Python. I am trying to create a DataFrame from a nested dictionary (with lists), and I can't get the columns to come out the same length.

My data structure is:

[
   {
      "Info":{
         "StudentID":1000009,
         "firstName":"Name",
         "middleName":"Middle",
         "lastName":"Last"
      },
      "Pass":[
         {
            "Subject":"Math",
            "Score":98
         },
         {
            "Subject":"Science",
            "Score":100
         }
      ]
   },
   {
      "Info":{
         "StudentID":1000010,
         "firstName":"Name",
         "middleName":"Middle",
         "lastName":"Last"
      },
      "Pass":[
         {
            "Subject":"Math",
            "Score":90
         },
         {
            "Subject":"Science",
            "Score":82
         },
         {
            "Subject":"English",
            "Score":99
         }
      ]
   }
]

I want the DataFrame to be:

        ID firstName middleName lastName Subject1 Subject2 Subject3
0  1000009      Name     Middle     Last     Math  Science      NaN
1  1000010      Name     Middle     Last     Math  Science  English

The real data is more complex (more subjects), but what ends up happening is that I get an error saying all arrays must be the same length:

ValueError: All arrays must be of the same length

I tried:

from collections import defaultdict

import pandas as pd

p = defaultdict(list)

num = len(data)

for i in range(num):
    crd = data[i]['Info']['StudentID']
    p['crd'].append(crd)

    firstName = data[i]['Info']['firstName']
    p['firstName'].append(firstName)

    middleName = data[i]['Info']['middleName']
    p['middleName'].append(middleName)

    lastName = data[i]['Info']['lastName']
    p['lastName'].append(lastName)

    subjects = len(data[i]['Pass'])

    for e in range(subjects):
        exam = data[i]['Pass'][e]['Subject']
        # Each SubjectN list only grows when a student has an Nth subject,
        # so the lists end up with different lengths.
        p[f'Subject{e}'].append(exam)

df = pd.DataFrame.from_dict(p, orient='index')
df

But the data isn't aligned: each value must be tied to the correct ID. Instead, the values are listed in the order they appear. In other words, there isn't a missing value in Subject3.

I've also tried creating a list, then creating a new list inside the for loop for each record.

import pandas as pd

rec = []

num = len(data)

for i in range(num):
    p = []
    crd = data[i]['Info']['StudentID']
    p.append(crd)

    firstName = data[i]['Info']['firstName']
    p.append(firstName)

    middleName = data[i]['Info']['middleName']
    p.append(middleName)

    lastName = data[i]['Info']['lastName']
    p.append(lastName)

    subjects = len(data[i]['Pass'])

    for e in range(subjects):
        exam = data[i]['Pass'][e]['Subject']
        # Values are stored by position only, not by column name
        p.append(exam)
    rec.append(p)

df = pd.DataFrame(rec)
df

With this code, I get misaligned info. The columns don't hold consistent information. For example, if someone doesn't have a middle name in the data, everything gets shifted to the left.

Any solutions?

Answer 1

For your data, you can use json_normalize, then a bit of manipulation:

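import pandas as pd

# put data into a frame; the nested "Info" dict is flattened into Info.* columns
tmp = pd.json_normalize(data)

# one row per entry in each student's "Pass" list, keeping the student's index
passes = tmp['Pass'].explode()
# expand each pass dict into Subject/Score columns and join back on the index
out = tmp.drop(columns='Pass').join(pd.DataFrame(passes.tolist(), index=passes.index))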

Then out is:

   Info.StudentID Info.firstName Info.middleName Info.lastName  Subject  Score
0         1000009           Name          Middle          Last     Math     98
0         1000009           Name          Middle          Last  Science    100
1         1000010           Name          Middle          Last     Math     90
1         1000010           Name          Middle          Last  Science     82
1         1000010           Name          Middle          Last  English     99

From here, you can pivot the Subject
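
The answer stops at the pivot step, so here is a minimal sketch of one way it might be done, not necessarily what the answerer had in mind. It numbers each student's subjects and then reshapes with set_index and unstack, which is one of several ways to pivot in pandas. The sample data is copied from the question; the names `n`, `info_cols`, and `wide` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
data = [
    {
        "Info": {"StudentID": 1000009, "firstName": "Name",
                 "middleName": "Middle", "lastName": "Last"},
        "Pass": [{"Subject": "Math", "Score": 98},
                 {"Subject": "Science", "Score": 100}],
    },
    {
        "Info": {"StudentID": 1000010, "firstName": "Name",
                 "middleName": "Middle", "lastName": "Last"},
        "Pass": [{"Subject": "Math", "Score": 90},
                 {"Subject": "Science", "Score": 82},
                 {"Subject": "English", "Score": 99}],
    },
]

# Same steps as in the answer: flatten, explode, join
tmp = pd.json_normalize(data)
passes = tmp["Pass"].explode()
out = tmp.drop(columns="Pass").join(
    pd.DataFrame(passes.tolist(), index=passes.index)
)

# Number each student's subjects 1, 2, 3, ... (hypothetical helper column "n").
# to_numpy() sidesteps index alignment, since out has repeated index labels.
out["n"] = (out.groupby(level=0).cumcount() + 1).to_numpy()

# Pivot: one row per student, one column per subject position.
# Students with fewer subjects get NaN in the trailing columns.
info_cols = [c for c in out.columns if c.startswith("Info.")]
wide = out.set_index(info_cols + ["n"])["Subject"].unstack("n")
wide.columns = [f"Subject{n}" for n in wide.columns]
wide = wide.reset_index()

# Optional: drop the "Info." prefix that json_normalize added
wide = wide.rename(columns=lambda c: c.replace("Info.", "", 1))

print(wide)
```

Because every value stays attached to its student's row until the reshape, a student with only two subjects ends up with NaN in Subject3 instead of shifting anyone else's data. The same reshape applied to "Score" instead of "Subject" would give the scores in matching columns.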
