Python - Creating DataFrame with Nested Dictionary and Unequal Lengths
I searched for similarly worded questions but didn't find a solution, so I am asking my own.
I am using Python. I am trying to create a DataFrame from a nested dictionary (with lists), and I can't get the column lengths right.
My data structure is:
[
    {
        "Info": {
            "StudentID": 1000009,
            "firstName": "Name",
            "middleName": "Middle",
            "lastName": "Last"
        },
        "Pass": [
            {"Subject": "Math", "Score": 98},
            {"Subject": "Science", "Score": 100}
        ]
    },
    {
        "Info": {
            "StudentID": 1000010,
            "firstName": "Name",
            "middleName": "Middle",
            "lastName": "Last"
        },
        "Pass": [
            {"Subject": "Math", "Score": 90},
            {"Subject": "Science", "Score": 82},
            {"Subject": "English", "Score": 99}
        ]
    }
]
I want the DataFrame to be:
        ID firstName middleName lastName Subject1 Subject2 Subject3
0  1000009      Name     Middle     Last     Math  Science      NaN
1  1000010      Name     Middle     Last     Math  Science  English
The real data is more complex (more subjects), but what ends up happening is that I get an error saying all arrays must be the same length:
ValueError: All arrays must be of the same length
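For context, the error can be reproduced in isolation. The snippet below is a minimal sketch with made-up column contents (not the real data): a dict of lists with different lengths gives pandas no way to decide which row each value belongs to, so the constructor refuses it.

```python
import pandas as pd

# Hypothetical columns: two students have a first subject,
# but only one of them has a third subject.
columns = {"Subject1": ["Math", "Math"], "Subject3": ["English"]}

try:
    pd.DataFrame(columns)
    error_message = None
except ValueError as exc:
    # The exact wording of the message can differ between pandas versions.
    error_message = str(exc)

print(error_message)
```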
I tried:
from collections import defaultdict
import pandas as pd

# one list per column; lists end up with different lengths
p = defaultdict(list)
num = len(data)
for i in range(0, num):
    crd = data[i]['Info']['StudentID']
    p['crd'].append(crd)
    firstName = data[i]['Info']['firstName']
    p['firstName'].append(firstName)
    middleName = data[i]['Info']['middleName']
    p['middleName'].append(middleName)
    lastName = data[i]['Info']['lastName']
    p['lastName'].append(lastName)
    subjects = len(data[i]['Pass'])
    for e in range(0, subjects):
        try:
            exam = data[i]['Pass'][e]['Subject']
            # only students who have subject number e append here
            p[f'Subject{e}'].append(exam)
        except KeyError:
            break
# orient='index' pads the shorter lists at the end instead of raising
df = pd.DataFrame.from_dict(p, orient='index')
df
But the data isn't aligned. Each value must be tied to the correct ID; instead, the values are listed in the order in which they appear. In other words, there is no missing value in Subject3.
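One way to keep each subject tied to its ID, sketched here under assumptions rather than offered as the definitive fix, is to store every column as a dict keyed by student ID instead of a plain list. pandas then aligns the values on those keys and fills the gaps with NaN. The variable names `columns` and `student_id` are illustrative, and the sample data is trimmed to the fields needed.

```python
from collections import defaultdict
import pandas as pd

# Trimmed copy of the sample data from the question.
data = [
    {"Info": {"StudentID": 1000009},
     "Pass": [{"Subject": "Math"}, {"Subject": "Science"}]},
    {"Info": {"StudentID": 1000010},
     "Pass": [{"Subject": "Math"}, {"Subject": "Science"}, {"Subject": "English"}]},
]

# column name -> {student ID -> value}
columns = defaultdict(dict)
for record in data:
    student_id = record["Info"]["StudentID"]
    for position, entry in enumerate(record["Pass"], start=1):
        columns[f"Subject{position}"][student_id] = entry["Subject"]

# The inner keys become the row index, so a student without a
# third subject gets NaN in Subject3 instead of shifting values.
df = pd.DataFrame(columns)
print(df)
```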
I've also tried creating a list, then creating a new list inside the for loop:
import pandas as pd

rec = []
num = len(data)
for i in range(0, num):
    # one positional list per student
    p = []
    crd = data[i]['Info']['StudentID']
    p.append(crd)
    firstName = data[i]['Info']['firstName']
    p.append(firstName)
    middleName = data[i]['Info']['middleName']
    p.append(middleName)
    lastName = data[i]['Info']['lastName']
    p.append(lastName)
    subjects = len(data[i]['Pass'])
    for e in range(0, subjects):
        try:
            exam = data[i]['Pass'][e]['Subject']
            p.append(exam)
        except KeyError:
            break
    rec.append(p)
# columns are assigned purely by position in each list
df = pd.DataFrame(rec)
df
With this code, I get misaligned info. The columns don't hold consistent fields. For example, if someone doesn't have a middle name in the data, everything gets shifted to the left.
Any solutions?
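A plain-Python option, sketched under assumptions, is to build one dict per student instead of one list. pandas matches dict keys to column names, so a missing key becomes NaN in its own column rather than shifting the remaining values. In the sample below the first student's missing `middleName` is an assumption added to illustrate the case described above; `rows` and `row` are illustrative names.

```python
import pandas as pd

data = [
    # Assumed variation of the sample: this student has no middleName key.
    {"Info": {"StudentID": 1000009, "firstName": "Name", "lastName": "Last"},
     "Pass": [{"Subject": "Math", "Score": 98}, {"Subject": "Science", "Score": 100}]},
    {"Info": {"StudentID": 1000010, "firstName": "Name", "middleName": "Middle", "lastName": "Last"},
     "Pass": [{"Subject": "Math", "Score": 90}, {"Subject": "Science", "Score": 82},
              {"Subject": "English", "Score": 99}]},
]

rows = []
for record in data:
    row = dict(record["Info"])  # copy the identifying fields by name
    for position, entry in enumerate(record["Pass"], start=1):
        row[f"Subject{position}"] = entry["Subject"]
    rows.append(row)

# Columns are the union of all keys; absent keys are filled with NaN.
df = pd.DataFrame(rows)
print(df)
```

Because values are placed by key rather than by position, the column order may differ from the desired layout; selecting the columns in an explicit order afterwards would fix that.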
For your data, you can use json_normalize, then a bit of manipulation:
# put data into a frame
tmp = pd.json_normalize(data)
passes = tmp['Pass'].explode()
out = tmp.drop(columns='Pass').join(pd.DataFrame(passes.tolist(), index=passes.index))
Then out is:
Info.StudentID Info.firstName Info.middleName Info.lastName Subject Score
0 1000009 Name Middle Last Math 98
0 1000009 Name Middle Last Science 100
1 1000010 Name Middle Last Math 90
1 1000010 Name Middle Last Science 82
1 1000010 Name Middle Last English 99
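As a possible alternative to the explode-and-join step, `pd.json_normalize` also accepts `record_path` and `meta` arguments, which produce the same long layout in one call. This is a sketch on the sample data, not a claim about what fits the real, more complex data; `long_df` is an illustrative name.

```python
import pandas as pd

data = [
    {"Info": {"StudentID": 1000009, "firstName": "Name", "middleName": "Middle", "lastName": "Last"},
     "Pass": [{"Subject": "Math", "Score": 98}, {"Subject": "Science", "Score": 100}]},
    {"Info": {"StudentID": 1000010, "firstName": "Name", "middleName": "Middle", "lastName": "Last"},
     "Pass": [{"Subject": "Math", "Score": 90}, {"Subject": "Science", "Score": 82},
              {"Subject": "English", "Score": 99}]},
]

# One output row per entry in "Pass"; each meta path is copied onto
# every row that came from the same student.
long_df = pd.json_normalize(
    data,
    record_path="Pass",
    meta=[["Info", "StudentID"], ["Info", "firstName"],
          ["Info", "middleName"], ["Info", "lastName"]],
)
print(long_df)
```

Note that in this variant the Subject and Score columns come first and the row index is a plain 0..4 range rather than repeating the student position.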
From here, you can pivot the Subject column.
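The pivot step is not spelled out, so the following is one possible way to do it, offered as a sketch. It numbers each student's subjects with `groupby(...).cumcount()` and then spreads them into Subject1, Subject2, ... columns with `unstack`. The helper column name `slot` and the variables `info_cols` and `wide` are assumptions introduced for illustration.

```python
import pandas as pd

data = [
    {"Info": {"StudentID": 1000009, "firstName": "Name", "middleName": "Middle", "lastName": "Last"},
     "Pass": [{"Subject": "Math", "Score": 98}, {"Subject": "Science", "Score": 100}]},
    {"Info": {"StudentID": 1000010, "firstName": "Name", "middleName": "Middle", "lastName": "Last"},
     "Pass": [{"Subject": "Math", "Score": 90}, {"Subject": "Science", "Score": 82},
              {"Subject": "English", "Score": 99}]},
]

# Long frame built as in the answer: one row per (student, subject).
tmp = pd.json_normalize(data)
passes = tmp["Pass"].explode()
out = tmp.drop(columns="Pass").join(pd.DataFrame(passes.tolist(), index=passes.index))
out = out.reset_index(drop=True)  # drop the repeated 0, 0, 1, 1, 1 index

# Label each student's subjects Subject1, Subject2, ... in order of appearance.
out["slot"] = "Subject" + (out.groupby("Info.StudentID").cumcount() + 1).astype(str)

info_cols = ["Info.StudentID", "Info.firstName", "Info.middleName", "Info.lastName"]
# Move the slot labels into columns; students with fewer subjects get NaN.
wide = out.set_index(info_cols + ["slot"])["Subject"].unstack("slot").reset_index()
wide.columns.name = None
print(wide)
```

With more than nine subjects per student, the labels would sort as strings (Subject10 before Subject2), so the columns may need reordering in that case.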