Unfold a nested dictionary with lists into a pandas DataFrame
I have a nested dictionary whose sub-dictionaries hold lists as values:
nested_dict = {'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
               'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]}, ... }
Each list in the sub-dictionaries has at least two elements, but there could be more.
I would like to "unfold" this dictionary into a pandas DataFrame, with one column for the first dictionary keys (e.g. 'string1', 'string2', ...), one column for the sub-dictionary keys, one column for the first item in the list, one column for the next item, and so on.
Here is what the output should look like:
col1     col2   col3  col4  col5  col6
string1  69     1231  232
string1  67     682   12
string1  65     1     1
string2  28672  82    23
string2  22736  82    93    1102  102
string2  19423  64    23
Naturally, I tried to use pd.DataFrame.from_dict:
new_df = pd.DataFrame.from_dict({(i,j): nested_dict[i][j]
                                 for i in nested_dict.keys()
                                 for j in nested_dict[i].keys()
                                 ...
Now I'm stuck, and there are several open problems:
How do I parse the values (i.e. nested_dict[i].values()) so that each element becomes a new pandas DataFrame column?
The above will not actually create a column for each field.
The above will not fill the columns with the right elements, e.g. string1 should appear in every row that comes from one of its sub-dictionary key-value pairs. (For col5 and col6, I can fill the NA with zeros.)
I'm not sure how to name these columns correctly.
Here's a method that uses a recursive generator to unroll the nested dictionaries. It doesn't assume that you have exactly two levels, but keeps unrolling each dict until it hits a list.
import pandas as pd

nested_dict = {
    'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]},
    'string3': [101, 102]}

def unroll(data):
    if isinstance(data, dict):
        for key, value in data.items():
            # Recursively unroll the next level and prepend the key to each row.
            for row in unroll(value):
                yield [key] + row
    elif isinstance(data, list):
        # This is the bottom of the structure (defines exactly one row).
        yield data

df = pd.DataFrame(list(unroll(nested_dict)))
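To see what the generator yields before pandas gets involved, here is a small sketch that runs the same unroll function on a three-level dictionary. The `deep` data is made up purely for illustration; it is not from the question.

```python
def unroll(data):
    if isinstance(data, dict):
        for key, value in data.items():
            # Prepend this level's key to every row produced below it.
            for row in unroll(value):
                yield [key] + row
    elif isinstance(data, list):
        # A list is the bottom of the structure: it is one row's values.
        yield data

# Hypothetical input with mixed depth (three levels under 'a', one under 'b').
deep = {'a': {'x': {1: [10, 20]}, 'y': {2: [30]}},
        'b': [40, 50]}

rows = list(unroll(deep))
for row in rows:
    print(row)
```

Each row starts with the chain of keys that led to the list, followed by the list's items. Note that a value that is neither a dict nor a list (a tuple, say) yields nothing, so it would be dropped silently.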
Because unroll produces a list of lists rather than dicts, the columns will be named numerically (from 0 to 5 in this case). So you need to use rename to get the column labels you want:
df.rename(columns=lambda i: 'col{}'.format(i+1))
This returns the following result (note that the additional string3 entry is also unrolled).
   col1     col2   col3  col4   col5    col6
0  string1  69     1231  232.0  NaN     NaN
1  string1  67     682   12.0   NaN     NaN
2  string1  65     1     1.0    NaN     NaN
3  string2  28672  82    23.0   NaN     NaN
4  string2  22736  82    93.0   1102.0  102.0
5  string2  19423  64    23.0   NaN     NaN
6  string3  101    102   NaN    NaN     NaN
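The question mentions filling the unused col5 and col6 cells with zeros. A minimal sketch of that follow-up step, assuming every list item is an integer (as in the sample data) and using only the two-level entries; the `value_cols` name is introduced here for illustration:

```python
import pandas as pd

def unroll(data):
    if isinstance(data, dict):
        for key, value in data.items():
            for row in unroll(value):
                yield [key] + row
    elif isinstance(data, list):
        yield data

nested_dict = {
    'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]},
}

df = pd.DataFrame(list(unroll(nested_dict)))
df = df.rename(columns=lambda i: 'col{}'.format(i + 1))

# Replace the NaN padding with zeros, then cast the value columns
# (everything after the two key columns) back to integers.
df = df.fillna(0)
value_cols = list(df.columns[2:])
df = df.astype({col: int for col in value_cols})

print(df)
```

The cast is needed because the NaN padding forces the padded columns to float; once the gaps are filled, converting back to int removes the trailing ".0" from the values.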
This should give you the result you are looking for, although it's probably not the most elegant solution. There's probably a better (more pandas-idiomatic) way to do it.
I parsed your nested dict and built a list of dictionaries (one for each row).
import pandas as pd

# some sample input
nested_dict = {
    'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]},
    'string3': {28673: [83, 24], 22737: [83, 94, 1103, 103], 19424: [65, 24]}
}

# new_list holds one dict per output row
new_list = []
for k1 in nested_dict:
    curr_dict = nested_dict[k1]
    for k2 in curr_dict:
        # the two keys fill col1 and col2
        new_dict = {'col1': k1, 'col2': k2}
        # each list item gets its own column, starting at col3
        new_dict.update({'col%d' % (i + 3): curr_dict[k2][i] for i in range(len(curr_dict[k2]))})
        new_list.append(new_dict)

# create a DataFrame from new_list
df = pd.DataFrame(new_list)
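The output (row order follows the dictionaries' iteration order, so on Python 3.7+ the rows would appear in insertion order instead):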
   col1     col2   col3  col4  col5    col6
0  string2  28672  82    23    NaN     NaN
1  string2  22736  82    93    1102.0  102.0
2  string2  19423  64    23    NaN     NaN
3  string3  19424  65    24    NaN     NaN
4  string3  28673  83    24    NaN     NaN
5  string3  22737  83    94    1103.0  103.0
6  string1  65     1     1     NaN     NaN
7  string1  67     682   12    NaN     NaN
8  string1  69     1231  232   NaN     NaN
This assumes that the input always contains enough data to create a col1 and a col2.
I loop through nested_dict, assuming that each of its values is also a dictionary, and then loop through that inner dictionary (curr_dict) as well. The keys k1 and k2 are used to populate col1 and col2. For the remaining columns, we iterate through the list contents and add one column per element.
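The same loop can be written more compactly by building each row as a flat list and naming the columns afterwards. This is only a sketch of one alternative, not necessarily the "more pandas" way the answer has in mind; it assumes exactly two levels of nesting, and the `rows` name is introduced here for illustration.

```python
import pandas as pd

nested_dict = {
    'string1': {69: [1231, 232], 67: [682, 12], 65: [1, 1]},
    'string2': {28672: [82, 23], 22736: [82, 93, 1102, 102], 19423: [64, 23]},
    'string3': {28673: [83, 24], 22737: [83, 94, 1103, 103], 19424: [65, 24]},
}

# One flat row per (outer key, inner key) pair: both keys, then the list items.
rows = [[k1, k2, *values]
        for k1, sub in nested_dict.items()
        for k2, values in sub.items()]

# pandas pads the shorter rows with NaN; the columns start out as 0..n-1.
df = pd.DataFrame(rows)
df.columns = ['col{}'.format(i + 1) for i in range(df.shape[1])]

print(df)
```

Compared with the list-of-dicts version, the column names are assigned once after construction rather than being rebuilt for every row.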