Flattening a nested dictionary with unique keys for each dictionary?
I have a dictionary that has the following format:
```
{'7453':
    {'2H':
        {'1155':
            {'in': [{'playerId': 281253}, {'playerId': 169212}],
             'out': [{'playerId': 449240}, {'playerId': 257943}]},
         '2011':
            {'in': [{'playerId': 449089}],
             'out': [{'playerId': 69374}]},
         '2568':
            {'in': [{'playerId': 481900}],
             'out': [{'playerId': 1735}]}}},
 '7454':
    {'1H':
        {'2833':
            {'in': [{'playerId': 56390}],
             'out': [{'playerId': 208089}]}},
     '2H':
        {'687':
            {'in': [{'playerId': 574}],
             'out': [{'playerId': 578855}]},
         '1627':
            {'in': [{'playerId': 477400}],
             'out': [{'playerId': 56386}]},
         '2725':
            {'in': [{'playerId': 56108}],
             'out': [{'playerId': 56383}]}}}}
```
I need the data in the following format (in a df): https://i.stack.imgur.com/GltRb.png
That means I would like to flatten my data so that each row has, for example, id: "7453", half: "2H", minute: "1155", type: "in", playerId: "281253". I also need one record per player, with each record still carrying all the other data (id, half, etc.).
I have been struggling with this for days and can't seem to find a solution for this particular problem. Until now I have been able to solve similar cases using either pd.json_normalize() or flatten_json(), but neither works for me in this case. If anyone could point me in the right direction or write some code that solves this, it would be much appreciated!
FYI: The biggest struggle I have is that I actually need a header/column for my keys.
pandas has explode to unwrap lists, but I am not aware of an equivalent method for dictionaries.
As your dictionary is extremely well structured, you can try
In [28]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode().apply(pd.Series).reset_index().rename(
    ...:     columns={'level_0': 'teamId', 'level_1': 'matchPeriod', 'level_2': 'eventSec', 'level_3': 'type'})
Out[28]:
teamId matchPeriod eventSec type playerId
0 7453 2H 1155 in 281253
1 7453 2H 1155 in 169212
2 7453 2H 1155 out 449240
3 7453 2H 1155 out 257943
4 7453 2H 2011 in 449089
.. ... ... ... ... ...
11 7454 2H 1627 out 56386
12 7454 2H 2725 in 56108
13 7454 2H 2725 out 56383
14 7454 1H 2833 in 56390
15 7454 1H 2833 out 208089
Although it is extremely ugly, chaining the Series constructor and stack builds up the DataFrame level by level.
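If the chained one-liner is hard to read, the same records can also be produced with plain Python loops. This is only a sketch of an alternative, not the pandas approach shown above: the function name flatten_substitutions and the shortened sample data are made up for illustration, and it assumes every branch of the dictionary has exactly the five levels from the question.

```python
def flatten_substitutions(data):
    """Return one flat record per player.

    Assumes the fixed nesting from the question:
    teamId -> matchPeriod -> eventSec -> type -> list of player dicts.
    """
    records = []
    for team_id, periods in data.items():
        for match_period, events in periods.items():
            for event_sec, types in events.items():
                for sub_type, players in types.items():
                    for player in players:
                        # Copy the keys of all outer levels into the record,
                        # then merge in the innermost dict ({'playerId': ...}).
                        records.append({
                            'teamId': team_id,
                            'matchPeriod': match_period,
                            'eventSec': event_sec,
                            'type': sub_type,
                            **player,
                        })
    return records


# Shortened sample in the same shape as the question's data.
sample = {
    '7453': {'2H': {'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
                             'out': [{'playerId': 449240}, {'playerId': 257943}]}}},
    '7454': {'1H': {'2833': {'in': [{'playerId': 56390}],
                             'out': [{'playerId': 208089}]}}},
}
records = flatten_substitutions(sample)
```

Passing the resulting list of dictionaries to pd.DataFrame(records) should give one row per player with a column for each key.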
Update: In principle you can pass a dictionary to the DataFrame and Series constructors
In [2]: d
Out[2]:
{'7453': {'2H': {'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
'out': [{'playerId': 449240}, {'playerId': 257943}]},
'2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
'2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}},
'7454': {'1H': {'2833': {'in': [{'playerId': 56390}],
'out': [{'playerId': 208089}]}},
'2H': {'687': {'in': [{'playerId': 574}], 'out': [{'playerId': 578855}]},
'1627': {'in': [{'playerId': 477400}], 'out': [{'playerId': 56386}]},
'2725': {'in': [{'playerId': 56108}], 'out': [{'playerId': 56383}]}}}}
In [3]: pd.DataFrame(d)
Out[3]:
7453 7454
2H {'1155': {'in': [{'pl... {'687': {'in': [{'pla...
1H NaN {'2833': {'in': [{'pl...
In [4]: pd.Series(d)
Out[4]:
7453 {'2H': {'1155': {'in'...
7454 {'1H': {'2833': {'in'...
dtype: object
As they are 2-dimensional and 1-dimensional data structures respectively, they also expect dictionaries with two levels and one level of nesting respectively. The DataFrame interprets your 'teamId' as columns and 'matchPeriod' as index, and the cell values are the remaining nested dictionaries, such as
In [5]: d['7453']['2H']
Out[5]:
{'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
'out': [{'playerId': 449240}, {'playerId': 257943}]},
'2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
'2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}
The Series behaves the same way, but with only one level.
In [6]: d['7453']
Out[6]:
{'2H': {'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
'out': [{'playerId': 449240}, {'playerId': 257943}]},
'2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
'2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}}
This value is your first level. It is a dictionary again, so you can pass it to the Series constructor as well
In [7]: pd.Series(d['7453'])
Out[7]:
2H {'1155': {'in': [{'pl...
dtype: object
The apply function allows you to do this for every row of the Series
In [8]: pd.Series(d).apply(pd.Series)
Out[8]:
2H 1H
7453 {'1155': {'in': [{'pl... NaN
7454 {'687': {'in': [{'pla... {'2833': {'in': [{'pl...
Now you arrive at the same result as with the DataFrame constructor, only transposed. pandas expands the result: each value of the original Series now becomes its own Series, and the index of those Series is used as column labels. By calling stack you tell pandas to give you a Series instead and to stack all the labels into a MultiIndex if needed.
In [9]: pd.Series(d).apply(pd.Series).stack()
Out[9]:
7453 2H {'1155': {'in': [{'pl...
7454 2H {'687': {'in': [{'pla...
1H {'2833': {'in': [{'pl...
dtype: object
Now you again have a Series (with a two-level index) where each value is a dictionary which, again, can be passed to the Series constructor. So if you repeat this chain of apply(pd.Series).stack() you get
In [10]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack()
Out[10]:
7453 2H 1155 {'in': [{'playerId': ...
2011 {'in': [{'playerId': ...
2568 {'in': [{'playerId': ...
7454 2H 687 {'in': [{'playerId': ...
1627 {'in': [{'playerId': ...
2725 {'in': [{'playerId': ...
1H 2833 {'in': [{'playerId': ...
dtype: object
Now you again have a Series (with a three-level index) where each value is a dictionary which, again, can be passed to the Series constructor.
In [11]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack()
Out[11]:
7453 2H 1155 in [{'playerId': 281253}...
out [{'playerId': 449240}...
2011 in [{'playerId': 449089}]
out [{'playerId': 69374}]
2568 in [{'playerId': 481900}]
out [{'playerId': 1735}]
7454 2H 687 in [{'playerId': 574}]
out [{'playerId': 578855}]
1627 in [{'playerId': 477400}]
out [{'playerId': 56386}]
2725 in [{'playerId': 56108}]
out [{'playerId': 56383}]
1H 2833 in [{'playerId': 56390}]
out [{'playerId': 208089}]
dtype: object
This is a special case, as your values are now no longer dictionaries but lists (most with one element, some with two). For lists (and unfortunately not for dictionaries) pandas has the explode() method, which creates a new row for each list element.
In [13]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode()
Out[13]:
7453 2H 1155 in {'playerId': 281253}
in {'playerId': 169212}
out {'playerId': 449240}
out {'playerId': 257943}
2011 in {'playerId': 449089}
...
7454 2H 1627 out {'playerId': 56386}
2725 in {'playerId': 56108}
out {'playerId': 56383}
1H 2833 in {'playerId': 56390}
out {'playerId': 208089}
dtype: object
This unpacks each list. Now you again have a Series (with a four-level index) where each value is a dictionary which, again, can be passed to the Series constructor.
In [14]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode().apply(pd.Series).stack()
Out[14]:
7453 2H 1155 in playerId 281253
playerId 169212
out playerId 449240
playerId 257943
2011 in playerId 449089
...
7454 2H 1627 out playerId 56386
2725 in playerId 56108
out playerId 56383
1H 2833 in playerId 56390
out playerId 208089
dtype: int64
With these five iterations of applying the Series constructor to your dictionary and reshaping the data until you can apply it again, your dictionary is fully unpacked.
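The repetition above could also be written as a loop. The following is a minimal sketch, not part of the session shown here: the helper name unpack_fully and the shortened sample data are hypothetical, and it assumes that all values at a given depth have the same type, so only the first value is inspected. The explicit dropna() is there because, as far as I know, whether stack drops the NaN gaps by itself depends on the pandas version.

```python
import pandas as pd


def unpack_fully(nested):
    """Repeat the unpacking step until the values are neither dicts nor lists.

    Assumption: every value at a given depth has the same type.
    """
    s = pd.Series(nested)
    while len(s) > 0:
        first = s.iloc[0]
        if isinstance(first, dict):
            # Each dict becomes a row; stack() moves its keys into a new
            # index level. dropna() removes the gaps left by keys that
            # exist only in some of the dictionaries.
            s = s.apply(pd.Series).stack().dropna()
        elif isinstance(first, list):
            # One row per list element, the index entries are repeated.
            s = s.explode()
        else:
            break
    return s


# Shortened sample in the same shape as the question's data.
sample = {
    '7453': {'2H': {'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
                             'out': [{'playerId': 449240}, {'playerId': 257943}]}}},
    '7454': {'1H': {'2833': {'in': [{'playerId': 56390}],
                             'out': [{'playerId': 208089}]}}},
}
flat = unpack_fully(sample)
```

The result is a Series with a five-level index, in the same shape as the output of the long chain, so the reset_index step applies to it unchanged.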
In order to match your desired result you can turn all levels of the index into columns with reset_index.
In [15]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode().apply(pd.Series).stack().reset_index()
Out[15]:
level_0 level_1 level_2 level_3 level_4 0
0 7453 2H 1155 in playerId 281253
1 7453 2H 1155 in playerId 169212
2 7453 2H 1155 out playerId 449240
3 7453 2H 1155 out playerId 257943
4 7453 2H 2011 in playerId 449089
.. ... ... ... ... ... ...
11 7454 2H 1627 out playerId 56386
12 7454 2H 2725 in playerId 56108
13 7454 2H 2725 out playerId 56383
14 7454 1H 2833 in playerId 56390
15 7454 1H 2833 out playerId 208089
Neither the Series nor the index levels had names. By default pandas uses the column number (0) for the values (which should be 'playerId') and level_0 to level_4 for the index levels. One way to set these appropriately is to rename the Series before calling reset_index and to rename the level columns with rename afterwards.
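A small sketch of that renaming step, on a hand-built Series rather than the full data. The three rows and the variable names are only for illustration; the column names are the ones used in the one-liner at the top of this answer.

```python
import pandas as pd

# A tiny Series with an unnamed four-level index, standing in for the
# unpacked data (values are the player ids).
index = pd.MultiIndex.from_tuples([
    ('7453', '2H', '1155', 'in'),
    ('7453', '2H', '1155', 'out'),
    ('7454', '1H', '2833', 'in'),
])
s = pd.Series([281253, 449240, 56390], index=index)

df = (
    s.rename('playerId')   # name the values, so the column is not called 0
     .reset_index()        # index levels become columns level_0 ... level_3
     .rename(columns={
         'level_0': 'teamId',
         'level_1': 'matchPeriod',
         'level_2': 'eventSec',
         'level_3': 'type',
     })
)
```

Naming the Series first means reset_index has a proper label for the value column, and only the generated level_N columns need to be renamed afterwards.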
I hope that helps