简体   繁体   中英

Flattening a nested dictionary with unique keys for each dictionary?

I have a dictionary that has the following format:

´´´{'7453': 
      {'2H': 
         {'1155': 
            {'in': [{'playerId': 281253}, {'playerId': 169212}], 
            'out': [{'playerId': 449240}, {'playerId': 257943}]},
          '2011': 
            {'in': [{'playerId': 449089}], 
            'out': [{'playerId': 69374}]}, 
          '2568': 
            {'in': [{'playerId': 481900}], 
            'out': [{'playerId': 1735}]}}}, 
    '7454': 
       {'1H': 
          {'2833': 
             {'in': [{'playerId': 56390}], 
             'out': [{'playerId': 208089}]}}, 
        '2H': 
          {'687': 
             {'in': [{'playerId': 574}], 
             'out': [{'playerId': 578855}]}, 
          '1627': 
             {'in': [{'playerId': 477400}], 
             'out': [{'playerId': 56386}]}, 
          '2725': 
             {'in': [{'playerId': 56108}], 
             'out': [{'playerId': 56383}]}}}}
´´´

I need the data in the following format (in a df): https://i.stack.imgur.com/GltRb.png

That means that I would like to flatten my data so that I have the id: "7453", half: "H2", minute: "2011", type: "out", playerId: "281253". Also, I need one record per player, but that still has all the other data (id, half etc.)

I have been struggling with this for days, and can't seem to find any solution for this particular problem. Until now I have been able to solve it either using, pd.json_normalize() or flatten_json(). But it just doesn't make it for me, in this case. If anyone could point me in the right direction or write some code that could solve this, it would be much appreciated!

FYI: The biggest struggle I have is that I actually need a header/column for my keys.

pandas has explode to unwrap lists but I am not aware of a method for dictionaries.

As your dictionary is extremely well structured, you can try

[28]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd
    ...: .Series).stack().explode().apply(pd.Series).reset_index().rename(column
    ...: s={'level_0': 'teamId', 'level_1': 'matchPeriod', 'level_2': 'eventSec'
    ...: , 'level_3': 'type'})                                                  
Out[28]: 
   teamId matchPeriod eventSec type  playerId
0    7453          2H     1155   in    281253
1    7453          2H     1155   in    169212
2    7453          2H     1155  out    449240
3    7453          2H     1155  out    257943
4    7453          2H     2011   in    449089
..    ...         ...      ...  ...       ...
11   7454          2H     1627  out     56386
12   7454          2H     2725   in     56108
13   7454          2H     2725  out     56383
14   7454          1H     2833   in     56390
15   7454          1H     2833  out    208089

Although extremely ugly chaining the Series constructor and stack will build up the DataFrame level by level.

Update: In principle you can pass a dictionary to the DataFrame and Series constructors

In [2]: d                                                                                                                                                                                                  
Out[2]: 
{'7453': {'2H': {'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
    'out': [{'playerId': 449240}, {'playerId': 257943}]},
   '2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
   '2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}},
 '7454': {'1H': {'2833': {'in': [{'playerId': 56390}],
    'out': [{'playerId': 208089}]}},
  '2H': {'687': {'in': [{'playerId': 574}], 'out': [{'playerId': 578855}]},
   '1627': {'in': [{'playerId': 477400}], 'out': [{'playerId': 56386}]},
   '2725': {'in': [{'playerId': 56108}], 'out': [{'playerId': 56383}]}}}}

In [3]: pd.DataFrame(d)                                                                                                                                                                                    
Out[3]: 
                        7453                      7454
2H  {'1155': {'in': [{'pl...  {'687': {'in': [{'pla...
1H                       NaN  {'2833': {'in': [{'pl...

In [4]: pd.Series(d)                                                                                                                                                                                       
Out[4]: 
7453    {'2H': {'1155': {'in'...
7454    {'1H': {'2833': {'in'...
dtype: object

As they are 2-dimensional and 1-dimensional data structures respectively, they also expect a dictionary with 2 and 1 level deep nesting respectively. The DataFrame interprets your 'teamId' as index and 'matchPeriod' as columns and the values are the values of the dictionaries like in

In [5]: d['7453']['2H']                                                                                                                                                                                    
Out[5]: 
{'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
  'out': [{'playerId': 449240}, {'playerId': 257943}]},
 '2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
 '2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}

The Series behaves the same way, but with only one level.

In [6]: d['7453']                                                                                                                                                                                          
Out[6]: 
{'2H': {'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
   'out': [{'playerId': 449240}, {'playerId': 257943}]},
  '2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
  '2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}}

is your first level. Now this is a dictionary again, so you can pass it the the Series constructor as well

In [7]: pd.Series(d['7453'])                                                                                                                                                                               
Out[7]: 
2H    {'1155': {'in': [{'pl...
dtype: object

The apply function allows you to do this for every row of the Series

In [8]: pd.Series(d).apply(pd.Series)                                                                                                                                                                      
Out[8]: 
                            2H                        1H
7453  {'1155': {'in': [{'pl...                       NaN
7454  {'687': {'in': [{'pla...  {'2833': {'in': [{'pl...

Now you arrive at the same result as with the DataFrame constructor. This is called broadcasting. Each value of the original Series no becomes its own Series and the index is used as column labels. By calling stack you intead tell pandas to give you a series intead and stack all the labels to a MultiIndex if needed.

In [9]: pd.Series(d).apply(pd.Series).stack()                                                                                                                                                              
Out[9]: 
7453  2H    {'1155': {'in': [{'pl...
7454  2H    {'687': {'in': [{'pla...
      1H    {'2833': {'in': [{'pl...
dtype: object

Now you again have a Series (with a 2d index) where each value is a dictionary which - again - can be passed to the Series constructor. So if you repeat this chain of apply(pd.Series).stack() you get

In [10]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack()                                                                                                                                    
Out[10]: 
7453  2H  1155    {'in': [{'playerId': ...
          2011    {'in': [{'playerId': ...
          2568    {'in': [{'playerId': ...
7454  2H  687     {'in': [{'playerId': ...
          1627    {'in': [{'playerId': ...
          2725    {'in': [{'playerId': ...
      1H  2833    {'in': [{'playerId': ...
dtype: object

Now you again have a Series (with a 3d index) where each value is a dictionary which - again - can be passed to the Series constructor.

In [11]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack()                                                                                                           
Out[11]: 
7453  2H  1155  in     [{'playerId': 281253}...
                out    [{'playerId': 449240}...
          2011  in       [{'playerId': 449089}]
                out       [{'playerId': 69374}]
          2568  in       [{'playerId': 481900}]
                out        [{'playerId': 1735}]
7454  2H  687   in          [{'playerId': 574}]
                out      [{'playerId': 578855}]
          1627  in       [{'playerId': 477400}]
                out       [{'playerId': 56386}]
          2725  in        [{'playerId': 56108}]
                out       [{'playerId': 56383}]
      1H  2833  in        [{'playerId': 56390}]
                out      [{'playerId': 208089}]
dtype: object

This is a special case as now your values are no longer dictionaries but lists (with one element each). For lists (and unfortunately not for dictionaries) there is the explode() method in pandas to create a new row for each list element.

In [13]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode()                                                                                                 
Out[13]: 
7453  2H  1155  in     {'playerId': 281253}
                in     {'playerId': 169212}
                out    {'playerId': 449240}
                out    {'playerId': 257943}
          2011  in     {'playerId': 449089}
                               ...         
7454  2H  1627  out     {'playerId': 56386}
          2725  in      {'playerId': 56108}
                out     {'playerId': 56383}
      1H  2833  in      {'playerId': 56390}
                out    {'playerId': 208089}
dtype: object

unpacks each list. Now you again have a Series (with a 4d index) where each value is a dictionary which - again - can be passed to the Series constructor.

In [14]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode().apply(pd.Series).stack()                                                                        
Out[14]: 
7453  2H  1155  in   playerId    281253
                     playerId    169212
                out  playerId    449240
                     playerId    257943
          2011  in   playerId    449089
                                  ...  
7454  2H  1627  out  playerId     56386
          2725  in   playerId     56108
                out  playerId     56383
      1H  2833  in   playerId     56390
                out  playerId    208089
dtype: int64

With these five iterations of applying the Series constructor to your dictionary and reshaping the data until you can apply it again, you got your dictionary fully unpacked.

In order to match your desired result you can make all levels of the index to a column with reset_index .

In [15]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode().apply(pd.Series).stack().reset_index()                                                          
Out[15]: 
   level_0 level_1 level_2 level_3   level_4       0
0     7453      2H    1155      in  playerId  281253
1     7453      2H    1155      in  playerId  169212
2     7453      2H    1155     out  playerId  449240
3     7453      2H    1155     out  playerId  257943
4     7453      2H    2011      in  playerId  449089
..     ...     ...     ...     ...       ...     ...
11    7454      2H    1627     out  playerId   56386
12    7454      2H    2725      in  playerId   56108
13    7454      2H    2725     out  playerId   56383
14    7454      1H    2833      in  playerId   56390
15    7454      1H    2833     out  playerId  208089

Neither the Series nor the index levels had names. By default it uses the column number ( 0 ) for the values (which should be 'playerId') and level_0 to level_4 for the index levels. In order to set these appropriately one way is to rename the Series before calling reset_index and rename the levels with rename afterwards.

I hope that helps

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM