简体   繁体   中英

Convert pandas hierarchical DataFrame to JSON

I have a pandas DataFrame that contains hierarchical elements from my database. parent_id columns tells me to what parent each child belongs. I would like to generate a nested JSON from this DataFrame (see section 3 for needed format). I tried many things, but I am mainly stuck on the _children list field in each parent, which should contain all its children. Any help is more than welcome, Thanks!

1. pandas DataFrame (WHAT I HAVE)

parent_id    child_id   level
125582       214659     1
125582       214633     1
125582       214263     1
214263       131673     2
214263       125579     2
214263       125578     2
214263       172670     2
214263       214266     2
214266       216675     3
214266       216671     3
172670       172669     3
172670       174777     3
172670       207661     3
207661       216529     4
207661       223884     4
223884       223885     5
223885       229186     6
229186       219062     7
229186       222243     7
and so on...

2. DataFrame logic

125582 (level 0 - root)
--214659 (level 1)
--214633 (level 1)
--214263 (level 1)
----131673 (level 2)
----125579 (level 2)
----125578 (level 2)
----172670 (level 2)
------172669 (level 3)
------174777 (level 3)
------207661 (level 3)
--------216529 (level 4)
--------223884 (level 4)
----------223885 (level 5)
------------229186 (level 6)
--------------219062 (level 7)    
--------------222243 (level 7)
----214266 (level 2)
------216675 (level 3)
------216671 (level 3)
and so on...

3. JSON (WHAT I NEED)

[
    {id: 125582, level: 0, "_children": [
        {id: 214659, level: 1},
        {id: 214633, level: 1},
        {id: 214263, level: 1, "_children": [
            {id: 131673, level: 2},
            {id: 125579, level: 2},
            {id: 125578, level: 2},
            {id: 172670, level: 2, "_children": [
                {id: 172669, level: 3},
                {id: 174777, level: 3},
                {id: 207661, level: 3, "_children": [
                    {id: 216529, level: 4},
                    {id: 223884, level: 4, "_children": [
                        {id: 223885, level: 5, "_children": [
                            {id: 229186, level: 6, "_children": [
                                {id: 219062, level: 7},
                                {id: 222243, level: 7}
                            ]}
                        ]}
                    ]}
                ]}
            ]},
            {id: 214266, level: 2, "_children": [
                {id: 216675, level: 3},
                {id: 216671, level: 3}
            ]}
        ]},
        and so on...
    ]}
]

This hierarchical json has parent->child relationship. Therefore, I suggest you use networkx package to construct a graph g from the dateframe. From graph g create the desired json.

You need to pip install networkx following this instruction

After that, do the following steps:

  • import networkx as nx
  • create the directed graph g using from_pandas_edgelist
  • create a dictionary d with unique key on parent_id and child_id with values being as level
  • assign d to nodes attributes level of graph g
  • finally, call nx.tree_data on g

The code as follows:

import networkx as nx

#create the directed graph `g` using `from_pandas_edgelist`
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'], 
                            create_using=nx.DiGraph())

#create a dictionary with unique key on `parent_id` and `child_id`
df1 = df.melt('level')
d = dict(zip(df1.value, df1.level))
root_id = df.loc[0, 'parent_id']
d[root_id] = 0  #set level of root to `0`

#assign `d` to nodes attributes `level` of graph `g`
nx.set_node_attributes(g, d, 'level')    #add `level` values to nodes of `g`

out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]

Out[119]:
[{'level': 0,
  'id': 125582,
  '_children': [{'level': 1, 'id': 214659},
   {'level': 1, 'id': 214633},
   {'level': 1,
    'id': 214263,
    '_children': [{'level': 2, 'id': 131673},
     {'level': 2, 'id': 125579},
     {'level': 2, 'id': 125578},
     {'level': 2,
      'id': 172670,
      '_children': [{'level': 3, 'id': 172669},
       {'level': 3, 'id': 174777},
       {'level': 3,
        'id': 207661,
        '_children': [{'level': 4, 'id': 216529},
         {'level': 4,
          'id': 223884,
          '_children': [{'level': 5,
            'id': 223885,
            '_children': [{'level': 6,
              'id': 229186,
              '_children': [{'level': 7, 'id': 219062},
               {'level': 7, 'id': 222243}]}]}]}]}]},
     {'level': 2,
      'id': 214266,
      '_children': [{'level': 3, 'id': 216675},
       {'level': 3, 'id': 216671}]}]}]}]

Note : the output above is the dictionary inside of list as your desired output. My solution output puts the key level in front of key id , but the hierarchical/structural of sub-dictionaries is the same as your desired output


On the other hand, if you don't need the level values in the desired output, just directly use the built-in nx.tree_data function to return the the output as follows

import networkx as nx

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'], 
                            create_using=nx.DiGraph())

root_id = df.loc[0, 'parent_id']    
out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]

Out[168]:
[{'id': 125582,
  '_children': [{'id': 214659},
   {'id': 214633},
   {'id': 214263,
    '_children': [{'id': 131673},
     {'id': 125579},
     {'id': 125578},
     {'id': 172670,
      '_children': [{'id': 172669},
       {'id': 174777},
       {'id': 207661,
        '_children': [{'id': 216529},
         {'id': 223884,
          '_children': [{'id': 223885,
            '_children': [{'id': 229186,
              '_children': [{'id': 219062}, {'id': 222243}]}]}]}]}]},
     {'id': 214266, '_children': [{'id': 216675}, {'id': 216671}]}]}]}]

Update :

Use this modified function of nx.tree_data to overcome the restriction on single parent per child. Add this customized function to your code and call it instead of nx.tree_data

from itertools import chain

def tree_data_custom(G, root):
    id_ = 'id'
    children = '_children'
    def add_children(n, G):
        nbrs = G[n]
        if len(nbrs) == 0:
            return []
        children_ = []
        for child in nbrs:
            d = dict(chain(G.nodes[child].items(), [(id_, child)]))
            c = add_children(child, G)
            if c:
                d[children] = c
            children_.append(d)
        return children_

    data = dict(chain(G.nodes[root].items(), [(id_, root)]))    
    data[children] = add_children(root, G)
    return data

Instead of call out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})] , just call tree_data_custom as follows

out =  [tree_data_custom(g, root_id)]

Update 2 : Adding name column to dataframe

sample df where children has multiple parents

Out[258]:
    parent_id  child_id  level name
0      125582    214659      1   a1
1      125582    214633      1   a1
2      125582    214263      1   a1
3      214263    131673      2   a2
4      214263    125579      2   a2
5      214263    125578      2   a2
6      214263    172670      2   a2
7      214263    214266      2   a2
8      214266    216675      3   a3
9      214266    216671      3   a3
10     172670    172669      3   a3
11     172670    174777      3   a3
12     172670    207661      3   a3
13     207661    216529      4   a4
14     207661    223884      4   a4
15     223884    223885      5   a5
16     223885    229186      6   a6
17     229186    219062      7   a7
18     229186    222243      7   a7
19     222243    219187      8   a8
20     222243    245985      8   a8
21     222243    232393      8   a8
22     222243    247138      8   a8
23     222243    228848      8   a8
24     222243    228848      8   a8
25     222243    233920      8   a8
26     222243    233920      8   a8
27     222243    228113      8   a8
28     222243    233767      8   a8
29     222243    235407      8   a8
30     222243    237757      8   a8
31     222243    159091      8   a8
32     222243    159091      8   a8
33     222243    214832      8   a8
34     222243    253990      8   a8
35     222243    231610      8   a8
36     222243    231610      8   a8
37     222243    182323      8   a8
38     222243    242190      8   a8
39     222243    143580      8   a8
40     222243    242188      8   a8
41     222243    143581      8   a8
42     222243    242187      8   a8
43     222243    143582      8   a8
44     222243    242189      8   a8
45     222243    205877      8   a8
46     222243    242823      8   a8
47     222243    140979      8   a8
48     222243    237824      8   a8
49     222243    149933      8   a8
50     222243    149933      8   a8
51     222243    153625      8   a8
52     222243      8392      8   a8
53     222243    162085      8   a8
54     222243    162085      8   a8
55     222243    150691      8   a8
56     222243    147773      8   a8
57     222243    147773      8   a8
58     222243     61070      8   a8
59     222243     61070      8   a8
60     222243    204850      8   a8
61     222243    204850      8   a8
62      61070     46276      9   a9
63      61070     46276      9   a9
64      61070     46276      9   a9
65      61070     46276      9   a9
66     143580    159911      9   a9
67     143580     38958      9   a9
68     182323    159911      9   a9

The change is minimal, you only need to modify the step create dictionaries for nodes attributes of g . Currently, we only add attribute level to g . Now, we need to create another dictionary for name and add it to the new attribute name of nodes of g

#create the directed graph `g` using `from_pandas_edgelist`. 
#You don't need `[level]` in this step
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

#create a dictionary with unique key on `parent_id` and `child_id`
#`melt` keep 2 columns 'level', 'name' instead of one column 'level'
#dictionary `d_level` for attribute `level` of `g's` nodes
#dictionary `d_name` for attribute `level` of `g's` nodes
df1 = df.melt(['level', 'name'])
d_level = dict(zip(df1.value, df1.level))
d_name = dict(zip(df1.value, df1.name))
root_id = df.loc[0, 'parent_id']
d_level[root_id] = 0  #set `level` of root to `0`

#assign `d_level` to nodes attributes `level` of graph `g` and `d_name` for `name`
nx.set_node_attributes(g, d_level, 'level')    #add `level` values to nodes of `g`
nx.set_node_attributes(g, d_name, 'name')    #add `name` values to nodes of `g`

#use customize `tree_data_custom` defined previously
out = [tree_data_custom(g, root_id)] 

If you adding many more columns to each row, it is better to create a single dictionary for all columns and apply one time to nodes of g as follows

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

df1 = df.melt(['level', 'name'])
#this single dictionary to create both `level` and `name` attributes of nodes of `g`
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.value, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0  #set level of root to `0`

nx.set_node_attributes(g, d)    #a single add for both attributes `level`, `name`

out = [tree_data_custom(g, root_id)] 

Output

Out[260]:
[{'level': 0,
  'name': 'a1',
  'id': 125582,
  '_children': [{'level': 1, 'name': 'a1', 'id': 214659},
   {'level': 1, 'name': 'a1', 'id': 214633},
   {'level': 1,
    'name': 'a1',
    'id': 214263,
    '_children': [{'level': 2, 'name': 'a2', 'id': 131673},
     {'level': 2, 'name': 'a2', 'id': 125579},
     {'level': 2, 'name': 'a2', 'id': 125578},
     {'level': 2,
      'name': 'a2',
      'id': 172670,
      '_children': [{'level': 3, 'name': 'a3', 'id': 172669},
       {'level': 3, 'name': 'a3', 'id': 174777},
       {'level': 3,
        'name': 'a3',
        'id': 207661,
        '_children': [{'level': 4, 'name': 'a4', 'id': 216529},
         {'level': 4,
          'name': 'a4',
          'id': 223884,
          '_children': [{'level': 5,
            'name': 'a5',
            'id': 223885,
            '_children': [{'level': 6,
              'name': 'a6',
              'id': 229186,
              '_children': [{'level': 7, 'name': 'a7', 'id': 219062},
               {'level': 7,
                'name': 'a7',
                'id': 222243,
                '_children': [{'level': 8, 'name': 'a8', 'id': 219187},
                 {'level': 8, 'name': 'a8', 'id': 245985},
                 {'level': 8, 'name': 'a8', 'id': 232393},
                 {'level': 8, 'name': 'a8', 'id': 247138},
                 {'level': 8, 'name': 'a8', 'id': 228848},
                 {'level': 8, 'name': 'a8', 'id': 233920},
                 {'level': 8, 'name': 'a8', 'id': 228113},
                 {'level': 8, 'name': 'a8', 'id': 233767},
                 {'level': 8, 'name': 'a8', 'id': 235407},
                 {'level': 8, 'name': 'a8', 'id': 237757},
                 {'level': 8, 'name': 'a8', 'id': 159091},
                 {'level': 8, 'name': 'a8', 'id': 214832},
                 {'level': 8, 'name': 'a8', 'id': 253990},
                 {'level': 8, 'name': 'a8', 'id': 231610},
                 {'level': 8,
                  'name': 'a8',
                  'id': 182323,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 159911}]},
                 {'level': 8, 'name': 'a8', 'id': 242190},
                 {'level': 8,
                  'name': 'a8',
                  'id': 143580,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 159911},
                   {'level': 9, 'name': 'a9', 'id': 38958}]},
                 {'level': 8, 'name': 'a8', 'id': 242188},
                 {'level': 8, 'name': 'a8', 'id': 143581},
                 {'level': 8, 'name': 'a8', 'id': 242187},
                 {'level': 8, 'name': 'a8', 'id': 143582},
                 {'level': 8, 'name': 'a8', 'id': 242189},
                 {'level': 8, 'name': 'a8', 'id': 205877},
                 {'level': 8, 'name': 'a8', 'id': 242823},
                 {'level': 8, 'name': 'a8', 'id': 140979},
                 {'level': 8, 'name': 'a8', 'id': 237824},
                 {'level': 8, 'name': 'a8', 'id': 149933},
                 {'level': 8, 'name': 'a8', 'id': 153625},
                 {'level': 8, 'name': 'a8', 'id': 8392},
                 {'level': 8, 'name': 'a8', 'id': 162085},
                 {'level': 8, 'name': 'a8', 'id': 150691},
                 {'level': 8, 'name': 'a8', 'id': 147773},
                 {'level': 8,
                  'name': 'a8',
                  'id': 61070,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 46276}]},
                 {'level': 8, 'name': 'a8', 'id': 204850}]}]}]}]}]}]},
     {'level': 2,
      'name': 'a2',
      'id': 214266,
      '_children': [{'level': 3, 'name': 'a3', 'id': 216675},
       {'level': 3, 'name': 'a3', 'id': 216671}]}]}]}]

Update 3 : handle columns parent_name and child_name

You need to melt at once 4 columns parent_name , child_name , parent_id , parent_id . It is functionality of wide_to_long

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

df1 = df.rename(lambda x: '_'.join(x.split('_')[::-1]), axis=1)
df1 = pd.wide_to_long(df1.reset_index(), stubnames=['id', 'name'],
                i='index', j='type', suffix='\w+', sep='_')
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.id, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0

nx.set_node_attributes(g, d)
out = [tree_data_custom(g, root_id)] 

Out[91]:
[{'level': 0,
  'name': 'word1',
  'id': 125582,
  '_children': [{'level': 1, 'name': 'word6', 'id': 214659},
   {'level': 1, 'name': 'word7', 'id': 214633},
   {'level': 1,
    'name': 'word2',
    'id': 214263,
    '_children': [{'level': 2, 'name': 'word8', 'id': 131673},
     {'level': 2, 'name': 'word9', 'id': 125579},
     {'level': 2, 'name': 'word10', 'id': 125578},
     {'level': 2,
      'name': 'word4',
      'id': 172670,
      '_children': [{'level': 3, 'name': 'word13', 'id': 172669},
       {'level': 3, 'name': 'word14', 'id': 174777},
       {'level': 3,
        'name': 'word5',
        'id': 207661,
        '_children': [{'level': 4, 'name': 'word15', 'id': 216529},
         {'level': 4,
          'name': 'word16',
          'id': 223884,
          '_children': [{'level': 5,
            'name': 'word17',
            'id': 223885,
            '_children': [{'level': 6,
              'name': 'word18',
              'id': 229186,
              '_children': [{'level': 7, 'name': 'word19', 'id': 219062},
               {'level': 7, 'name': 'word20', 'id': 222243}]}]}]}]}]},
     {'level': 2,
      'name': 'word3',
      'id': 214266,
      '_children': [{'level': 3, 'name': 'word11', 'id': 216675},
       {'level': 3, 'name': 'word12', 'id': 216671}]}]}]}]

It's not a pandas one-liner, but here's a recursive function that should do the trick.

import json

import pandas as pd

df = pd.DataFrame(
    [{"parent_id": 125582, "child_id": 214659, "level": 1}, {"parent_id": 125582, "child_id": 214633, "level": 1},
     {"parent_id": 125582, "child_id": 214263, "level": 1}, {"parent_id": 214263, "child_id": 131673, "level": 2},
     {"parent_id": 214263, "child_id": 125579, "level": 2}, {"parent_id": 214263, "child_id": 125578, "level": 2},
     {"parent_id": 214263, "child_id": 172670, "level": 2}, {"parent_id": 214263, "child_id": 214266, "level": 2},
     {"parent_id": 214266, "child_id": 216675, "level": 3}, {"parent_id": 214266, "child_id": 216671, "level": 3},
     {"parent_id": 172670, "child_id": 172669, "level": 3}, {"parent_id": 172670, "child_id": 174777, "level": 3},
     {"parent_id": 172670, "child_id": 207661, "level": 3}, {"parent_id": 207661, "child_id": 216529, "level": 4},
     {"parent_id": 207661, "child_id": 223884, "level": 4}, {"parent_id": 223884, "child_id": 223885, "level": 5},
     {"parent_id": 223885, "child_id": 229186, "level": 6}, {"parent_id": 229186, "child_id": 219062, "level": 7},
     {"parent_id": 229186, "child_id": 222243, "level": 7}]
)


def recurse(parent_id, level):
    # create the base result
    result = {"id": int(parent_id), "level": int(level)}

    # get all of the children of this parent, one level below this one
    children = df[(df.parent_id == parent_id) & (df["level"] == level + 1)]

    # if there are no such children, then return without a _children key
    if children.empty:
        return result

    # otherwise, recurse on each child_id
    result["_children"] = [recurse(child_id, level + 1) for child_id in sorted(children.child_id.unique())]

    return result


# start at an implicit level 0
tree = [recurse(parent_id, 0) for parent_id in sorted(df[df["level"] == 1].parent_id.unique())]

# convert to JSON
print(json.dumps(tree, indent=3, sort_keys=True))
id_to_obj = {}
roots = []
def parse(row):
    pid, cid, lvl = row.parent_id, row.child_id, row.level
    if pid not in id_to_obj:
        # parent not created yet: create parent as root
        id_to_obj[pid] = {
            'id': pid, 'level': 0
        }
        roots.append(pid)

    # Get parent object
    parent = id_to_obj[pid]
    if not '_children' in parent:
        parent['_children'] = []

    # Create child object and add to parent's children list
    child = {
        'id': cid, 'level': lvl
    }
    parent['_children'].append(child)
    id_to_obj[cid] = child

df.sort_values(by='level').apply(parse, axis=1)

# Result
[id_to_obj[pid] for pid in roots]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM