简体   繁体   中英

Python create tree from a JSON file

Let's say that we have the following JSON file. For the sake of the example it's emulated by a string. The string is the input and a Tree object should be the output. I'll be using the graphical notation of a tree to present the output.

I've found the following classes to handle tree concept in Python:

class TreeNode(object):
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)

    def __str__(self, level=0):
        ret = "\t"*level+repr(self.data)+"\n"
        for child in self.children:
            ret += child.__str__(level+1)
        return ret

    def __repr__(self):
        return '<tree node representation>'

class Tree:
    def __init__(self):
        self.root = TreeNode('ROOT')

    def __str__(self):
        return self.root.__str__()

The input file can be of different complexity:

Simple case

Input:

json_file = '{"item1": "end1", "item2": "end2"}'

Output:

"ROOT"
    item1
        end1
    item2
        end2

Embedded case

Input:

json_file = {"item1": "end1", "item2": {"item3": "end3"}}

Output:

"ROOT"
    item1
        end1
    item2
        item3
            end3

Array case

Input:

json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }

Output:

"ROOT"
    name
        John
    items
        1
            item_name
                lettuce
            price
                2.65
            units
                no
        2   
            item_name
                ketchup
            price
                1.51
            units
                litres

Please note that each item in an array is described with an integer (starting at 1).

So far I've managed to come up with the following function that solves the problem for the simple case. In terms of the embedded case I know that I must use recursion but so far I get UnboundLocalError: local variable 'tree' referenced before assignment .

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()
        node_0 = TreeNode("ROOT")
        tree.root = node_0
        parent = node_0
    else:
        parent = parent

    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            create_tree_from_JSON(json[key], head)
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
            parent.add_child(node)

    return tree

Problem's background

You may wonder why would I need to change a JSON object into a tree. As you may know PostgreSQL provides a way to handle JSON fields in the database. Given a JSON object I can get the value of any field by using -> and ->> notation. Here and here more about the subject. I will be creating new tables based on the fields' names and values. Unfortunately the JSON objects vary to such an extent that I cannot write the .sql code manually - I must find a way to do it automatically.

Let's assume that I want to create a table based on the embedded case. I need to get the following .sql code:

select 
    content_json ->> 'item1' as end1,
    content_json -> 'item_2' ->> 'item_3' as end3
from table_with_json

Substitute content_json for "ROOT" and you can see that each line in SQL code is simply a depth-first traversal from "ROOT" to a leaf (move from the last node to leaf is always annotated with ->> ).

EDIT: In order to make the question more clear I'm adding the target .sql query for the array case. I would like there to be as many queries as there are elements in the array:

select
    content_json ->> 'name' as name,
    content_json -> 'items' -> 1 -> 'item_name' as item_name,
    content_json -> 'items' -> 1 -> 'price' as price,
    content_json -> 'items' -> 1 -> 'units' as units
from table_with_json

select
    content_json ->> 'name' as name,
    content_json -> 'items' -> 2 ->> 'item_name' as item_name,
    content_json -> 'items' -> 2 ->> 'price' as price,
    content_json -> 'items' -> 2 ->> 'units' as units
from table_with_json

Solution so far (07.05.2019)

I'm testing the current solution for the moment:

from collections import OrderedDict

def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
        return OrderedDict((key, treeify(children)) for key, children in data.items())
    elif isinstance(data, list):  # make keys from indices
        return OrderedDict((idx, treeify(children)) for idx, children in enumerate(data, start=1))
    else:  # leave node, no recursion
        return data

def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
        *keys, field = stack
        path = ' -> '.join(
            str(key) if isinstance(key, int) else "'%s'" % key
            for key in keys
        )
        yield path + " ->> '%s' as %s" % (field, field)

def create_select_query(lines_list):
    query = "select\n"
    for line_number in range(len(lines_list)):
        if "_class" in lines_list[line_number]:
            # ignore '_class' fields
            continue
        query += "\t" + lines_list[line_number]
        if line_number == len(lines_list)-1:
            query += "\n"
        else:
            query += ",\n"
    query += "from table_with_json"
    return query

I'm currently working on a JSON like this:

stack_nested_example = {"_class":"value_to_be_ignored","first_key":{"second_key":{"user_id":"123456","company_id":"9876","question":{"subject":"some_subject","case_type":"urgent","from_date":{"year":2011,"month":11,"day":11},"to_date":{"year":2012,"month":12,"day":12}},"third_key":[{"role":"driver","weather":"great"},{"role":"father","weather":"rainy"}]}}}

In the output I get the only constant element is the order of lines treated with array logic. Order of other lines differs. The output I would like to get is the one that takes into account order of the keys:

select
        'content_json' -> 'first_key' -> 'second_key' ->> 'user_id' as user_id,
        'content_json' -> 'first_key' -> 'second_key' ->> 'company_id' as company_id,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'subject' as subject,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'case_type' as case_type,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'year' as year,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'month' as month,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'day' as day,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'year' as year,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'month' as month,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'day' as day,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'role' as role,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'weather' as weather,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'role' as role,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'weather' as weather
from table_with_json

In your create_tree_from_JSON you never pass on the tree during recursion. Yet you try to return it.

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()  # tree is only created for root node
        ...
    else:
        parent = parent  # tree is not created here
    ...
    return tree  # tree is always returned

Either pass on the tree during recursion, or separate the root step from the others:

def create_tree_from_JSON(json):  # root case
    tree = Tree()
    node_0 = TreeNode("ROOT")
    tree.root = node_0
    parent = node_0
    _walk_tree(json, parent)

def _walk_tree(json, parent):  # recursive case
    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            _walk_tree(json[key], head)
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
        parent.add_child(node)

Note that what you are doing can be solved much easier using plain dict s. Your class is effectively just wrapping a custom interface around dict to begin with.

def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
       return {key: treeify(children) for key, children in data.items()}
    elif isinstance(data, list):  # make keys from indices
       return {idx: treeify(children) for idx, children in enumerate(data, start=1)}
    else:  # leave node, no recursion
       return data

You can feed any decoded json data to this.

>>> treeify(json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })
{'name': 'John', 'items': {1: {'item_name': 'lettuce', 'price': 2.65, 'units': 'no'}, 2: {'item_name': 'ketchup', 'price': 1.51, 'units': 'litres'}}}

To get the desired pretty-printed output, you can walk this structure with a stack of current keys. A generator is appropriate to create each query line on the fly:

def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
       *keys, field = stack
       path = ' -> '.join(
           str(key) if isinstance(key, int) else "'%s'" % key
           for key in keys
       )
       yield path + " ->> '%s' as %s" % (field, field)

Given your second example, this allows you to get a list of query lines:

>>> list(format_query(treeify({ "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })))
["'content_json' ->> 'name' as name",
 "'content_json' -> 'items' -> 1 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 1 ->> 'price' as price",
 "'content_json' -> 'items' -> 1 ->> 'units' as units",
 "'content_json' -> 'items' -> 2 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 2 ->> 'price' as price",
 "'content_json' -> 'items' -> 2 ->> 'units' as units"]

You can use recursion:

def format_query(d):
  if all(not isinstance(i, tuple) for i in d):
    return 'select\n{}\nfrom table_with_json'.format(',\n'.join('\tcontent_json {}'.format("->> '{}' as {}".format(i[0], i[0]) if len(i) == 1 else "-> {} ->> '{}' as {}".format(' -> '.join("'{}'".format(j) for j in i[:-1]), i[-1], i[-1])) for i in d))
  return '\n\n'.join(format_query([c for b in i for c in b]) for i in d)

def get_dict(d, c = []):
  for a, b in d.items():
     if not isinstance(b, (dict, list)):
       yield c+[a]
     elif isinstance(b, dict):
       yield from to_query(b, c+[a])

def to_query(d, q = []):
  if not any(isinstance(i, list) for i in d.values()):
     yield from get_dict(d, c=q)
  else:
     _c = list(get_dict(d))
     for a, b in d.items():
       if isinstance(b, list):
         for i, j in enumerate(b, 1):
            yield (_c, list(get_dict(j, [a, i])))

Now, to format:

json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }
print(format_query(list(to_query(json_file))))

Output:

select
      content_json ->> 'name' as name,
      content_json -> 'items' -> '1' ->> 'item_name' as item_name,
      content_json -> 'items' -> '1' ->> 'price' as price,
      content_json -> 'items' -> '1' ->> 'units' as units
from table_with_json

select
     content_json ->> 'name' as name,
     content_json -> 'items' -> '2' ->> 'item_name' as item_name,
     content_json -> 'items' -> '2' ->> 'price' as price,
     content_json -> 'items' -> '2' ->> 'units' as units
from table_with_json

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM