Creating a tree/deeply nested dict from an indented text file in Python

Question

Basically, I want to iterate through a file and put the contents of each line into a deeply nested dict, the structure of which is defined by the amount of whitespace at the start of each line.

Essentially the aim is to take something like this:

a
    b
        c
    d
        e

And turn it into something like this:

{"a":{"b":"c","d":"e"}}

Or this:

apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10

into this:

{"apple":{"colours":["red","yellow","green"],"type":"granny smith","price":0.10}}

So that I can send it to Python's JSON module and make some JSON.

At the moment I'm trying to build a dict and a list in steps like this:

{"a":""} ["a"]
{"a":"b"} ["a"]
{"a":{"b":"c"}} ["a","b"]
{"a":{"b":{"c":"d"}}} ["a","b","c"]
{"a":{"b":{"c":"d"},"e":""}} ["a","e"]
{"a":{"b":{"c":"d"},"e":"f"}} ["a","e"]
{"a":{"b":{"c":"d"},"e":{"f":"g"}}} ["a","e","f"]

etc.

The list acts like 'breadcrumbs' showing where I last put in a dict.

To do this I need a way to iterate through the list and generate something like dict["a"]["e"]["f"] to get at that last dict. I've had a look at the AutoVivification class that someone has made, which looks very useful; however, I'm really unsure of:

Whether I'm using the right data structure for this (I'm planning to send it to the JSON library to create a JSON object)
How to use AutoVivification in this instance
Whether there's a better way in general to approach this problem.

I came up with the following function but it doesn't work:

def get_nested(dict,array,i):
    if i != None:
        i += 1
        if array[i] in dict:
            return get_nested(dict[array[i]],array)
        else:
            return dict
    else:
        i = 0
        return get_nested(dict[array[i]],array)

Would appreciate help!

(The rest of my extremely incomplete code is here:)

#Import relevant libraries
import codecs
import sys

#Functions
def stripped(str):
    if tab_spaced:
        return str.lstrip('\t').rstrip('\n\r')
    else:
        return str.lstrip().rstrip('\n\r')

def current_ws():
    if whitespacing == 0 or not tab_spaced:
        return len(line) - len(line.lstrip())
    if tab_spaced:
        return len(line) - len(line.lstrip('\t\n\r'))

def get_nested(adict,anarray,i):
    if i != None:
        i += 1
        if anarray[i] in adict:
            return get_nested(adict[anarray[i]],anarray)
        else:
            return adict
    else:
        i = 0
        return get_nested(adict[anarray[i]],anarray)

#initialise variables
jsondict = {}
unclosed_tags = []
debug = []

vividfilename = 'simple.vivid'
# vividfilename = sys.argv[1]
if len(sys.argv)>2:
    jsfilename = sys.argv[2]
else:
    jsfilename = vividfilename.split('.')[0] + '.json'

whitespacing = 0
whitespace_array = [0,0]
tab_spaced = False

#open the file
with codecs.open(vividfilename,'rU', "utf-8-sig") as vividfile:
    for line in vividfile:
        #work out how many whitespaces at start
        whitespace_array.append(current_ws())

        #For first line with whitespace, work out the whitespacing (eg tab vs 4-space)
        if whitespacing == 0 and whitespace_array[-1] > 0:
            whitespacing = whitespace_array[-1]
            if line[0] == '\t':
                tab_spaced = True

        #strip out whitespace at start and end
        stripped_line = stripped(line)

        if whitespace_array[-1] == 0:
            jsondict[stripped_line] = ""
            unclosed_tags.append(stripped_line)

        if whitespace_array[-2] < whitespace_array[-1]:
            oldnested = get_nested(jsondict,whitespace_array,None)
            print oldnested
            # jsondict.pop(unclosed_tags[-1])
            # jsondict[unclosed_tags[-1]]={stripped_line:""}
            # unclosed_tags.append(stripped_line)

        print jsondict
        print unclosed_tags

print jsondict
print unclosed_tags

Answer 1

Here is a recursive solution. First, transform the input in the following way.

Input:

person:
    address:
        street1: 123 Bar St
        street2: 
        city: Madison
        state: WI
        zip: 55555
    web:
        email: boo@baz.com

First-step output:

[{'name':'person','value':'','level':0},
 {'name':'address','value':'','level':1},
 {'name':'street1','value':'123 Bar St','level':2},
 {'name':'street2','value':'','level':2},
 {'name':'city','value':'Madison','level':2},
 {'name':'state','value':'WI','level':2},
 {'name':'zip','value':55555,'level':2},
 {'name':'web','value':'','level':1},
 {'name':'email','value':'boo@baz.com','level':2}]

This is easy to accomplish with split(':') and by counting the number of leading tabs:

def tab_level(astr):
    """Count number of leading tabs in a string
    """
    return len(astr)- len(astr.lstrip('\t'))
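
The first step itself is not shown in the answer. A minimal sketch of it, assuming tab-indented "name: value" lines, might look like the following. parse_lines is a hypothetical helper name, and it uses str.partition rather than split(':') so that values containing a colon survive intact. Values are left as strings, so a zip code would come out as '55555' rather than the integer shown in the listing above.

```python
def tab_level(astr):
    """Count number of leading tabs in a string."""
    return len(astr) - len(astr.lstrip('\t'))

def parse_lines(text):
    """Hypothetical helper: turn tab-indented 'name: value' lines
    into a list of {'name', 'value', 'level'} dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only; the value may be empty.
        name, _, value = line.strip().partition(':')
        rows.append({'name': name.strip(),
                     'value': value.strip(),
                     'level': tab_level(line)})
    return rows

sample = ("person:\n"
          "\taddress:\n"
          "\t\tstreet1: 123 Bar St\n"
          "\tweb:\n"
          "\t\temail: boo@baz.com")
rows = parse_lines(sample)
```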

Then feed the first-step output into the following function:

def ttree_to_json(ttree,level=0):
    result = {}
    for i in range(0,len(ttree)):
        cn = ttree[i]
        try:
            nn  = ttree[i+1]
        except IndexError:
            # last node: use a sentinel level below any real one
            nn = {'level':-1}

        # Edge cases
        if cn['level']>level:
            continue
        if cn['level']<level:
            return result

        # Recursion
        if nn['level']==level:
            dict_insert_or_append(result,cn['name'],cn['value'])
        elif nn['level']>level:
            rr = ttree_to_json(ttree[i+1:], level=nn['level'])
            dict_insert_or_append(result,cn['name'],rr)
        else:
            dict_insert_or_append(result,cn['name'],cn['value'])
            return result
    return result

where:

def dict_insert_or_append(adict,key,val):
    """Insert a value in dict at key if one does not exist
    Otherwise, convert value to list and append
    """
    if key in adict:
        if not isinstance(adict[key], list):
            adict[key] = [adict[key]]
        adict[key].append(val)
    else:
        adict[key] = val

Answer 2

The following code will take a block-indented file and convert it into an XML tree; this:

foo
bar
baz
  ban
  bal

...becomes:

<cmd>foo</cmd>
<cmd>bar</cmd>
<block>
  <name>baz</name>
  <cmd>ban</cmd>
  <cmd>bal</cmd>
</block>

The basic technique is:

Set indent to 0
For each line, get the indent
If > current, step down and save the current block/indent on a stack
If == current, append to the current block
If < current, pop from the stack until you get to the matching indent

So:

from lxml import builder
C = builder.ElementMaker()

def indent(line):
    strip = line.lstrip()
    return len(line) - len(strip), strip

def parse_blockcfg(data):
    top = current_block = C.config()
    stack = []
    current_indent = 0

    lines = data.split('\n')
    while lines:
        line = lines.pop(0)
        i, line = indent(line)

        if i==current_indent:
            pass

        elif i > current_indent:
            # we've gone down a level, convert the <cmd> to a block
            # and then save the current indent and block to the stack
            prev.tag = 'block'
            prev.append(C.name(prev.text))
            prev.text = None
            stack.insert(0, (current_indent, current_block))
            current_indent = i
            current_block = prev

        elif i < current_indent:
            # we've gone up one or more levels, pop the stack
            # until we find out which level and return to it
            found = False
            while stack:
                parent_indent, parent_block = stack.pop(0)
                if parent_indent==i:
                    found = True
                    break
            if not found:
                raise Exception('indent not found in parent stack')
            current_indent = i
            current_block = parent_block

        prev = C.cmd(line)
        current_block.append(prev)

    return top
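
The same stack technique can be applied without lxml to get the nested dict the question asks for. The following is a rough sketch under my own assumptions, not part of the answer above: parse_indented is a hypothetical name, every line becomes a key, and leaves end up as empty dicts (collapsing them into strings or lists would be a separate pass).

```python
def parse_indented(text):
    """Build a nested dict from indented text using a stack
    of (indent, dict) pairs."""
    root = {}
    # Sentinel entry: the root is shallower than any real line.
    stack = [(-1, root)]
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        indent = len(raw) - len(raw.lstrip())
        key = raw.strip()
        # Pop until the top of the stack is a strict ancestor of this line.
        while stack[-1][0] >= indent:
            stack.pop()
        node = {}
        stack[-1][1][key] = node      # attach to the current parent
        stack.append((indent, node))  # this line may get children next
    return root

sample = """
apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10
"""
tree = parse_indented(sample)
```

Because the stack stores the indent width seen on each line rather than a level number, this works with tabs or any consistent number of spaces.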

Answer 3

Here is an object-oriented approach based on a composite structure of nested Node objects.

Input:

indented_text = \
"""
apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10
"""

A Node class:

class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel: # add node as a child
                self.children.append(node)
            elif node.level > childlevel: # add nodes as grandchildren of the last child
                nodes.insert(0,node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level: # this node is a sibling, no more children
                nodes.insert(0,node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text

To parse the text, first create a root node. Then remove empty lines from the text, create a Node instance for every remaining line, and pass the resulting list to the add_children method of the root node.

root = Node('root')
root.add_children([Node(line) for line in indented_text.splitlines() if line.strip()])
d = root.as_dict()['root']
print(d)

Result:

{'apple': [
  {'colours': ['red', 'yellow', 'green']},
  {'type': 'granny smith'},
  {'price': '0.10'}]
}
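
Since the question's end goal is JSON, here is a small sketch of the final step. It assumes the result dict shown above, written out literally as d so the snippet stands alone. Note that the price stays a string, because the Node class never converts values.

```python
import json

# The dict printed by the Node example, written out literally.
d = {'apple': [
    {'colours': ['red', 'yellow', 'green']},
    {'type': 'granny smith'},
    {'price': '0.10'}]}

# Serialize to a JSON string; indent=2 just makes it readable.
text = json.dumps(d, indent=2)
print(text)
```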

I think that it should be possible to do it in one step, where you simply call the constructor of Node once, with the indented text as an argument.

Answer 4

First of all, don't use array and dict as variable names: dict is a built-in type in Python (and array is the name of a standard-library module), so shadowing them can lead to all sorts of chaos.

OK, so if I understand you correctly, you have a tree given in a text file, with parenthood indicated by indentation, and you want to recover the actual tree structure. Right?

Does the following look like a valid outline? I ask because I have trouble putting your current code into context.

result = {}
last_indentation = 0
for l in f:  # f is the open file
    (c, i) = parse(l)  # parse() is to be written: returns the text and its indentation
    if i == last_indentation:
        pass  # sibling to last
    elif i > last_indentation:
        pass  # child to last
    else:
        pass  # end of children, back to a higher level
    last_indentation = i

OK, then your list holds the current parents, which is in fact right - but I'd keep them pointed at the dictionaries you've created, not at the literal letters.
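
For comparison, if the breadcrumbs do stay as a list of keys, reaching the innermost dict means walking those keys each time. A minimal sketch of that walk, assuming every key in the list exists (get_nested here is a hypothetical rewrite, not the asker's function):

```python
from functools import reduce
import operator

def get_nested(tree, breadcrumbs):
    """Follow a list of keys down a nested dict and
    return whatever is stored at the end of the path."""
    return reduce(operator.getitem, breadcrumbs, tree)

tree = {"a": {"b": {"c": "d"}, "e": {"f": {}}}}
innermost = get_nested(tree, ["a", "e", "f"])
innermost["g"] = "h"  # mutates tree in place, since dicts are shared references
```

Keeping references to the dicts themselves avoids repeating this walk for every line.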

Just starting some stuff here:

result = {}
parents = {}
last_indentation = 1  # start with 1 so 0 is the root of tree
parents[0] = result
for l in f:  # f is the open file
    (c, i) = parse(l)  # parse() is to be written: returns the text and its indentation
    if i == last_indentation:
        new_el = {}
        parents[i-1][c] = new_el
        parents[i] = new_el
    elif i > last_indentation:
        pass  # child to last
    else:
        pass  # end of children, back to a higher level
    last_indentation = i
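
One way the outline above might be completed: since parents maps each level to the most recent dict at that level, the same three lines appear to cover the sibling, child, and back-up cases alike. This is a sketch under my own assumptions, not the answerer's finished code: parse and build_tree are hypothetical, indentation is taken to be a fixed number of spaces (4 by default), and levels never jump by more than one.

```python
def parse(line, indent_width=4):
    """Hypothetical helper: return (text, level),
    where unindented lines are level 1."""
    stripped = line.lstrip(' ')
    level = (len(line) - len(stripped)) // indent_width + 1
    return stripped.strip(), level

def build_tree(lines, indent_width=4):
    result = {}
    parents = {0: result}  # level 0 is the root container
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        text, level = parse(line, indent_width)
        new_el = {}
        parents[level - 1][text] = new_el  # attach under the latest parent
        parents[level] = new_el            # becomes the parent for deeper lines
    return result

lines = [
    "apple",
    "    colours",
    "        red",
    "        yellow",
    "    type",
    "        granny smith",
]
tree = build_tree(lines)
```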

Answer 1: score 5, accepted, 2014-07-26 01:05:24
Answer 2: score 2, 2014-03-14 15:40:04
Answer 3: score 2, 2018-11-16 22:27:43
Answer 4: score 0, 2013-07-25 13:06:36
