Creating a tree/deeply nested dict from an indented text file in Python

Basically, I want to iterate through a file and put the contents of each line into a deeply nested dict, the structure of which is defined by the amount of whitespace at the start of each line.

Essentially, the aim is to take something like this:

a
    b
        c
    d
        e

And turn it into something like this:

{"a":{"b":"c","d":"e"}}

Or this:

apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10

into this:

{"apple":{"colours":["red","yellow","green"],"type":"granny smith","price":0.10}}

So that I can pass it to Python's json module and produce some JSON.

At the moment I'm trying to build a dict and a list in steps like this:

  1. {"a":""} ["a"]
  2. {"a":"b"} ["a"]
  3. {"a":{"b":"c"}} ["a","b"]
  4. {"a":{"b":{"c":"d"}}} ["a","b","c"]
  5. {"a":{"b":{"c":"d"},"e":""}} ["a","e"]
  6. {"a":{"b":{"c":"d"},"e":"f"}} ["a","e"]
  7. {"a":{"b":{"c":"d"},"e":{"f":"g"}}} ["a","e","f"]

etc.

The list acts like 'breadcrumbs', showing where I last inserted something into the dict.

To do this I need a way to iterate through the list and generate something like dict["a"]["e"]["f"] to get at that last dict. I've had a look at the AutoVivification class that someone has made, which looks very useful; however, I'm really unsure of:

  1. Whether I'm using the right data structure for this (I'm planning to send it to the JSON library to create a JSON object)
  2. How to use AutoVivification in this instance
  3. Whether there's a better way in general to approach this problem.

I came up with the following function but it doesn't work:

def get_nested(dict,array,i):
    if i != None:
        i += 1
        if array[i] in dict:
            return get_nested(dict[array[i]],array)
        else:
            return dict
    else:
        i = 0
        return get_nested(dict[array[i]],array)

Would appreciate help!

(The rest of my extremely incomplete code is here:)

#Import relevant libraries
import codecs
import sys

#Functions
def stripped(str):
    if tab_spaced:
        return str.lstrip('\t').rstrip('\n\r')
    else:
        return str.lstrip().rstrip('\n\r')

def current_ws():
    if whitespacing == 0 or not tab_spaced:
        return len(line) - len(line.lstrip())
    if tab_spaced:
        return len(line) - len(line.lstrip('\t\n\r'))

def get_nested(adict,anarray,i):
    if i != None:
        i += 1
        if anarray[i] in adict:
            return get_nested(adict[anarray[i]],anarray)
        else:
            return adict
    else:
        i = 0
        return get_nested(adict[anarray[i]],anarray)

#initialise variables
jsondict = {}
unclosed_tags = []
debug = []

vividfilename = 'simple.vivid'
# vividfilename = sys.argv[1]
if len(sys.argv)>2:
    jsfilename = sys.argv[2]
else:
    jsfilename = vividfilename.split('.')[0] + '.json'

whitespacing = 0
whitespace_array = [0,0]
tab_spaced = False

#open the file
with codecs.open(vividfilename,'rU', "utf-8-sig") as vividfile:
    for line in vividfile:
        #work out how many whitespaces at start
        whitespace_array.append(current_ws())

        #For first line with whitespace, work out the whitespacing (eg tab vs 4-space)
        if whitespacing == 0 and whitespace_array[-1] > 0:
            whitespacing = whitespace_array[-1]
            if line[0] == '\t':
                tab_spaced = True

        #strip out whitespace at start and end
        stripped_line = stripped(line)

        if whitespace_array[-1] == 0:
            jsondict[stripped_line] = ""
            unclosed_tags.append(stripped_line)

        if whitespace_array[-2] < whitespace_array[-1]:
            oldnested = get_nested(jsondict,whitespace_array,None)
            print oldnested
            # jsondict.pop(unclosed_tags[-1])
            # jsondict[unclosed_tags[-1]]={stripped_line:""}
            # unclosed_tags.append(stripped_line)

        print jsondict
        print unclosed_tags

print jsondict
print unclosed_tags
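
The breadcrumb lookup described in the question (turning a list such as ["a","e","f"] into dict["a"]["e"]["f"]) can be sketched with a plain loop or functools.reduce. This is only a minimal illustration, not code from the original post: the two-argument get_nested and the set_nested helper are hypothetical, and both assume every intermediate value along the path is a dict.

```python
from functools import reduce
import operator


def get_nested(tree, path):
    """Follow each key in `path` and return the innermost value."""
    # reduce applies tree[key] for each key in turn: tree["a"]["e"]["f"]
    return reduce(operator.getitem, path, tree)


def set_nested(tree, path, value):
    """Set tree[path[0]]...[path[-1]] = value, creating dicts as needed.

    Assumes intermediate values are dicts (not placeholder strings).
    """
    for key in path[:-1]:
        tree = tree.setdefault(key, {})
    tree[path[-1]] = value


jsondict = {"a": {"b": {"c": "d"}, "e": {"f": "g"}}}
print(get_nested(jsondict, ["a", "e", "f"]))
set_nested(jsondict, ["a", "e", "h"], "i")
print(jsondict)
```

Keeping the path as a list of keys matches the 'breadcrumbs' idea; the answers below avoid the repeated lookup by keeping references to the nested dicts themselves.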

Here is a recursive solution. First, transform the input in the following way.

Input:

person:
    address:
        street1: 123 Bar St
        street2: 
        city: Madison
        state: WI
        zip: 55555
    web:
        email: boo@baz.com

First-step output:

[{'name':'person','value':'','level':0},
 {'name':'address','value':'','level':1},
 {'name':'street1','value':'123 Bar St','level':2},
 {'name':'street2','value':'','level':2},
 {'name':'city','value':'Madison','level':2},
 {'name':'state','value':'WI','level':2},
 {'name':'zip','value':55555,'level':2},
 {'name':'web','value':'','level':1},
 {'name':'email','value':'boo@baz.com','level':2}]

This is easy to accomplish with split(':') and by counting the number of leading tabs:

def tab_level(astr):
    """Count number of leading tabs in a string
    """
    return len(astr)- len(astr.lstrip('\t'))
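
A possible sketch of that first step, assuming tab-indented "name: value" lines. The function name first_step and its indent parameter are hypothetical, not part of the original answer, and values are left as strings (the sample output above shows zip as a number, which would need an extra conversion step).

```python
def first_step(text, indent="\t"):
    """Turn 'name: value' lines into a list of name/value/level dicts."""
    rows = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        body = raw.lstrip(indent)
        # number of indent units removed from the front of the line
        level = (len(raw) - len(body)) // len(indent)
        # partition splits on the first ':' only, so values may contain ':'
        name, _, value = body.partition(":")
        rows.append({"name": name.strip(),
                     "value": value.strip(),
                     "level": level})
    return rows


sample = "person:\n\taddress:\n\t\tstreet1: 123 Bar St\n\tweb:\n\t\temail: boo@baz.com\n"
for row in first_step(sample):
    print(row)
```

Using partition rather than split(':') is a small deviation chosen here so that a value containing a colon is not cut short.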

Then feed the first-step output into the following function:

def ttree_to_json(ttree,level=0):
    result = {}
    for i in range(0,len(ttree)):
        cn = ttree[i]
        try:
            nn  = ttree[i+1]
        except IndexError:
            nn = {'level':-1}

        # Edge cases
        if cn['level']>level:
            continue
        if cn['level']<level:
            return result

        # Recursion
        if nn['level']==level:
            dict_insert_or_append(result,cn['name'],cn['value'])
        elif nn['level']>level:
            rr = ttree_to_json(ttree[i+1:], level=nn['level'])
            dict_insert_or_append(result,cn['name'],rr)
        else:
            dict_insert_or_append(result,cn['name'],cn['value'])
            return result
    return result

where:

def dict_insert_or_append(adict,key,val):
    """Insert a value in dict at key if one does not exist
    Otherwise, convert value to list and append
    """
    if key in adict:
        if type(adict[key]) != list:
            adict[key] = [adict[key]]
        adict[key].append(val)
    else:
        adict[key] = val

The following code will take a block-indented file and convert it into an XML tree; this:

foo
bar
baz
  ban
  bal

...becomes:

<cmd>foo</cmd>
<cmd>bar</cmd>
<block>
  <name>baz</name>
  <cmd>ban</cmd>
  <cmd>bal</cmd>
</block>

The basic technique is:

  1. Set indent to 0
  2. For each line, get the indent
  3. If > current, step down and save current block/indent on a stack
  4. If == current, append to current block
  5. If < current, pop from the stack until you get to the matching indent

So:

from lxml import builder
C = builder.ElementMaker()

def indent(line):
    strip = line.lstrip()
    return len(line) - len(strip), strip

def parse_blockcfg(data):
    top = current_block = C.config()
    stack = []
    current_indent = 0

    lines = data.split('\n')
    while lines:
        line = lines.pop(0)
        i, line = indent(line)

        if i==current_indent:
            pass

        elif i > current_indent:
            # we've gone down a level, convert the <cmd> to a block
            # and then save the current ident and block to the stack
            prev.tag = 'block'
            prev.append(C.name(prev.text))
            prev.text = None
            stack.insert(0, (current_indent, current_block))
            current_indent = i
            current_block = prev

        elif i < current_indent:
            # we've gone up one or more levels, pop the stack
            # until we find out which level and return to it
            found = False
            while stack:
                parent_indent, parent_block = stack.pop(0)
                if parent_indent==i:
                    found = True
                    break
            if not found:
                raise Exception('indent not found in parent stack')
            current_indent = i
            current_block = parent_block

        prev = C.cmd(line)
        current_block.append(prev)

    return top
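
The same stack technique can also produce nested dicts instead of XML, which is closer to what the question asks for. The following is a rough sketch using only the standard library, not the answer's own code: parse_indented is a hypothetical name, every line becomes a key whose value is a (possibly empty) dict, and unlike parse_blockcfg it pops to the nearest shallower indent rather than raising on a mismatched one.

```python
def parse_indented(text):
    """Build nested dicts from indented lines using an explicit stack."""
    root = {}
    # Each stack entry is (indent, container); the root sits at indent -1
    # so it is never popped.
    stack = [(-1, root)]
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        indent = len(raw) - len(raw.lstrip())
        # Pop until the top of the stack is strictly shallower than this line
        while stack[-1][0] >= indent:
            stack.pop()
        node = {}
        stack[-1][1][raw.strip()] = node
        stack.append((indent, node))
    return root


print(parse_indented("foo\nbar\nbaz\n  ban\n  bal"))
```

Because the stack stores references to the dicts themselves, no path lookup is needed when a new line is attached to its parent.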

Here is an object-oriented approach based on a composite structure of nested Node objects.

Input:

indented_text = \
"""
apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10
"""

A Node class:

class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel: # add node as a child
                self.children.append(node)
            elif node.level > childlevel: # add nodes as grandchildren of the last child
                nodes.insert(0,node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level: # this node is a sibling, no more children
                nodes.insert(0,node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text

To parse the text, first create a root node. Then remove empty lines from the text, create a Node instance for every remaining line, and pass the resulting list to the add_children method of the root node.

root = Node('root')
root.add_children([Node(line) for line in indented_text.splitlines() if line.strip()])
d = root.as_dict()['root']
print(d)

Result:

{'apple': [
  {'colours': ['red', 'yellow', 'green']},
  {'type': 'granny smith'},
  {'price': '0.10'}]
}

I think it should be possible to do this in one step, where you simply call the constructor of Node once with the indented text as an argument.
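
Note that the result above wraps sibling keys in a list of single-key dicts, whereas the question wants them merged into one dict. One possible post-processing step is sketched below; merge_single_key_dicts is a hypothetical helper, not part of the Node class, and it assumes sibling names are unique (a repeated key would silently overwrite the earlier one).

```python
def merge_single_key_dicts(value):
    """Recursively merge lists of dicts into a single dict."""
    if isinstance(value, dict):
        return {k: merge_single_key_dicts(v) for k, v in value.items()}
    if isinstance(value, list):
        items = [merge_single_key_dicts(v) for v in value]
        # A list made up only of dicts is treated as a set of sibling keys
        if items and all(isinstance(v, dict) for v in items):
            merged = {}
            for d in items:
                merged.update(d)
            return merged
        return items  # a list of leaves (e.g. colours) stays a list
    return value


d = {'apple': [
    {'colours': ['red', 'yellow', 'green']},
    {'type': 'granny smith'},
    {'price': '0.10'}]}
print(merge_single_key_dicts(d))
```

The price stays the string '0.10'; converting numeric-looking leaves would be a separate step.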

First of all, don't use array and dict as variable names: dict is a built-in type and array is the name of a standard-library module, so shadowing them may end up in all sorts of chaos.

OK, so if I understand you correctly, you have a tree given in a text file, with parenthood indicated by indentation, and you want to recover the actual tree structure. Right?

Does the following look like a valid outline? I ask because I have trouble putting your current code into context.

result = {}
last_indentation = 0
for l in f:
    (c, i) = parse(l)  # write parse() to return the character and its indentation
    if i == last_indentation:
        pass  # sibling to last
    elif i > last_indentation:
        pass  # child to last
    else:
        pass  # end of children, back to a higher level
    last_indentation = i

OK, then your list holds the current parents, which is in fact right, but I'd have its entries point to the dictionaries you've created, not the literal letters.

Just starting some stuff here:

result = {}
parents = {}
last_indentation = 1  # start with 1 so 0 is the root of tree
parents[0] = result
for l in f:
    (c, i) = parse(l)  # write parse() to return the character and its indentation
    if i == last_indentation:
        new_el = {}
        parents[i-1][c] = new_el
        parents[i] = new_el
    elif i > last_indentation:
        pass  # child to last
    else:
        pass  # end of children, back to a higher level
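
One way the outline above might be completed is sketched here. It rests on the observation that the same three lines handle the sibling, child, and back-up cases alike, because parents[i-1] always refers to the most recent dict one level up. The parse helper, the build_tree name, and the 4-space indent width are assumptions for illustration; leaves come out as empty dicts, and an indent that jumps by more than one level is not handled.

```python
def parse(line, indent_width=4):
    """Hypothetical helper: return (text, level), with level starting at 1."""
    spaces = len(line) - len(line.lstrip(" "))
    return line.strip(), spaces // indent_width + 1


def build_tree(lines, indent_width=4):
    result = {}
    parents = {0: result}  # level 0 is the root of the tree
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        text, level = parse(line, indent_width)
        new_el = {}
        # Attach to the most recent dict one level up, then record this
        # dict as the current parent for its own level.
        parents[level - 1][text] = new_el
        parents[level] = new_el
    return result


print(build_tree(["a", "    b", "        c", "    d", "        e"]))
```

To get the {"a":{"b":"c","d":"e"}} shape from the question, the empty-dict leaves would still need to be collapsed into their parent's value in a second pass.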
