Python: Parse text file and create tree structure

What's the best way to parse a plain text tree structure like this:

node1:
    node1
    node2:
        node1
node2:
    node1
    node2
    node3:
        node1:
            node1
        node2:
            node1
            node2

and convert it into a tree structure (with lists or dictionaries)?

Is there any Python library to help me with the parsing?

The rson library will do this, except you'll probably have to subclass the parser to allow the mixing of array and dict style elements in a single structure.

Edit: Actually, that might be a bit difficult, but the rsonlite package will (sort of) work as-is with your data.

rsonlite is a small, single-module package that is only 300 lines long, and the same source works with both Python 2 and Python 3.

Here is an example that shows three different outputs from your data. The first output is what rsonlite.loads() gives, the second is what the slightly higher-level rsonlite.simpleparse() gives, and the third takes the results from simpleparse and runs them through a custom fixup() function to create a pure nested dictionary data structure, where any missing value is simply set to None, and all the colon characters are checked and stripped.

from rsonlite import loads, simpleparse


mystring = '''
node1:
    node1
    node2:
        node1
node2:
    node1
    node2
    node3:
        node1:
            node1
        node2:
            node1
            node2
'''

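def fixup(node):
    # A bare string is a leaf value; return it unchanged.
    if isinstance(node, str):
        return node
    # A dict maps 'name:' keys to subtrees: check and strip the colons.
    elif isinstance(node, dict):
        for key in node:
            assert key.endswith(':'), key
        return dict((key[:-1], fixup(value)) for key, value in node.items())
    # Otherwise it is a list mixing plain strings and dicts: merge into one dict.
    else:
        assert isinstance(node, list)
        result = {}
        for item in node:
            if isinstance(item, str):
                # A plain item has no value of its own, so store None.
                assert not item.endswith(':')
                assert result.setdefault(item, None) is None
            else:
                # Merge a nested dict's entries, rejecting conflicting duplicates.
                for key, value in fixup(item).items():
                    assert result.setdefault(key, value) is value
        return result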

print('')
print(loads(mystring))
print('')
print(simpleparse(mystring))
print('')
print(fixup(simpleparse(mystring)))
print('')

Will give:

[('node1:', ['node1', ('node2:', ['node1'])]), ('node2:', ['node1', 'node2', ('node3:', [('node1:', ['node1']), ('node2:', ['node1', 'node2'])])])]

OrderedDict([('node1:', ['node1', OrderedDict([('node2:', 'node1')])]), ('node2:', ['node1', 'node2', OrderedDict([('node3:', OrderedDict([('node1:', 'node1'), ('node2:', ['node1', 'node2'])]))])])])

{'node1': {'node1': None, 'node2': 'node1'}, 'node2': {'node1': None, 'node2': None, 'node3': {'node1': 'node1', 'node2': {'node1': None, 'node2': None}}}}
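As a rough illustration of how the nested dictionary above might be consumed, the sketch below walks it and lists every root-to-leaf path. The helper name iter_paths is hypothetical (it is not part of rsonlite), the tree literal is simply the third output copied from above, and Python 3.7+ is assumed so that dict order is preserved.

```python
def iter_paths(tree, prefix=()):
    """Yield every root-to-leaf path in the nested dict as a tuple of names.

    Hypothetical helper for illustration; not part of rsonlite.
    """
    for name, child in tree.items():
        path = prefix + (name,)
        if child is None:               # plain leaf with no value
            yield path
        elif isinstance(child, str):    # block holding a single leaf
            yield path + (child,)
        else:                           # nested block: recurse into it
            for sub_path in iter_paths(child, path):
                yield sub_path


# The third output shown above, copied verbatim.
tree = {'node1': {'node1': None, 'node2': 'node1'},
        'node2': {'node1': None, 'node2': None,
                  'node3': {'node1': 'node1',
                            'node2': {'node1': None, 'node2': None}}}}

paths = list(iter_paths(tree))
for path in paths:
    print('/'.join(path))
```

Note that a block with exactly one child comes back from simpleparse as a plain string rather than a list, which is why the walker needs the separate str branch.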

You could construct a simple parser that generates a valid Python expression from the input and then evaluate it. My initial thought was a simple recursive parser, but that is more difficult than I anticipated, because there is no way to know a block is ending without peeking ahead, a common problem with indent-based formats.

This generates a nested list of tuples (block_name, [contents]):

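from pprint import pprint

# Assumes `mystring` holds the indented text and each level is 4 spaces.
i = 0                    # indent level of the previous line
r = '['                  # Python expression being built
for l in mystring.split('\n'):
    if not l:
        continue
    cl = l.lstrip(' ')
    ci = (len(l) - len(cl))//4
    if ci > i:           # line indented
        r += '['
    elif ci < i:         # line unindented, can be multiple
        r += '])'*(i-ci) + ','
    if cl[-1] == ':':    # new block
        r += '("{}",'.format(cl[:-1])
    else:                # new item
        r += '"{}",'.format(cl)
    i = ci
r += ']'+')]'*i
pprint(eval(r))          # eval is only safe on trusted input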

Output:

[('node1', ['node1', ('node2', ['node1'])]),
 ('node2',
  ['node1',
   'node2',
   ('node3', [('node1', ['node1']), ('node2', ['node1', 'node2'])])])]
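The lookahead problem mentioned above can also be avoided, without eval, by keeping an explicit stack of the currently open blocks. The following is a minimal sketch rather than code from either answer: the function name parse_indented and its indent parameter are made up for illustration, and it assumes consistent space indentation (4 spaces per level by default).

```python
def parse_indented(text, indent=4):
    """Parse an indented outline into nested (name, [children]) tuples.

    Lines ending in ':' open a block; all other lines are string leaves.
    Hypothetical helper; assumes `indent` spaces per nesting level.
    """
    root = []
    stack = [root]  # stack[d] is the list that lines at depth d append to
    for line in text.splitlines():
        content = line.strip()
        if not content:
            continue
        depth = (len(line) - len(line.lstrip(' '))) // indent
        if depth >= len(stack):
            raise ValueError('unexpected indentation: %r' % line)
        del stack[depth + 1:]       # close every block deeper than this line
        if content.endswith(':'):   # new block: record it and open its list
            children = []
            stack[depth].append((content[:-1], children))
            stack.append(children)
        else:                       # plain item
            stack[depth].append(content)
    return root


sample = '''
node1:
    node1
    node2:
        node1
node2:
    node1
    node2
    node3:
        node1:
            node1
        node2:
            node1
            node2
'''

tree = parse_indented(sample)
print(tree)
```

Because each line only needs to know its own depth, a dedent closes blocks by truncating the stack, so no peeking at the next line is required.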
