the trick to nested structures in pyparsing

I am struggling to parse nested structures with pyparsing. I've looked through many of the 'nested' pyparsing examples, but I don't see how to fix my problem.

Here is what my internal structure looks like:

texture_unit optionalName
{
    texture required_val
    prop_name1 prop_val1
    prop_name2 prop_val1
}

Here is what my external structure looks like. It can contain zero or more of the internal structures.

pass optionalName
{
    prop_name1 prop_val1
    prop_name2 prop_val1

    texture_unit optionalName
    {
        // edit 2: showing use of '.' character in value
        texture required_val.file.name optional_val // edit 1: forgot this line in initial post.

        // edit 2: showing potentially multiple values
        prop_name3 prop_val1 prop_val2
        prop_name4 prop_val1
    }
}

I am successfully parsing the internal structure. Here is my code for that.

prop_ = pp.Group(pp.Word(pp.alphanums+'_')+pp.Group(pp.OneOrMore(pp.Word(pp.alphanums+'_'+'.'))))
texture_props_ = pp.Group(pp.Literal('texture') + pp.Word(pp.alphanums+'_'+'.')) + pp.ZeroOrMore(prop_)
texture_ = pp.Forward()
texture_ << pp.Literal('texture_unit').suppress() + pp.Optional(pp.Word(pp.alphanums+'_')).suppress() + pp.Literal('{').suppress() + texture_props_ + pp.Literal('}').suppress()

Here is my attempt to parse the outer structure:

pass_props_ = pp.ZeroOrMore(prop_)
pass_ = pp.Forward()
pass_ << pp.Literal('pass').suppress() + pp.Optional(pp.Word(pp.alphanums+'_'+'.')).suppress() + pp.Literal('{').suppress() + pass_props_ + pp.ZeroOrMore(texture_) + pp.Literal('}').suppress()

When I call pass_.parseString(testPassStr), I see errors in the console saying that "}" was expected.

I see this as very similar to the C struct example, but I'm not sure what the missing magic is. I'm also curious how to control the resulting data structure when using nestedExpr.

There are two problems:

  1. Your grammar marks the texture literal as required in a texture_unit block, but there is no texture line in your second example.
  2. In the second example, pass_props_ also matches texture_unit optionalName, because prop_ accepts any word as a property name. After that, pp.Literal('}') expects } but finds { instead. This is the reason for the error.

We can check this by cutting the pass_ rule off right after pass_props_:

pass_ << pp.Literal('pass').suppress() + pp.Optional(pp.Word(pp.alphanums+'_'+'.')).suppress() + \
             pp.Literal('{').suppress() + pass_props_

print(pass_.parseString(s2))

This gives the following output:

[['prop_name', ['prop_val', 'prop_name', 'prop_val', 'texture_unit', 'optionalName']]]

We can see that pass_props_ has consumed texture_unit optionalName.
So what we want is this: prop_ can contain alphanumerics, _ and ., but must not match the texture_unit literal. We can do that with a regex and a negative lookahead:

prop_ = pp.Group(  pp.Regex(r'(?!texture_unit)[a-z0-9_]+')+ pp.Group(pp.OneOrMore(pp.Regex(r'(?!texture_unit)[a-z0-9_.]+'))) )
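To see what the negative lookahead does on its own, here is a minimal sketch using only the standard re module with the same pattern as above. The helper match_name is a made-up name for illustration. Note that the character class is lowercase-only, so unlike pp.alphanums it stops at the first uppercase letter.

```python
import re

# Same pattern as the property-name regex above.
name_re = re.compile(r'(?!texture_unit)[a-z0-9_]+')

def match_name(text):
    """Hypothetical helper: return the name matched at the start of text, or None."""
    m = name_re.match(text)
    return m.group(0) if m else None

print(match_name('prop_name1 prop_val1'))   # prop_name1
print(match_name('texture required_val'))   # texture
print(match_name('texture_unit optional'))  # None (rejected by the lookahead)
print(match_name('optionalName'))           # optional (stops at the uppercase N)
```

If property names may contain uppercase letters, the class would need to be widened, for example to [A-Za-z0-9_].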

Finally, a working example looks like this:

import pyparsing as pp

s1 = '''texture_unit optionalName
    {
    texture required_val
    prop_name prop_val
    prop_name prop_val
}'''

prop_ = pp.Group(  pp.Regex(r'(?!texture_unit)[a-z0-9_]+')+ pp.Group(pp.OneOrMore(pp.Regex(r'(?!texture_unit)[a-z0-9_.]+'))) )
texture_props_ = pp.Group(pp.Literal('texture') + pp.Word(pp.alphanums+'_'+'.')) + pp.ZeroOrMore(prop_)
texture_ = pp.Forward()
# use <<= so the Forward is filled in rather than replaced by a plain assignment
texture_ <<= pp.Literal('texture_unit').suppress() + pp.Word(pp.alphanums+'_').suppress() +\
           pp.Literal('{').suppress() + pp.Optional(texture_props_) + pp.Literal('}').suppress()

print(texture_.parseString(s1))

s2 = '''pass optionalName
{
    prop_name1 prop_val1.name
    texture_unit optionalName1
    {
        texture required_val1
        prop_name2 prop_val12
        prop_name3 prop_val13
    }
    texture_unit optionalName2
    {
        texture required_va2l
        prop_name2 prop_val22
        prop_name3 prop_val23
    }
}'''

pass_props_ = pp.ZeroOrMore(prop_  )
pass_ = pp.Forward()

# use <<= so the Forward is filled in rather than replaced by a plain assignment
pass_ <<= pp.Literal('pass').suppress() + pp.Optional(pp.Word(pp.alphanums+'_'+'.')).suppress() +\
        pp.Literal('{').suppress() + pass_props_ + pp.ZeroOrMore(texture_ ) + pp.Literal('}').suppress()

print(pass_.parseString(s2))

Output:

[['texture', 'required_val'], ['prop_name', ['prop_val', 'prop_name', 'prop_val']]]
[['prop_name1', ['prop_val1.name']], ['texture', 'required_val1'], ['prop_name2', ['prop_val12', 'prop_name3', 'prop_val13']], ['texture', 'required_va2l'], ['prop_name2', ['prop_val22', 'prop_name3', 'prop_val23']]]

The answer I was looking for is related to the use of the 'Forward' parser, shown in the C struct example (linked in the OP).

The hard part of defining a grammar for a nested structure is defining all the possible member types of the structure, which must include the structure itself, which is not yet defined.

The "trick" to defining the pyparsing grammar for a nested structure is to delay the definition of the structure, but include a "forward declared" version of the structure when defining the structure members, so the members can also include a structure. Then complete the structure grammar as a list of members.

struct = Forward()
member = blah | blah2 | struct
struct << ZeroOrMore( Group(member) )
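As a runnable illustration of the pattern above, here is a small sketch for a made-up brace syntax. The names block, prop, member and the sample text are assumptions for the example, not the grammar from the question.

```python
import pyparsing as pp

LBRACE, RBRACE = pp.Suppress("{"), pp.Suppress("}")
name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
value = pp.Word(pp.alphanums + "_.")

# Forward-declare the block so that members can refer to it.
block = pp.Forward()
prop = pp.Group(name + value)
# Try a nested block first; fall back to a simple "name value" property.
member = block | prop
# Now complete the definition: a block is a name plus a braced list of members.
block <<= pp.Group(name + LBRACE + pp.Group(pp.ZeroOrMore(member)) + RBRACE)

text = "outer { a 1 inner { b 2 } }"
result = block.parseString(text, parseAll=True).asList()
print(result)
# [['outer', [['a', '1'], ['inner', [['b', '2']]]]]]
```

Each Group adds one level of list nesting, so the shape of the result can be controlled by where the Group calls are placed.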

This is also discussed over here: Pyparsing: Parsing semi-JSON nested plaintext data to a list

The OP (mine) described test data and a grammar that were not specific enough, so the grammar matched when it should have failed. @NorthCat correctly spotted the undesired matches in the grammar. However, the suggestion to define many negative lookaheads seemed unmanageable.

Instead of defining what should not match, my solution explicitly listed the possible matches. The matches were member keywords, defined with oneOf('list of words separated by spaces'). Once I had specified all the possible matches, I realized that my structure was not truly nested: it has a finite depth, with a different grammar at each depth. So my member definitions did not require the Forward declaration trick.

The terminator of my member definitions was different from the one in the C struct example. Instead of terminating with a ';' (semicolon) as in C++, my member definitions needed to terminate at the end of the line. In pyparsing, you can specify the end of the line with the LineEnd parser. So I defined my members as a list of values NOT including the LineEnd, like this. Notice the use of the "not" (~) operator in the last definition:

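from pyparsing import Combine, Group, LineEnd, OneOrMore, Optional, Word, alphanums, alphas, nums, oneOf

EOL = LineEnd().suppress()
ident = Word( alphas+"_", alphanums+"_$@#." )
integer = Word(nums)
real = Combine(Optional(oneOf('+ -')) + Word(nums) + '.' + Optional(Word(nums)))
# try real before integer so that "0.5" is not split at the decimal point
propVal = real | integer | ident
# keep collecting values until the end of the current line
propList = Group(OneOrMore(~EOL + propVal))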
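To show how these pieces might fit into a complete fixed-depth grammar, here is a rough sketch. The keyword lists (pass_key, texture_key), the block names and the sample text are placeholders I made up for illustration; the real property names would go in their place.

```python
from pyparsing import (Combine, Group, Keyword, LineEnd, OneOrMore, Optional,
                       Suppress, Word, ZeroOrMore, alphanums, alphas, nums, oneOf)

# Value grammar as above: a property's values stop at the end of the line.
EOL = LineEnd().suppress()
ident = Word(alphas + "_", alphanums + "_$@#.")
integer = Word(nums)
real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Optional(Word(nums)))
prop_val = real | integer | ident
prop_list = Group(OneOrMore(~EOL + prop_val))

# Hypothetical keyword lists, one per nesting level.
pass_key = oneOf("ambient diffuse")
texture_key = oneOf("texture filtering")

LBRACE, RBRACE = Suppress("{"), Suppress("}")
# Inner level: texture_unit [name] { texture-level properties }
texture_unit = Group(
    Keyword("texture_unit").suppress() + Optional(ident)
    + LBRACE + ZeroOrMore(Group(texture_key + prop_list)) + RBRACE
)
# Outer level: pass [name] { pass-level properties, then texture units }
pass_block = (
    Keyword("pass").suppress() + Optional(ident)
    + LBRACE
    + Group(ZeroOrMore(Group(pass_key + prop_list)))
    + Group(ZeroOrMore(texture_unit))
    + RBRACE
)

sample = """\
pass main
{
    ambient 0.5 0.5 0.5
    texture_unit diffuse_map
    {
        texture wall.png 2
        filtering linear linear none
    }
}
"""
result = pass_block.parseString(sample, parseAll=True).asList()
print(result)
```

Because each level has its own keyword list, texture_unit is never mistaken for a pass-level property, and no Forward is needed. One caveat: oneOf matches literal text, so a keyword that is a prefix of another word at the same level (such as texture and texture_unit) would need Keyword instead.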
