Pyparsing: nested Markdown emphasis

I'm noodling around with some simple Markdown text to play with and learn Pyparsing and grammars in general. I've run into a problem almost immediately that I'm having trouble solving. I'm trying to parse a simple version of the CommonMark spec for emphasis. In this setup, nested emphasis is allowed, so that

*foo *bar* baz*

should give:

<em>foo <em>bar</em> baz</em>

I've tried using a recursive definition to match this, but it's not working. Here's some sample code:

from pyparsing import *

text = Word(printables,excludeChars="*")
enclosed = Forward()
emphasis = QuotedString("*").setParseAction(lambda x: "<em>%s</em>" % x[0],contents=enclosed)
enclosed << emphasis | text

test = """
*foo *bar* bar*
"""

print(emphasis.transformString(test))

But what I get back from this is:

<em>foo </em>bar<em> bar</em>

Forgive my noobishness; can someone point me in the right direction?

Edit:

In response to abarnert's great probing question, I'll provide clarification. I'm just playing around, so I can use an arbitrarily restricted form of the notation. I'll assume that only single '*'s occur, and that they don't occur next to each other. That leaves the whitespace to disambiguate: a * not followed by whitespace opens emphasis, and a * not preceded by whitespace closes it.

Even with that, I'm not sure how to proceed with Pyparsing. Some sort of stack-based approach, pushing opening * and popping them when they validate as closing? How would one do that with Pyparsing? Or is there a more efficient approach?

Think about what you're asking for. When does a second * close emphasis, and when does it open a nested emphasis? You have written no rules to distinguish that. Since it's always 100% ambiguous, that means the only possible outcomes you can get are:

  • No emphasis can ever be closed, or
  • No emphasis can ever be nested.

I doubt you're asking how to switch from the second to the first.

So then what are you asking for?

You need to implement some kind of rule to disambiguate these two possibilities.

In fact, if you read the docs you linked to, they have a complicated set of rules that define exactly when a * can open emphasis and when it can't, and likewise for closing; given those rules, if it's still ambiguous, it closes emphasis. You have to implement that.
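As a rough illustration of what such a rule could look like, here is a minimal plain-Python sketch of the restricted rule from the question's edit (a * not followed by whitespace may open, a * not preceded by whitespace may close), with the tie-break described above: if a * could do both, it closes. The names star_role and roles are made up for this example, and this is far simpler than the real CommonMark rules.

```python
def star_role(text, i):
    """Classify the '*' at index i as 'open', 'close', or None (literal)."""
    # Assumption: the start and end of the text count as whitespace.
    before = text[i - 1] if i > 0 else " "
    after = text[i + 1] if i + 1 < len(text) else " "
    can_open = not after.isspace()
    can_close = not before.isspace()
    if can_close:
        # Covers the ambiguous case too (can_open and can_close): it closes.
        return "close"
    if can_open:
        return "open"
    return None  # surrounded by whitespace: treat as a literal '*'


def roles(text):
    """Return the role of every '*' in text, in order of appearance."""
    return [star_role(text, i) for i, ch in enumerate(text) if ch == "*"]


print(roles("*foo *bar* baz*"))  # ['open', 'open', 'close', 'close']
```

With a rule like this, each * has exactly one role, so the "open or close?" ambiguity goes away before any parsing starts.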

With those additional rules, I don't think you need to worry about the recursion at all. Just handle the opening and closing emphasis expressions as they are found, whether they match up or not:

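from pyparsing import *

# Opening marker: a '*' at the start of a line or right after whitespace.
# The matched whitespace is kept and the '*' is replaced by '<em>'.
openEmphasis = (LineStart() | White()) + Suppress('*')
openEmphasis.setParseAction(lambda x: ''.join(x.asList()+['<em>']))
# Closing marker: a '*' followed by whitespace or the end of a line.
closeEmphasis = '*' + FollowedBy(White() | LineEnd())
closeEmphasis.setParseAction(lambda x: '</em>')

# leaveWhitespace() keeps pyparsing from skipping whitespace, so White() can match it
emphasis = (openEmphasis | closeEmphasis).leaveWhitespace()

test = """
*foo *bar* bar*
"""
print(test)
print(emphasis.transformString(test))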

Prints:

*foo *bar* bar*

<em>foo <em>bar</em> bar</em>
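Because the transformString approach above rewrites each marker on its own, an unbalanced input would presumably come out with unbalanced tags. If that matters, the stack idea from the question can be sketched in plain Python rather than pyparsing. This is only an illustration under the same whitespace assumptions; render_emphasis is a hypothetical helper, not part of pyparsing.

```python
def render_emphasis(text):
    """Replace matched '*' pairs with <em>...</em>; leave unmatched '*' alone."""
    out = []         # output pieces, one per input character or tag
    open_slots = []  # stack of positions in `out` holding unclosed openers
    for i, ch in enumerate(text):
        if ch != "*":
            out.append(ch)
            continue
        # Assumption: the start and end of the text count as whitespace.
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if not before.isspace() and open_slots:
            # Closer with a pending opener: confirm the innermost pair.
            out[open_slots.pop()] = "<em>"
            out.append("</em>")
        elif before.isspace() and not after.isspace():
            # Possible opener: keep a literal '*' for now and remember its slot.
            open_slots.append(len(out))
            out.append("*")
        else:
            out.append("*")  # neither a usable opener nor closer: literal
    return "".join(out)


print(render_emphasis("*foo *bar* bar*"))  # <em>foo <em>bar</em> bar</em>
print(render_emphasis("*foo"))             # *foo
```

An opener is only turned into a tag once its closer is seen, so stray markers stay in the text as literal asterisks.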

You are not the first to trip over this kind of application. When I presented at PyCon '06, an eager attendee dove right in to parse out some markdown, with an input string something like "****a** b**** c**". We worked on it a bit together, but the disambiguation rules were just too context-aware for a basic pyparsing parser to handle.
