
Parsing a language using Scala parser combinators

I have the following template:

#foo(args)# // START CONTAINER1
  #foo(foo <- foos)(args)# // BLOCK STARTS HERE (`args` can be on either side of `block`)
     #bar(args)# // START CONTAINER2
     #.bar# // END CONTAINER2
  #.foo# // END BLOCK
#.foo# // END CONTAINER1

*Notice how #.foo# closes both the container and the block.

The trouble I see is that nothing uniquely identifies each block, so I have to keep track of how many container openers and closers ( #foo# / #.foo# ) I have seen. Otherwise the END CONTAINER hash of a container nested inside a block could be mistaken by the parser for the end of the block.

How would I use Scala's parsers to parse blocks in a language like this?


I started off with this:

def maybeBlockMaybeJustContainer:Content = {
  (openingHash ~ identifier ~ opt(args) ~> opt(blockName) <~ opt(args) ~ closingHash) ~
      opt(content) ~
  openingHash ~ dot ~ identifier ~ closingHash ^^ ...
}

I'm also thinking about preprocessing it, but I'm not sure where to start.

For your language, construct something similar to BNF, in the form:

//Each of these is of type Parser (or String, which will be implicitly converted to Parser when needed).
lazy val container = containerHeader ~ containerBody ~ containerEnd
lazy val containerHeader = hash ~ identifier ~ opt(args) ~ hash
lazy val containerBody = rep(block)
....
lazy val identifier = regex(new Regex("[a-zA-Z0-9-]+"))
lazy val hash = "#"

If your parser accepts a string, then the string is a member of the language defined by this parser.
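To make the sketch above concrete, here is a minimal recogniser you could start testing with. It is only a sketch under assumptions: it needs the scala-parser-combinators module on the classpath, the object name TemplateGrammar and the helper accepts are made up for this example, args is assumed to be any parenthesised text without nested parentheses, and the // comments and free text from the template are not handled.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical recogniser: it only answers "is this string in the language?".
object TemplateGrammar extends RegexParsers {
  lazy val hash: Parser[String] = "#"
  lazy val identifier: Parser[String] = "[a-zA-Z0-9-]+".r
  // Assumption: an argument list is parenthesised text without nested parentheses.
  lazy val args: Parser[String] = "(" ~> "[^()]*".r <~ ")"
  // #name#, #name(args)# or #name(x <- xs)(args)#
  lazy val header: Parser[Any] = hash ~ identifier ~ rep(args) ~ hash
  // #.name#
  lazy val end: Parser[Any] = hash ~ "." ~ identifier ~ hash
  // An element is a header, any number of nested elements, then a closer.
  lazy val element: Parser[Any] = header ~ rep(element) ~ end

  // parseAll only succeeds if the whole input is consumed.
  def accepts(input: String): Boolean = parseAll(rep(element), input).successful
}
```

Note that this version does not check that the name in a closer matches the name in its opener; it only checks that openers and closers are balanced.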

This is a parser for a context-free language. Context-free languages include those of the form a[x]Sb[x], where [x] indicates that the preceding symbol is repeated x times, and x is not fixed by the grammar but varies from string to string. (If x were fixed by the grammar, the language would be finite, and all finite languages are regular.)

This means that the language allows for nesting, or recursive components, such as your blocks and containers.

If you start parsing a container, then a block inside that container, you will not finish parsing the container until the block has been fully parsed. This is true for all strings in your language.
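Because of that nesting, you do not need to count openers and closers yourself: each closer is consumed by the innermost element that is still open. If you also want the closer's name to agree with its opener, one option is the >> (into) combinator, which builds the rest of the parser from the name that was just read. A rough sketch, where Node, NestedTemplate, close and parseTemplate are hypothetical names, the scala-parser-combinators module is assumed to be available, and comments and free text are again ignored:

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical tree type for this sketch.
case class Node(name: String, children: List[Node])

object NestedTemplate extends RegexParsers {
  lazy val identifier: Parser[String] = "[a-zA-Z0-9-]+".r
  // Assumption: an argument list is parenthesised text without nested parentheses.
  lazy val args: Parser[String] = "(" ~> "[^()]*".r <~ ")"
  // Opener such as #foo(args)#; only the name is kept.
  lazy val open: Parser[String] = "#" ~> identifier <~ rep(args) <~ "#"
  // Closer that only accepts the given name, e.g. #.foo# for "foo".
  def close(name: String): Parser[String] = "#" ~> "." ~> literal(name) <~ "#"

  // After reading an opener, parse nested nodes, then demand the matching closer.
  lazy val node: Parser[Node] = open >> { name =>
    rep(node) <~ close(name) ^^ { kids => Node(name, kids) }
  }

  def parseTemplate(input: String): ParseResult[List[Node]] =
    parseAll(rep(node), input)
}
```

With this, an inner #.bar# can only ever close the inner #bar#, and something like #foo# #.bar# is rejected instead of being silently accepted.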

Once you have your grammar defined and it is properly accepting and rejecting test cases, you can work on hooking it up to your AST.

lazy val identifier:Parser[Identifier] = regex(new Regex("[a-zA-Z0-9-]+")) ^^ {case s => Identifier(s)}

Note how this now has the type Parser[Identifier], i.e. it is a parser that, if it parses successfully, returns an Identifier. The same approach is used in more complex cases, such as:

lazy val container:Parser[Container] = containerHeader ~ containerBody ~ containerEnd ^^ {case head ~ body ~ end => Container(head.identifier,body)}
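Put together, the typed version might look roughly like the following. Treat it as a sketch, not the one right way to do it: the Header case class, the shapes of Identifier and Container, and the object name TemplateAst are my own assumptions (only the names Identifier and Container appear above), args is again assumed to be simple parenthesised text, and the scala-parser-combinators module has to be on the classpath.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical AST types; adjust them to whatever your real AST looks like.
case class Identifier(name: String)
case class Header(identifier: Identifier, args: List[String])
case class Container(identifier: Identifier, body: List[Container])

object TemplateAst extends RegexParsers {
  lazy val hash: Parser[String] = "#"
  lazy val identifier: Parser[Identifier] =
    "[a-zA-Z0-9-]+".r ^^ { s => Identifier(s) }
  // Assumption: an argument list is parenthesised text without nested parentheses.
  lazy val args: Parser[String] = "(" ~> "[^()]*".r <~ ")"

  // ~> and <~ drop the hashes, so only the identifier and arguments reach ^^.
  lazy val containerHeader: Parser[Header] =
    hash ~> identifier ~ rep(args) <~ hash ^^ { case id ~ as => Header(id, as) }
  lazy val containerEnd: Parser[Identifier] =
    hash ~> "." ~> identifier <~ hash

  // The body is simply zero or more nested containers.
  lazy val container: Parser[Container] =
    containerHeader ~ rep(container) ~ containerEnd ^^ {
      case head ~ body ~ _ => Container(head.identifier, body)
    }
}
```

You would call it as TemplateAst.parseAll(TemplateAst.container, input) and pattern match on the ParseResult you get back.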

Let me know if any of this needs to be expanded upon.
