How to restrict nested markup in Regex and Parser combinator?

I would like to implement a simple wiki-like markup parser as an exercise in using Scala parser combinators.

I would like to solve this bit by bit, so here is what I want to achieve in the first version: a simple inline literal markup.

For example, if the input string is:

This is a sytax test ``code here`` . Hello ``World``

The output string should be:

This is a sytax test <code>code here</code> . Hello <code>World</code>

I tried to solve this using RegexParsers, and here is what I have so far:

import scala.util.parsing.combinator._
import scala.util.parsing.input._

object TestParser extends RegexParsers
{   
    override val skipWhitespace = false

    def toHTML(s: String) = "<code>" + s.drop(2).dropRight(2) + "</code>"

    val words = """(.)""".r
    val literal = """\B``(.)*``\B""".r ^^ toHTML

    val markup = (literal | words)*

    def run(s: String) = parseAll(markup, s) match {
        case Success(xs, next) => xs.mkString
        case _ => "fail"
    }
}

println (TestParser.run("This is a sytax test ``code here`` . Hello ``World``"))

With this code, a simpler input that contains only one <code> markup works fine, for example:

This is a sytax test ``code here``.

becomes

This is a sytax test <code>code here</code>.

But when I run it with the example above, it yields

This is a sytax test <code>code here`` . Hello ``World</code>

I think this is because the regex I use:

"""\B``(.)*``\B""".r

allows any characters between the `` pairs.

How can I disallow a nested `` inside the literal and fix this problem?

Here are some docs on non-greedy matching:

http://www.exampledepot.com/egs/java.util.regex/Greedy.html

Basically, the match starts at the first `` and goes as far as it can while still matching, so it ends at the `` after World.

By putting a ? after your *, you tell it to do the shortest match possible instead of the longest match.
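To make the difference concrete, here is a minimal sketch using only scala.util.matching.Regex from the standard library. The object name GreedyVsReluctant and its fields are made up for illustration, and the patterns leave out the \B anchors from the question to keep the comparison simple.

```scala
object GreedyVsReluctant {
  val input = "This is a sytax test ``code here`` . Hello ``World``"

  // Greedy: .* consumes as much as it can, so a single match
  // runs from the first `` to the very last ``.
  val greedy = """``.*``""".r

  // Reluctant: .*? consumes as little as it can, so each match
  // ends at the nearest closing ``.
  val reluctant = """``.*?``""".r

  val greedyMatches: List[String] = greedy.findAllIn(input).toList
  val reluctantMatches: List[String] = reluctant.findAllIn(input).toList

  def main(args: Array[String]): Unit = {
    println(greedyMatches)    // one match spanning both literals
    println(reluctantMatches) // two separate matches
  }
}
```

The greedy pattern returns one match covering everything between the first and last ``, which is the behaviour described in the question; the reluctant pattern returns the two literals separately.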

Another option is to use [^`]* (anything EXCEPT `), which will force it to stop earlier.
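A rough sketch of the negated-character-class variant, again with plain scala.util.matching.Regex rather than RegexParsers. NegatedClassDemo and toHtml are hypothetical names chosen for this example, not part of the original code.

```scala
import scala.util.matching.Regex

object NegatedClassDemo {
  // [^`]* matches any run of characters that are not backticks,
  // so a match cannot extend past the next closing ``.
  val literal: Regex = """``([^`]*)``""".r

  // Wrap the captured text in <code> tags. quoteReplacement guards
  // against '$' or '\' in the captured text being read as group references.
  def toHtml(s: String): String =
    literal.replaceAllIn(
      s,
      m => Regex.quoteReplacement("<code>" + m.group(1) + "</code>"))

  def main(args: Array[String]): Unit =
    println(toHtml("This is a sytax test ``code here`` . Hello ``World``"))
}
```

Two differences from the .*? version are worth keeping in mind: [^`] also matches newlines, and a literal that contains a single backtick will not match at all.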

After some trial and error, I found that the following regex seems to work:

"""``(.)*?``"""

I don't know much about regex parsers, but you can use a simple one-liner:

def addTags(s: String) =
  """(``.*?``)""".r replaceAllIn (
                    s, m => "<code>" + m.group(0).replace("``", "") + "</code>")

Test:

scala> addTags("This is a sytax test ``code here`` . Hello ``World``")
res0: String = This is a sytax test <code>code here</code> . Hello <code>World</code>
