A parser that accepts any string in Scala?

Question

I'm writing a Scala parser for the following grammar:

expr := "<" anyString ">" "<" anyString ">"
anyString := // any string

For example, "<foo> <bar>" is a valid string, as are "<http://www.example.com/example> <123>" and "<1> <_hello>".

So far, I have the following:

object MyParser extends JavaTokenParsers {

  override def skipWhitespace = false

  def expr: Parser[Any] = "<" ~ anyString ~ ">" ~ whiteSpace ~ "<" ~ anyString ~ ">"

  def anyString = ???

}

My questions are the following (I've included my suspected answers, but please confirm whether I'm correct):

How do I implement a regex parser which accepts any string? This must have an almost trivial answer, like def anyString = """\\a*""".r , where \\a is the symbol which represents any character (although \\a is probably not the droid I'm looking for).
If I set anyString to accept any string, will it stop before the > symbol, or will it run until the end of the string and fail? I believe it will run until the end of the string and fail, and then eventually find the > and consume up to there. This seems to result in a very inefficient parser, so any comments on this would be appreciated!
What if the string within < and > contains a > symbol (e.g. <fo>o> <bar> )? Will anyString consume until the first > or the last one? Is there any way to specify whether it consumes the least it can, or the most?
To fix the previous point, I'd like to forbid < and > in anyString . How do I write that?

Thank you!

Answer 1

I'm currently researching my own question, and I'll try to answer it myself here.

The Java Pattern documentation specifies that . matches any character (by default, any character except line terminators). Therefore, the regex which accepts any string would be:
```
 def anyString = ".*".r 
```
To accept any non-empty string, we can use ".+".r .
Regarding where anyString stops, consider the following toy example:
```
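import scala.util.parsing.combinator.JavaTokenParsers

object MyParser1 extends JavaTokenParsers {
  override def skipWhitespace = false // don't skip whitespace between parsers
  def expr = "<" ~ anyString ~ ">"
  def anyString = ".*".r
}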
```
Here, the string <> is rejected. To test this, use:
```
 println( MyParser1.parseAll(MyParser1.expr, "<>") ) 
```
This indicates that the .* parser consumes everything up to the end of the string, so the > is no longer available for the final parser. Therefore, it seems necessary to forbid < and > from appearing in anyString .
Regarding a > inside the brackets: as in the previous point, the .* parser consumes the whole string, and therefore consumes all > symbols.
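The last two points can be reproduced without the parser library. As far as I can tell, a regex parser in `RegexParsers` matches its pattern as a prefix of the remaining input and does not hand characters back to the parsers that follow it. The sketch below imitates that step with `Regex.findPrefixOf` from the standard library; the object `PrefixMatchDemo` and its `consumed` helper are made up for illustration and are not part of the parser above.

```scala
import scala.util.matching.Regex

// Hypothetical demo object, not part of MyParser1.
object PrefixMatchDemo {
  val greedy: Regex    = ".*".r  // takes as much as it can
  val reluctant: Regex = ".*?".r // takes as little as it can

  // What `pattern` would consume from the start of `rest`, if it matches at all.
  def consumed(pattern: Regex, rest: String): Option[String] =
    pattern.findPrefixOf(rest)

  def main(args: Array[String]): Unit = {
    // After "<" has been consumed from "<fo>o> <bar>", this is what remains.
    val rest = "fo>o> <bar>"
    println(consumed(greedy, rest))    // the whole remainder
    println(consumed(reluctant, rest)) // the empty string
  }
}
```

The greedy pattern swallows every remaining character, including both > symbols, while the reluctant `.*?` settles for the empty string. In both cases the literal ">" parser that comes next has nothing suitable to match, so switching between greedy and reluctant quantifiers does not seem to help here.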
To forbid < and > : the same documentation describes negated character classes. To exclude < and > , we can write:
```
 def almostAnyString = "[^<>]*".r 
```
In general, the construct [^abc] will match any character except a , b , and c .
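To see where the negated class stops, here is a small standard-library sketch. It again uses `Regex.findPrefixOf` to stand in for what a regex parser would consume; `NegatedClassDemo` and `consumed` are hypothetical names chosen for this example.

```scala
import scala.util.matching.Regex

// Hypothetical demo object showing where "[^<>]*" stops.
object NegatedClassDemo {
  val almostAnyString: Regex = "[^<>]*".r

  // The prefix of `rest` made up only of characters other than < and >.
  def consumed(rest: String): Option[String] =
    almostAnyString.findPrefixOf(rest)

  def main(args: Array[String]): Unit = {
    println(consumed("foo> <bar>"))  // stops right before the first >
    println(consumed("fo>o> <bar>")) // also stops at the first >
    println(consumed("> <bar>"))     // matches the empty string
  }
}
```

Because the pattern can never run past a > , the closing ">" parser always finds its symbol still in the input.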

To conclude, the best implementation I've found so far is the following:

object MyParser extends JavaTokenParsers {
  override def skipWhitespace = false // don't allow whitespace between parsers by default

  def expr: Parser[Any] = "<" ~ almostAnyString ~ ">" ~
                          whiteSpace ~ // this parser is defined in JavaTokenParsers
                          "<" ~ almostAnyString ~ ">"

  def almostAnyString = "[^<>]*".r

}
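For completeness, here is a sketch of how such a parser might be exercised. It is a variant of the parser above, not the same code: `MyParserDemo`, `bracketed` and `parseExpr` are names introduced for this example, and it assumes the scala-parser-combinators module is on the classpath (it has been a separate library since Scala 2.11). It also returns the two bracketed strings as a pair rather than `Parser[Any]`.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical variant of MyParser that returns the two bracketed strings.
object MyParserDemo extends JavaTokenParsers {
  override def skipWhitespace = false // whitespace must be matched explicitly

  def almostAnyString: Parser[String] = "[^<>]*".r

  // ~> and <~ keep only the result on the side the arrow points to.
  def bracketed: Parser[String] = "<" ~> almostAnyString <~ ">"

  def expr: Parser[(String, String)] =
    bracketed ~ (whiteSpace ~> bracketed) ^^ { case a ~ b => (a, b) }

  // Some(pair) when the whole input matches the grammar, None otherwise.
  def parseExpr(input: String): Option[(String, String)] =
    parseAll(expr, input) match {
      case Success(result, _) => Some(result)
      case _                  => None
    }

  def main(args: Array[String]): Unit = {
    println(parseExpr("<foo> <bar>"))   // accepted
    println(parseExpr("<fo>o> <bar>"))  // rejected: > is not allowed inside the brackets
  }
}
```

With this grammar, an input such as <fo>o> <bar> is rejected rather than parsed ambiguously, which is the behaviour asked for in the last question.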

Answer 1 was accepted on 2014-02-28 13:02:50.

可以在Scala中接受任何字符串的解析器吗？

问题描述

1 个解决方案

解决方案1 1 已采纳 2014-02-28 13:02:50

解决方案1
1 已采纳 2014-02-28 13:02:50