A parser which accepts any string in Scala?

I'm writing a Scala parser for the following grammar:

expr := "<" anyString ">" "<" anyString ">"
anyString := // any string

For example, "<foo> <bar>" is a valid string, as are "<http://www.example.com/example> <123>" and "<1> <_hello>".

So far, I have the following:

object MyParser extends JavaTokenParsers {

  override def skipWhitespace = false

  def expr: Parser[Any] = "<" ~ anyString ~ ">" ~ whiteSpace ~ "<" ~ anyString ~ ">"

  def anyString = ???

}

My questions are the following (I've included my suspected answers, but please confirm whether I'm correct):

  1. How do I implement a regex parser which accepts any string? This must have an almost trivial answer, like def anyString = """\\a*""".r , where \\a is the symbol which represents any character (although \\a is probably not the droid I'm looking for).

  2. If I set anyString to accept any string, will it stop before the > symbol, or will it run until the end of the string and fail? I believe it will run until the end of the string and fail, and then it will eventually find the > and consume up to there. This seems to result in a very inefficient parser, and any comments on this would be appreciated!

  3. What if the string within < and > contains a > symbol (e.g. <fo>o> <bar> )? Will anyString consume until the first > or the last one? Is there any way to specify whether it consumes the least it can, or the most?

  4. In order to fix the previous point, I'd like to forbid < and > in anyString . How do I write that?

Thank you!

I'm currently researching my own question, and I'll try to answer it myself here.

  1. The Java Pattern documentation specifies that . matches any character (by default, any character except line terminators). Therefore, a regex which accepts any single-line string would be:

     def anyString = ".*".r 

    To accept any non-empty string, we can use ".+".r .

  2. To understand this, consider the following toy example:

      object MyParser1 extends JavaTokenParsers {
        override def skipWhitespace = false
        def expr = "<" ~ anyString ~ ">"
        def anyString = ".*".r
      }

    Here, the string <> is rejected. To test this, use:

     println( MyParser1.parseAll(MyParser1.expr, "<>") ) 

    This indicates that the .* parser consumes input until the end of the string, so the > is no longer available for the final parser. The regex is greedy, and the parser combinators do not backtrack into it to try a shorter match. Therefore, it seems necessary to forbid < and > from appearing in anyString .

  3. As in the previous point, the .* parser consumes the whole string, and therefore consumes all > symbols.

  4. The same documentation describes negated character classes. To exclude < and > , we can write:

     def almostAnyString = "[^<>]*".r 

    In general, the construct [^abc] matches any character except a , b , and c .
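On the "least or most" part of point 3: Java regexes also offer reluctant quantifiers such as .*? . The following is a minimal sketch, using only scala.util.matching.Regex, of how the variants behave when matched as a prefix of the input, which is roughly how a regex parser is applied to the remaining input. The object name QuantifierDemo and the sample inputs are made up for illustration.

```scala
import scala.util.matching.Regex

// Hypothetical demo object; the names are illustrative, not from the original post.
object QuantifierDemo {
  val greedy: Regex    = ".*".r      // takes as much as it can
  val reluctant: Regex = ".*?".r     // takes as little as it can
  val noAngles: Regex  = "[^<>]*".r  // stops at the first < or >

  val inner = "fo>o> <bar>"

  // findPrefixOf matches only at the start of the input.
  val greedyPrefix: Option[String]    = greedy.findPrefixOf(inner)
  val reluctantPrefix: Option[String] = reluctant.findPrefixOf(inner)
  val noAnglesPrefix: Option[String]  = noAngles.findPrefixOf(inner)

  // A reluctant quantifier only helps when the closing delimiter
  // is part of the same regex.
  val whole = "<fo>o> <bar>"
  val greedyDelimited: Option[String]    = "<.*>".r.findPrefixOf(whole)
  val reluctantDelimited: Option[String] = "<.*?>".r.findPrefixOf(whole)
}
```

On its own, the reluctant .*? matches the empty string, because nothing after it in the same regex forces it to continue. Used as a standalone anyString parser it would therefore consume nothing, so the negated character class from point 4 looks like the more practical choice here.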

To conclude, the best implementation I've found so far is the following:

import scala.util.parsing.combinator.JavaTokenParsers

object MyParser extends JavaTokenParsers {
  override def skipWhitespace = false // don't allow whitespace between parsers by default

  def expr: Parser[Any] = "<" ~ almostAnyString ~ ">" ~
                          whiteSpace ~ // this parser is defined in JavaTokenParsers
                          "<" ~ almostAnyString ~ ">"

  def almostAnyString = "[^<>]*".r

}
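As a possible refinement, the parser can return the two strings directly instead of Parser[Any] . This is only a sketch: it assumes the scala-parser-combinators library is on the classpath, and the names AnglePairParser , bracketed , and parsePair are hypothetical, introduced here for illustration.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical variant of MyParser that extracts the two strings as a pair.
object AnglePairParser extends JavaTokenParsers {
  // Do not skip whitespace implicitly between parsers.
  override def skipWhitespace = false

  // Any run of characters other than < and > (possibly empty).
  def almostAnyString: Parser[String] = "[^<>]*".r

  // ~> and <~ drop the delimiters and keep only the inner string.
  def bracketed: Parser[String] = "<" ~> almostAnyString <~ ">"

  // Two bracketed strings separated by explicit whitespace.
  def expr: Parser[(String, String)] =
    (bracketed <~ whiteSpace) ~ bracketed ^^ { case a ~ b => (a, b) }

  // Returns the pair on success, None on any parse failure.
  def parsePair(input: String): Option[(String, String)] =
    parseAll(expr, input) match {
      case Success(pair, _) => Some(pair)
      case _                => None
    }
}
```

With this sketch, AnglePairParser.parsePair("<foo> <bar>") should yield the pair of foo and bar, while an input such as <fo>o> <bar> is rejected because > is not allowed inside the brackets.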
