A parser which accepts any string in Scala?
I'm writing a Scala parser for the following grammar:
expr := "<" anyString ">" "<" anyString ">"
anyString := // any string
For example, "<foo> <bar>", "<http://www.example.com/example> <123>", and "<1> <_hello>" are all valid strings.
So far, I have the following:
import scala.util.parsing.combinator.JavaTokenParsers

object MyParser extends JavaTokenParsers {
  override def skipWhitespace = false
  def expr: Parser[Any] = "<" ~ anyString ~ ">" ~ whiteSpace ~ "<" ~ anyString ~ ">"
  def anyString = ???
}
My questions are the following (I've included my suspected answers, but please confirm whether I'm correct):
How do I implement a regex parser which accepts any string? This must have an almost trivial answer, like
def anyString = """\\a*""".r
where \\a is the symbol which represents any character (although \\a is probably not the droid I'm looking for).
If I set anyString to accept any string, will it stop before the > symbol, or will it run until the end of the string and fail? I believe it will run until the end of the string and fail, and then it will eventually find the > and consume up to there. This seems to result in a very inefficient parser, and any comments on this would be appreciated!
What if the string within < and > contains a > symbol (e.g. <fo>o> <bar>)? Will anyString consume until the first > or the last one? Is there any way to specify whether it consumes the least it can, or the most?
In order to fix the previous point, I'd like to forbid < and > in anyString. How do I write that?
Thank you!
I'm currently researching my own question, and I'll try to answer it myself here.
The Java Pattern documentation specifies that the dot (.) matches any character (except line terminators, unless the DOTALL flag is set). Therefore, the regex which accepts any single-line string would be:
def anyString = ".*".r
To accept any non-empty string, we can use ".+".r instead.
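One caveat about "any string" is worth checking by hand: by default the dot does not match line terminators. The snippet below is a minimal sketch using only scala.util.matching.Regex from the standard library; the object name DotDemo and the sample input are made up for illustration. It compares ".*" with "(?s).*", where (?s) is the inline form of the DOTALL flag.

```scala
// Hypothetical demo object: checks how "." treats line terminators.
object DotDemo {
  val plain  = ".*".r      // "." stops at a line terminator
  val dotAll = "(?s).*".r  // (?s) turns on DOTALL, so "." also matches "\n"

  val input = "foo\nbar"

  // findPrefixOf returns the prefix of the input matched by the pattern.
  val plainPrefix: Option[String]  = plain.findPrefixOf(input)   // Some("foo")
  val dotAllPrefix: Option[String] = dotAll.findPrefixOf(input)  // Some("foo\nbar")
}
```

So if the content between the brackets may span several lines, "(?s).*" is probably closer to "any string" than ".*".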
To understand how far anyString consumes, consider the following toy example:
object MyParser1 extends JavaTokenParsers {
  override def skipWhitespace = false
  def expr = "<" ~ anyString ~ ">"
  def anyString = ".*".r
}
Here, the string <> is rejected. To test this, use:
println( MyParser1.parseAll(MyParser1.expr, "<>") )
This indicates that the .* parser consumes until the end of the string, so the > is no longer available for the final parser. The regex parser commits to that longest match, and the ~ combinator does not go back and try a shorter one, so the parse fails instead of eventually stopping at the >. Therefore, it seems necessary to forbid < and > from appearing in anyString.
As in the previous point, the .* parser consumes the whole string, and therefore consumes all > symbols, including the last one.
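On the "least or most" part of the question: Java regexes also have reluctant quantifiers (.*? instead of .*). The sketch below uses only the standard library's Regex.findPrefixOf, on the assumption that the regex parser takes whatever prefix the regex matches; the object name QuantifierDemo is made up for illustration. It suggests that a reluctant quantifier only helps when the closing > is part of the same regex, because a lone ".*?" is satisfied by the empty string.

```scala
// Hypothetical demo object: greedy vs. reluctant quantifiers on the tricky input.
object QuantifierDemo {
  val input = "<fo>o> <bar>"

  val greedy        = "<.*>".r   // runs to the end, then backs up to the last '>'
  val reluctant     = "<.*?>".r  // stops at the first '>'
  val loneReluctant = ".*?".r    // with nothing after it, the shortest match is ""

  val greedyMatch: Option[String]    = greedy.findPrefixOf(input)         // Some("<fo>o> <bar>")
  val reluctantMatch: Option[String] = reluctant.findPrefixOf(input)      // Some("<fo>")
  val loneMatch: Option[String]      = loneReluctant.findPrefixOf(input)  // Some("")
}
```

In other words, writing def anyString = ".*?".r and keeping ">" as a separate parser would most likely not give "consume the least it can"; the delimiter would have to be folded into the regex itself.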
In the same documentation, a negation operator for character classes is given. To exclude < and >, we can write:
def almostAnyString = "[^<>]*".r
In general, the construct [^abc] will match any character except a, b, and c.
To conclude, the best implementation I've found so far is the following:
import scala.util.parsing.combinator.JavaTokenParsers

object MyParser extends JavaTokenParsers {
  override def skipWhitespace = false // don't allow whitespace between parsers by default
  def expr: Parser[Any] = "<" ~ almostAnyString ~ ">" ~
    whiteSpace ~ // this parser is defined in JavaTokenParsers
    "<" ~ almostAnyString ~ ">"
  def almostAnyString = "[^<>]*".r
}
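For actually using the result, a Parser[Any] is a bit awkward. The following is a sketch of one possible variant, not the only way to do it: it assumes the scala-parser-combinators library is on the classpath, and the names PairParser, bracketed and parsePair are my own. It uses ~> and <~ to drop the delimiters and returns the two inner strings as a tuple.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical variant of the parser above that returns the two inner strings.
object PairParser extends JavaTokenParsers {
  override def skipWhitespace = false

  def almostAnyString: Parser[String] = "[^<>]*".r

  // "<" ~> p <~ ">" keeps only the result of p and discards the brackets.
  def bracketed: Parser[String] = "<" ~> almostAnyString <~ ">"

  // Two bracketed strings separated by mandatory whitespace.
  def expr: Parser[(String, String)] =
    bracketed ~ (whiteSpace ~> bracketed) ^^ { case a ~ b => (a, b) }

  // Returns the pair on success, or None if the input does not match.
  def parsePair(input: String): Option[(String, String)] =
    parseAll(expr, input) match {
      case Success(result, _) => Some(result)
      case _                  => None
    }
}
```

With this, PairParser.parsePair("<foo> <bar>") should give Some(("foo", "bar")), while "<fo>o> <bar>" is rejected because > is not allowed inside the brackets.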