Parsing Quotes using a SLR/LR/LALR Parser

I'm trying to parse a list that is separated by spaces, but may include quotes that I have to treat as literals. So I tried to write a grammar and parse it using my favorite parsing algorithm, but I can't seem to get the grammar right.

The particular thing that makes it tricky is that I have to handle the " "" " case, which should be interpreted as one string with two quotes, whereas "" "" should be two empty strings.

To make the problem worse, I have to handle single quotes ' ' and comments bracketed by * * . Things like ' * ' " * ' " are allowed, which should parse to * and * ' .

Is this just outright impossible or is there a grammar to do it?

The best try I've managed to come up with ( _ denotes space):

start -> argv $
argv -> argv _ term | term
term -> "" | '' | ** | "dqexpr" | 'sqexpr' | *comment* | expr
expr -> string without ", ', *, or _
dqexpr -> string without "
sqexpr -> string without '
comment -> string without *

But I can't make this work on " "" " with an LR(1)/SLR(1) parser.

The grammar I attempted, for the simple case with no comments and only one quote type:

START -> ARGV $
ARGV -> ARGV _ TERM
ARGV -> TERM
TERM -> q STRING q
TERM -> FREE
STRING -> STRING CHAR
CHAR -> ''
CHAR -> q
CHAR -> c
FREE -> FREE c
FREE -> c

Here, '' is epsilon; q represents a quote, _ a space, and c any other character. The grammar can be tried out using the on-line tool at http://jsmachines.sourceforge.net/machines/slr.html

The STRING non-terminal in your attempted grammar is useless (that is, it cannot derive any string of terminals) because it has no non-recursive production. So a parser generator should discard it, along with the TERM -> q STRING q production. (Ideally, an error message to that effect would be generated.) If that were fixed, the production CHAR -> '' would generate an ambiguity, because a STRING can be any number of CHARs, and you cannot tell how many epsilons there are in an empty string. Ideally, the parser generator would provide meaningful error messages although, as can be seen, not all do so.

That can be fixed by simply changing CHAR -> '' to STRING -> '' , which gives STRING a non-recursive production and also resolves the conflict generated by the ambiguity of concatenating two epsilons. What remains is the production CHAR -> q , which allows a quoted STRING to contain a q ; that contradicts the description in the pseudocode ("dqexpr -> string without " ").
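The claim that STRING is useless can be checked mechanically. The following is a minimal sketch, not what any particular parser generator does internally: it computes the set of non-terminals that can derive some string of terminals by iterating to a fixed point. The function name `productive_nonterminals` and the dictionary encoding of the grammar are assumptions made for illustration.

```python
def productive_nonterminals(grammar):
    """Return the non-terminals that can derive some string of terminals.

    `grammar` maps each non-terminal to a list of right-hand sides; each
    right-hand side is a list of symbols, and [] stands for epsilon.
    Any symbol that is not a key of `grammar` is treated as a terminal.
    """
    productive = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in productive:
                continue
            for body in bodies:
                # A body is usable once every symbol in it is a terminal
                # or an already-productive non-terminal.
                if all(s not in grammar or s in productive for s in body):
                    productive.add(head)
                    changed = True
                    break
    return productive


# The attempted grammar from the question ([] is the epsilon written as '').
attempted = {
    "START": [["ARGV", "$"]],
    "ARGV": [["ARGV", "_", "TERM"], ["TERM"]],
    "TERM": [["q", "STRING", "q"], ["FREE"]],
    "STRING": [["STRING", "CHAR"]],
    "CHAR": [[], ["q"], ["c"]],
    "FREE": [["FREE", "c"], ["c"]],
}

# The same grammar with CHAR -> '' replaced by STRING -> ''.
fixed = dict(attempted)
fixed["STRING"] = [["STRING", "CHAR"], []]
fixed["CHAR"] = [["q"], ["c"]]

useless_before = set(attempted) - productive_nonterminals(attempted)
useless_after = set(fixed) - productive_nonterminals(fixed)
print(sorted(useless_before))  # non-terminals with no terminal derivation
print(sorted(useless_after))
```

In the attempted grammar only STRING is reported; once STRING -> '' is added, every non-terminal becomes productive. The check says nothing about ambiguity or conflicts, which still need the parser generator.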

If the intention is to allow a term to be a concatenation of quoted strings, so that " "" " is valid (without entering into its semantics), that can be done by adding another iteration non-terminal:

QTERMS -> QTERM
QTERMS -> QTERMS QTERM
QTERM  -> q STRING q

and changing TERM -> q STRING q to TERM -> QTERMS (so that the new iteration non-terminal is actually used).

My suspicion is that what is wanted is a simplified form of shell word processing, in which a "word" can be a concatenation of any number of terms, so that not only is " "" " a legal word, but so are " "' ' and "x"foo'y' . This makes it possible to include both single and double quotes in the same word: "'"'"' . If we also make the assumption that a comment is equivalent to white-space, we end up with the following grammar:

START -> ARGV $
ARGV -> WORD
ARGV -> ARGV WHITES WORD
WHITES -> WHITE
WHITES -> WHITES WHITE
WHITE -> _
WHITE -> star CSTRING star
WORD -> TERM
WORD -> WORD TERM
TERM -> c
TERM -> squote SQSTRING squote
TERM -> dquote DQSTRING dquote
SQSTRING -> ''
SQSTRING -> SQSTRING c
SQSTRING -> SQSTRING _
SQSTRING -> SQSTRING star
SQSTRING -> SQSTRING dquote
DQSTRING -> ''
DQSTRING -> DQSTRING c
DQSTRING -> DQSTRING _
DQSTRING -> DQSTRING star
DQSTRING -> DQSTRING squote
CSTRING  -> ''
CSTRING  -> CSTRING c
CSTRING  -> CSTRING _
CSTRING  -> CSTRING squote
CSTRING  -> CSTRING dquote

The SLR tool you use successfully generates a parser using the above grammar.
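Since none of the string non-terminals nest, the language described by that grammar can also be recognized by a simple left-to-right scan. Below is a rough hand-written sketch that follows the same structure; it is not the parser produced by the SLR tool. The function name `split_words` is hypothetical, and several behaviours are assumptions that the grammar does not specify: quotes are stripped and the pieces of a word are concatenated (as a shell would do), leading and trailing white-space is tolerated, and only the space character counts as _ .

```python
def split_words(text):
    """Split `text` into words, following the structure of the grammar.

    Assumed semantics: quotes are removed and adjacent terms are
    concatenated into one word; comments behave like white-space.
    """
    words = []
    current = None  # pieces of the word being built, or None between words
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == " ":
            # WHITE -> _ : ends the current word, if any.
            if current is not None:
                words.append("".join(current))
                current = None
            i += 1
        elif ch == "*":
            # WHITE -> star CSTRING star : a comment acts as white-space.
            end = text.find("*", i + 1)
            if end == -1:
                raise ValueError("unterminated comment")
            if current is not None:
                words.append("".join(current))
                current = None
            i = end + 1
        elif ch in "'\"":
            # TERM -> squote SQSTRING squote | dquote DQSTRING dquote
            end = text.find(ch, i + 1)
            if end == -1:
                raise ValueError("unterminated quote")
            if current is None:
                current = []
            current.append(text[i + 1:end])
            i = end + 1
        else:
            # TERM -> c
            if current is None:
                current = []
            current.append(ch)
            i += 1
    if current is not None:
        words.append("".join(current))
    return words


print(split_words('"" ""'))
print(split_words('" "" "'))
print(split_words("' * ' \" * ' \""))
```

Under these assumptions "" "" gives two empty words and " "" " gives a single word made of two adjacent quoted pieces. Whether that single word should then be read as containing quote characters, as the question asks, is a semantic decision that neither the grammar nor this sketch makes.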

The grammar should actually be the same as matching parentheses, so you should be able to find lots of examples on parsing math expressions.

I'm not exactly sure what your example grammar is trying to accomplish, but including "", '', and ** in your term rule seems odd to me. Try changing it to something like:

expr -> string
expr -> ' string '
expr -> " string "
