Handling scope for single and double quote strings in ANTLR4

I am working with ANTLR4 and writing a grammar to handle single- and double-quoted strings. I am trying to use lexer modes to scope the strings, but that is not working for me; my grammar is listed below. Is this the right approach, or how can I properly parse these strings as tokens rather than as parser rules with context? Any insight?

An example:

'single quote that contain "a double quote 'that has another single quote'"'

Lexer grammar:

lexer grammar StringLexer;

fragment SQUOTE: '\'';

fragment QUOTE:  '"';

SQSTR_START: SQUOTE     -> pushMode(SQSTR_MODE);

DQSTR_START: QUOTE      -> pushMode(DQSTR_MODE);

CONTENTS: ~["\']+;

mode SQSTR_MODE;

SQSTR_END: (CONTENTS | DQSTR_START)+ SQUOTE -> popMode;

mode DQSTR_MODE;

DQSTR_END:(CONTENTS | SQSTR_START)+ QUOTE -> popMode;

Parser grammar:

parser grammar StringParser;
options { tokenVocab=StringLexer; }

start:
    dqstr | sqstr
;

dqstr:
 DQSTR_START DQSTR_END
 ;  

sqstr:
 SQSTR_START SQSTR_END
;

ADDENDUM: Thanks @Lucas Trzesniewski for the answer.

This is part of a grammar I am writing to parse a shell-like language. A script can span multiple lines, each of which may contain SQSTR and DQSTR strings. With the lexer rules provided in the answer, multiple lines of script get lumped together.

Happy-case example (this gets parsed correctly using the answer):

cmd 'single quote string'
cmd2 "double quote"
cmd3 'another single quote' 

This gets recognized as three commands and three strings (single- and double-quoted).

Unparsed example (note the double quote inside the single-quoted strings):

cmd 'single "quote string'
cmd2 "double quote"
cmd3 'another "single quote' 

In this case, all of them are incorrectly detected as a single string token of type SQSTR.

Any ideas on how to address this problem?

If you want to parse your example string as a single token, you don't necessarily have to use lexer modes; you can use mutually recursive lexer rules instead:

SQSTR : '\'' (~['"] | DQSTR)* '\'';
DQSTR : '"'  (~['"] | SQSTR)* '"';

Then, in the parser, use something like:

str : SQSTR | DQSTR;
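
To see how these two rules behave, here is a rough Python model of them. It is a sketch, not ANTLR itself: the function name match_str and the variables nested and script are made up for illustration, and the model assumes that `~['"]` accepts any non-quote character, newlines included. It shows both that the nested example from the question comes out as one token and why the second script in the addendum collapses into a single SQSTR.

```python
def match_str(text, pos, quote):
    """Model of SQSTR (quote="'") and DQSTR (quote='"').

    Returns the index just past the closing quote, or None if the
    rule does not match at text[pos].
    """
    if pos >= len(text) or text[pos] != quote:
        return None
    other = '"' if quote == "'" else "'"
    i = pos + 1
    while i < len(text):
        if text[i] == quote:      # closing delimiter of this string
            return i + 1
        if text[i] == other:      # must be a complete nested string of the other kind
            i = match_str(text, i, other)
            if i is None:
                return None
        else:                     # ~['"]: any non-quote character, newlines too
            i += 1
    return None


# The nested example from the question is consumed as one token.
nested = "'single quote that contain \"a double quote 'that has another single quote'\"'"
print(match_str(nested, 0, "'") == len(nested))

# The second script from the addendum: the lone double quote on line 1
# opens a nested string that is only closed on line 3.
script = (
    "cmd 'single \"quote string'\n"
    "cmd2 \"double quote\"\n"
    "cmd3 'another \"single quote'"
)
start = script.index("'")
end = match_str(script, start, "'")
print(repr(script[start:end]))
```

In this model every quote character inside a string has to open a complete nested string of the other kind, so an unbalanced `"` inside single quotes keeps pulling in text until the quotes balance again, even across line breaks.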

What you have in mind is way too complicated. Where did you see such a solution before? (Almost) all grammars in the grammar repository on GitHub that have such rules use a simple approach that works well: an introducer, content, and terminator, all in one rule, e.g.:

SQSTRING: '\'' .*? '\'';
DQSTRING: '"' .*? '"';

The same applies to all other elements with that kind of structure (single-quoted strings, backtick-quoted strings, multiline comments, etc.).
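
As a rough illustration of what the non-greedy rules do, the Python regex below mimics SQSTRING and DQSTRING. This is an analogy under stated assumptions, not ANTLR output: STRING and strings are hypothetical names, `re.DOTALL` is used because `.` in an ANTLR lexer rule also matches newlines, and the rest of the script (commands, whitespace) is assumed to be handled by other lexer rules.

```python
import re

# Non-greedy analogue of:
#   SQSTRING: '\'' .*? '\'';
#   DQSTRING: '"' .*? '"';
STRING = re.compile(r"'.*?'|\".*?\"", re.DOTALL)


def strings(text):
    """Return every quoted string found in text, scanning left to right."""
    return [m.group(0) for m in STRING.finditer(text)]


# The script from the addendum that was lumped into one SQSTR.
script = (
    "cmd 'single \"quote string'\n"
    "cmd2 \"double quote\"\n"
    "cmd3 'another \"single quote'\n"
)
for s in strings(script):
    print(s)

# The nested example from the question is no longer a single token:
# the first string ends at the next single quote.
nested = "'single quote that contain \"a double quote 'that has another single quote'\"'"
print(strings(nested))
```

Each string ends at the first matching closing quote, so a quote of the other kind inside a string is plain content. That yields three separate strings for the script, at the cost of no longer treating the fully nested example as one token.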
