
Match words with spaces as one token but disallow certain keyword tokens

I have the following token rules:

IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';

WORD: (DIGIT* (LOWERCASE | UPPERCASE | WORDSYMBOL)) (LOWERCASE | UPPERCASE | DIGIT | WORDSYMBOL)*;

This works, but something like "my variable" comes out as WORD WORD. I want to be able to have just one token that represents the whole thing.

I changed it to:


IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';

WORD: (LOWERCASE | UPPERCASE | WORDSYMBOL)+ (' '* (LOWERCASE | UPPERCASE | WORDSYMBOL))*;

This fixed that; however, it also captures character strings that I'd like classified as keyword tokens, as above.

For example, "if my variable then something" shouldn't just be a single WORD token; it should be IF WORD THEN WORD.

I understand why it's being tokenized as it is (tokens consuming more of the input are preferred), but am not sure how to change the behaviour.

Unfortunately (for what you'd like to do), that's not how ANTLR's tokenization works.

(This is more of a "logical" explanation than a description of the actual implementation.)

When ANTLR is evaluating Lexer rules, it will attempt to match each rule against the characters in your input stream, beginning at your current position in that input stream.

Once it has all of the input sequences that match, if one sequence is longer than the rest, it will choose the Token type that produces the longest token. This is where your WORD rule is going to consume input until it finds something that doesn't match as a character in a WORD (and that will include "slurping up" keywords if they match the WORD pattern).

(For completeness) If the Tokenizer finds more than one match of equal length, the first matching rule in your grammar determines the Token type assigned.
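The two selection rules above (longest match wins, then first rule wins on a tie) can be sketched outside ANTLR. This is only a Python simulation under stated assumptions, not ANTLR's implementation: the regular expressions are hand-written stand-ins for the question's rules, WORDSYMBOL is assumed to be an underscore, and the names `next_token` and `tokenize` are hypothetical.

```python
import re

# Keyword rules in grammar order (stand-ins for the question's lexer rules).
KEYWORDS = [
    ("IF", r"IF|if"),
    ("THEN", r"THEN|then"),
    ("ELSE", r"ELSE|else"),
    ("BINARYOPERATOR", r"AND|and|OR|or"),
    ("NOT", r"NOT|not"),
]

# First WORD rule from the question: no spaces allowed inside a WORD.
# Assumption: WORDSYMBOL is "_" (the question does not define it).
SINGLE_RULES = KEYWORDS + [("WORD", r"[0-9]*[A-Za-z_][A-Za-z0-9_]*")]

# Second WORD rule (simplified): spaces are allowed between word parts.
GREEDY_RULES = KEYWORDS + [("WORD", r"[A-Za-z_]+(?: +[A-Za-z_]+)*")]


def next_token(text, pos, rules):
    """Return (rule name, end index) of the winning rule at pos, or None."""
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Strict ">" keeps the earlier rule when two matches tie on length.
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best


def tokenize(text, rules):
    """Split text into (rule name, matched text) pairs, skipping whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        hit = next_token(text, pos, rules)
        if hit is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end = hit
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens


line = "if my variable then something"
print([name for name, _ in tokenize(line, GREEDY_RULES)])
print([name for name, _ in tokenize(line, SINGLE_RULES)])
```

With the space-allowing WORD pattern, WORD produces the longest match from position 0 and swallows the keywords. With the single-word pattern, "if" and "then" tie in length with WORD, so the keyword rules win because they are listed first.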


You might have success with the following approach:

Assumption: a WORD cannot be one of your language keywords.

  • Make sure that the WORD rule comes after all of your keyword rules so that they take priority.
  • Add a Parser rule word: WORD+;
  • Now use the word parser rule everywhere you would have used the WORD token.
  • Write a Listener that overrides enterWord() and merges all the WORDs into a single "word". (You could handle this step several ways, but this is one fairly simple approach.)
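The merge step in the last bullet can be sketched independently of ANTLR. This is a minimal Python illustration, not the listener itself: it assumes the lexer has already produced (type, text) pairs with keywords taking priority over single-word WORD tokens, and `merge_words` is a hypothetical helper that mimics what a word: WORD+ rule plus a listener would give you. Joining with a single space is also an assumption; the original spacing between words is not preserved here.

```python
from itertools import groupby


def merge_words(tokens):
    """Collapse each run of adjacent WORD tokens into one ("word", text) pair.

    tokens is a list of (type, text) pairs; non-WORD tokens pass through.
    """
    merged = []
    for kind, run in groupby(tokens, key=lambda tok: tok[0]):
        run = list(run)
        if kind == "WORD":
            # Assumption: words are re-joined with exactly one space.
            merged.append(("word", " ".join(text for _, text in run)))
        else:
            merged.extend(run)
    return merged


tokens = [
    ("IF", "if"),
    ("WORD", "my"),
    ("WORD", "variable"),
    ("THEN", "then"),
    ("WORD", "something"),
]
print(merge_words(tokens))
```

The keywords break up the runs of WORD tokens, which is why the approach only works under the assumption that a keyword can never appear inside a multi-word name.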

Caveats:

  • There's a reason that languages do not typically allow for this. I suspect you'll encounter other complications/ambiguities down the road.
  • Performance MAY be impacted, as ANTLR has to do more look-ahead to know when to backtrack.


 