
ANTLR4 - parsing regex literals in JavaScript grammar

I'm using ANTLR4 to generate a lexer for a JavaScript preprocessor (basically, it tokenizes a JavaScript file and extracts every string literal).

I started from a grammar originally written for ANTLR3 and ported the relevant parts (only the lexer rules) to v4.

Only one issue remains: I don't know how to handle corner cases for regex literals, like this one:

log(Math.round(v * 100) / 100 + ' msec/sample');

The lexer interprets / 100 + ' msec/ as a regex literal, because the regex rule is always active, even where a / can only be a division operator.
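To make the misread concrete, here is a minimal sketch that applies a deliberately naive pattern to the line above. The pattern is a simplified stand-in for the lexer rule (an assumption for illustration, not the grammar's actual rule), and the names naiveRegexLiteral and line are made up for this example.

```javascript
// Simplified stand-in for an always-active regex-literal rule:
// a slash, one or more characters that are neither slash nor newline, a slash.
const naiveRegexLiteral = /\/[^\/\n]+\//;

const line = "log(Math.round(v * 100) / 100 + ' msec/sample');";

// With no context check, the division operator is taken as the opening
// delimiter and the slash inside the string literal as the closing one.
const match = naiveRegexLiteral.exec(line);
console.log(match[0]); // prints: / 100 + ' msec/
```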

What I would like is to incorporate the following logic. It is C# code; I need the JavaScript equivalent, but I don't know how to adapt it:

    /// <summary>
    /// Indicates whether regular expression (yields true) or division expression recognition (false) in the lexer is enabled.
    /// These are mutually exclusive, and the decision about which one is active in the lexer is based on the previous on-channel token.
    /// When the previous token can be identified as a possible left operand for a division this results in false, otherwise true.
    /// </summary>
    private bool AreRegularExpressionsEnabled
    {
        get
        {
            if (Last == null)
            {
                return true;
            }

            switch (Last.Type)
            {
                // identifier
                case Identifier:
                // literals
                case NULL:
                case TRUE:
                case FALSE:
                case THIS:
                case OctalIntegerLiteral:
                case DecimalLiteral:
                case HexIntegerLiteral:
                case StringLiteral:
                // member access ending 
                case RBRACK:
                // function call or nested expression ending
                case RPAREN:
                    return false;

                // otherwise OK
                default:
                    return true;
            }
        }
    }

This property was used in the old grammar as an inline predicate, like this:

RegularExpressionLiteral
    : { AreRegularExpressionsEnabled }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;

But I don't know how to use this technique in ANTLR4.

The ANTLR4 book has some suggestions for solving this kind of problem at the parser level (section 12.2, context-sensitive lexical problems), but I don't want to use a parser. I just want to extract all the tokens, leave everything untouched except for the string literals, and keep parsing out of my way.

Any suggestions would be really appreciated, thanks!

I'm posting the final solution here. I developed it by adapting the existing solution to the new ANTLR4 syntax and accounting for the differences in JavaScript syntax.

I'm posting only the relevant parts, to give others an idea of a working strategy.

The rule was edited as follows:

RegularExpressionLiteral
    : DIV {this.isRegExEnabled()}? RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;

The function isRegExEnabled is defined in a @members section at the top of the lexer grammar, as follows:

@members {
// Override nextToken to remember the last token that is not on the hidden channel.
EcmaScriptLexer.prototype.nextToken = function() {
  // nextToken takes no arguments, so the base implementation is called with this only.
  var result = antlr4.Lexer.prototype.nextToken.call(this);
  if (result.channel !== antlr4.Lexer.HIDDEN) {
    this._Last = result;
  }

  return result;
};

// A regex literal may start here unless the previous significant token
// can be the left operand of a division.
EcmaScriptLexer.prototype.isRegExEnabled = function() {
  var la = this._Last ? this._Last.type : null;
  return la !== EcmaScriptLexer.Identifier &&
    la !== EcmaScriptLexer.NULL &&
    la !== EcmaScriptLexer.TRUE &&
    la !== EcmaScriptLexer.FALSE &&
    la !== EcmaScriptLexer.THIS &&
    la !== EcmaScriptLexer.OctalIntegerLiteral &&
    la !== EcmaScriptLexer.DecimalLiteral &&
    la !== EcmaScriptLexer.HexIntegerLiteral &&
    la !== EcmaScriptLexer.StringLiteral &&
    la !== EcmaScriptLexer.RBRACK &&
    la !== EcmaScriptLexer.RPAREN;
};
}

As you can see, two functions are defined. The first overrides the lexer's nextToken method: it wraps the existing nextToken and saves the last token that is not on the hidden channel (the last token that is neither a comment nor whitespace) for reference. The semantic predicate then calls isRegExEnabled, which checks whether the last significant token is compatible with a regex literal following it. If it is not, isRegExEnabled returns false and the rule does not match.
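The same last-token strategy can be sketched without ANTLR at all, which may help when experimenting with the list of token types. The tokenizer below is a hypothetical, heavily simplified illustration: the token names in T, the patterns in rules, and the tokenize function are all assumptions made for this sketch, not part of the grammar or of the ANTLR runtime.

```javascript
// Hypothetical, simplified token types; the real grammar has many more.
const T = {
  WS: 'WS', Identifier: 'Identifier', Number: 'Number', String: 'String',
  Regex: 'Regex', RParen: 'RParen', RBrack: 'RBrack', Punct: 'Punct',
};

// Token types after which a slash can only be a division operator.
// Keywords are lexed as Identifier here, so "return /re/" is misread;
// the real grammar avoids this by giving keywords their own token types.
const DIVISION_OPERANDS = new Set([
  T.Identifier, T.Number, T.String, T.RParen, T.RBrack,
]);

// Sticky patterns, tried in order at the current position.
const rules = [
  [T.WS, /\s+/y],
  [T.Identifier, /[A-Za-z_$][\w$]*/y],
  [T.Number, /\d+(?:\.\d+)?/y],
  [T.String, /'(?:[^'\\\n]|\\.)*'|"(?:[^"\\\n]|\\.)*"/y],
  [T.Regex, /\/(?:[^\/\\\n*]|\\.)(?:[^\/\\\n]|\\.)*\/[A-Za-z]*/y],
  [T.RParen, /\)/y],
  [T.RBrack, /\]/y],
  [T.Punct, /[\s\S]/y], // fallback: any single character
];

function tokenize(src) {
  const tokens = [];
  let last = null; // last significant (non-whitespace) token
  let pos = 0;
  while (pos < src.length) {
    for (const [type, re] of rules) {
      // Plays the role of the semantic predicate: skip the regex rule
      // when the previous token can be the left operand of a division.
      if (type === T.Regex && last !== null && DIVISION_OPERANDS.has(last.type)) {
        continue;
      }
      re.lastIndex = pos;
      const m = re.exec(src);
      if (m === null) continue;
      const tok = { type, text: m[0] };
      tokens.push(tok);
      if (type !== T.WS) last = tok; // same bookkeeping as the nextToken override
      pos = re.lastIndex;
      break;
    }
  }
  return tokens;
}

// The problematic line from the question: no Regex token is produced,
// and the string literal survives intact.
const strings = tokenize("log(Math.round(v * 100) / 100 + ' msec/sample');")
  .filter(t => t.type === T.String)
  .map(t => t.text);
console.log(strings);
```

In this sketch the check sits in front of the regex rule, much like the gated predicate in the old v3 grammar, while the ANTLR4 rule above evaluates the predicate after DIV has been matched. The decision is based on the same saved token in both cases.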

Thanks to Lucas Trzesniewski for the comment, which pointed me in the right direction, and to Patrick Hulsmeijer for the original work on v3.
