Antlr4 regular expression grammar, data structure for NFA transition table

I apologize for the extremely long explanation, but I have been stuck for a month and I really can't figure out how to solve this. As a project, I have to build, with ANTLR4, a compiler for a regex grammar. The compiler takes a regex as input and generates a Java program that decides whether a word belongs to the language generated by that regex. The grammar that we have to use is this one:

RE ::= union | simpleRE
union ::= simpleRE + RE
simpleRE ::= concatenation | basicRE
concatenation ::= basicRE simpleRE
basicRE ::= group | any | char
group ::= (RE) | (RE)∗ | (RE)+
any ::= ?
char ::= a | b | c | ··· | z | A | B | C | ··· | Z | 0 | 1 | 2 | ··· | 9 | . | − | _

From that, I wrote this grammar for ANTLR4:

Regexp.g4

grammar Regxp;

start_rule              
    : re                            # start
    ;

re
    :    union                      
    | simpleRE                      
    ;

union 
    :    simpleRE '+' re            # unionOfREs
    ;

simpleRE
    :    concatenation                      
    | basicRE                               
    ;

concatenation
    :    basicRE simpleRE                   #concatOfREs
    ;

basicRE
    :    group                      
    | any                               
    | cHAR                              
    ;


group
    :  LPAREN re RPAREN '*'             # star
    |  LPAREN re RPAREN '+'             # plus
    |  LPAREN re RPAREN                 # singleWithParenthesis
    ;


any
    :   '?'                             
    ;


cHAR
    : CHAR              #singleChar
    ;

WS : [ \t\r\n]+ -> skip ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : LETTER | DIGIT | DOT | D | UNDERSCORE
    ;
/* tokens */
fragment LETTER:    [a-zA-Z]
    ;
fragment DIGIT: [0-9]
    ;
fragment DOT:  '.'
    ;
fragment D:  '-'
    ;
fragment UNDERSCORE: '_'
    ;

Then I generated the Java files from ANTLR4 with visitors. As far as I understand the logic of the project, while the visitor traverses the parse tree it has to generate lines of code that fill the transition table of the NFA obtained by applying Thompson's rules to the input regexp. These lines of code are then saved as a .java text file and compiled into a program that takes a string (word) as input and tells whether or not the word belongs to the language generated by the regex. The result should look like this:

RE      word     Result
a+b     a        OK
        b        OK
        ac       KO

a∗b     aab      OK
        b        OK
        aaaab    OK
        abb      KO

So I'm asking: how can I represent the transition table so that it can be filled while visiting the parse tree and then exported for use by a simple Java program implementing the acceptance algorithm for an NFA? (I'm considering this pseudo-code):

S = ε−closure(s0);
c = nextChar();
while (c ≠ eof) do
S = ε−closure(move(S,c));
c = nextChar();
end while
if (S ∩ F ≠ ∅) then return “yes”;
else return “no”;
end if
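
For reference, the pseudo-code above can be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name TableNfa, its methods, and the map-based layout are hypothetical and not part of the project, and the '?' (any character) rule is not handled. The example NFA is the 6-state Thompson automaton for a+b.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical table-driven NFA; names are illustrative only. */
final class TableNfa {
    // charEdges.get(state).get(c) = states reachable from `state` on character c
    private final Map<Integer, Map<Character, Set<Integer>>> charEdges = new HashMap<>();
    // epsEdges.get(state) = states reachable from `state` on one epsilon edge
    private final Map<Integer, Set<Integer>> epsEdges = new HashMap<>();
    private final int start;
    private final Set<Integer> accepting = new HashSet<>();

    TableNfa(int start, int... acceptingStates) {
        this.start = start;
        for (int s : acceptingStates) accepting.add(s);
    }

    void addEdge(int from, char c, int to) {
        charEdges.computeIfAbsent(from, k -> new HashMap<>())
                 .computeIfAbsent(c, k -> new HashSet<>()).add(to);
    }

    void addEpsilon(int from, int to) {
        epsEdges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    /** epsilon-closure(S): everything reachable from S using only epsilon edges. */
    private Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : epsEdges.getOrDefault(s, Collections.<Integer>emptySet())) {
                if (result.add(t)) work.push(t);
            }
        }
        return result;
    }

    /** move(S, c): states reachable from S by consuming character c. */
    private Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> result = new HashSet<>();
        for (int s : states) {
            Map<Character, Set<Integer>> row = charEdges.get(s);
            if (row != null && row.containsKey(c)) result.addAll(row.get(c));
        }
        return result;
    }

    boolean accepts(String word) {
        Set<Integer> current = closure(Collections.singleton(start));
        for (char c : word.toCharArray()) {
            current = closure(move(current, c));
        }
        for (int s : current) {
            if (accepting.contains(s)) return true; // S and F intersect
        }
        return false;
    }

    /** Thompson NFA for a+b: 0 -eps-> 1,2; 1 -a-> 3; 2 -b-> 4; 3,4 -eps-> 5. */
    static TableNfa unionOfAAndB() {
        TableNfa nfa = new TableNfa(0, 5);
        nfa.addEpsilon(0, 1);
        nfa.addEpsilon(0, 2);
        nfa.addEdge(1, 'a', 3);
        nfa.addEdge(2, 'b', 4);
        nfa.addEpsilon(3, 5);
        nfa.addEpsilon(4, 5);
        return nfa;
    }

    public static void main(String[] args) {
        TableNfa nfa = unionOfAAndB();
        for (String w : new String[] {"a", "b", "ac"}) {
            System.out.println(w + " -> " + (nfa.accepts(w) ? "OK" : "KO"));
        }
    }
}
```

With this layout the generated program would only need a sequence of addEdge/addEpsilon calls followed by a call to accepts; the table never has to be rewritten once a row is added.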

So far I have managed to make the visitor, when it is in the unionOfREs rule for example, do something like this:

MyVisitor.java

private List<String> generatedCode = new ArrayList<String>();

/* ... */
@Override 
public String visitUnionOfREs(RegxpParser.UnionOfREsContext ctx) { 
    System.out.println("unionOfRExps");
    String char1 = visit(ctx.simpleRE());
    String char2 = visit(ctx.re());
    generatedCode.add("tTable.addUnion("+char1+","+char2+");");
    //then this line of code will populate the transition table
    return char1+"+"+char2;
}
/* ... */

The addUnion method is inside a Java file that will contain all the methods to fill the transition table. I wrote code for the union, but I don't like it because it amounts to writing out the NFA's transition table by hand, as you would on paper: example. I arrived at this when I noticed that, when building the table iteratively, you can define two "pointers" on the table, currentBeginning and currentEnd, that tell you where to expand the symbol already written in the table when the visitor reaches the next rule in the parse tree, because that symbol can be another production or just a single character. The link shows the written-on-paper example that convinced me to use this approach.

TransitionTable.java

/* ... */
public void addUnion(String char1, String char2) {
    // Arrays.asList(null) passes a null array and throws a NullPointerException,
    // so an empty list stands for "no transition" (needs java.util.Collections).
    List<Integer> none = Collections.emptyList();
    if (transitionTable.isEmpty()) {
        // one entry per state; a union uses 6 states
        List<List<Integer>> lc1 = Arrays.asList(none
                ,Arrays.asList(currentBeginning+3)
                ,none
                ,none
                ,none
                ,none);
        List<List<Integer>> lc2 = Arrays.asList(none
                ,none
                ,Arrays.asList(currentBeginning+4)
                ,none
                ,none
                ,none);
        List<List<Integer>> le = Arrays.asList(Arrays.asList(currentBeginning+1,currentBeginning+2)
                ,none
                ,none
                ,Arrays.asList(currentBeginning+5)
                ,Arrays.asList(currentBeginning+5)
                ,none);

        transitionTable.put(char1, lc1);
        transitionTable.put(char2, lc2);
        transitionTable.put("epsilon", le);
        //currentBeginning += 2;
        //currentEnd = transitionTable.get(char2).get(currentBeginning).get(0);
        currentEnd = transitionTable.get("epsilon").size()-1; // i.e. 5
    } else { //not the first time it encounters this rule, beginning and end changed
        //needs to add 2 less states
    }
}
/* ... */

At the moment I'm representing the transition table as HashMap<String, List<List<Integer>>>: the strings are the characters on the edges of the NFA, and List<List<Integer>> is used because, being non-deterministic, the NFA needs to represent multiple transitions from a single state. But going this way, for a parse tree like this I will obtain this line of code for the union: "tTable.addUnion("tTable.addConcat(a,b)","+char2+");"

And I'm blocked here: I don't know how to solve this, and I really can't think of a different way to represent the transition table or to fill it while visiting the parse tree.

Thank you.

Using Thompson's construction, every regular (sub-)expression produces an NFA, and every regular expression operator (union, cat, *) can be implemented by adding a couple of states and connecting them to states that already exist. See:

https://en.wikipedia.org/wiki/Thompson%27s_construction

So, when parsing the regex, every terminal or non-terminal production should add the required states and transitions to the NFA and return its start and end states to the containing production. Non-terminal productions combine their children and return their own start and end states, so that your NFA is built from the leaves of the regular expression up.
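
As a rough sketch of that idea (not a definitive implementation), each operation below appends states and edges and returns a start/end pair. The names ThompsonBuilder, Frag, Edge and the use of '\0' as the epsilon label are assumptions made for illustration. In the visitor, each visit method could return such a Frag instead of a String, so the parent rule never needs to know what its children expanded to.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of bottom-up Thompson construction; all names here are hypothetical. */
final class ThompsonBuilder {
    static final char EPS = '\0'; // marker for an epsilon edge (an assumption)

    /** One edge of the NFA: from --label--> to. */
    static final class Edge {
        final int from; final char label; final int to;
        Edge(int from, char label, int to) { this.from = from; this.label = label; this.to = to; }
    }

    /** What every production returns: start and end state of its sub-NFA. */
    static final class Frag {
        final int start, end;
        Frag(int start, int end) { this.start = start; this.end = end; }
    }

    final List<Edge> edges = new ArrayList<>(); // append-only, never modified
    int stateCount = 0;

    private int newState() { return stateCount++; }
    private void edge(int from, char label, int to) { edges.add(new Edge(from, label, to)); }

    /** Single character: two new states joined by one labelled edge. */
    Frag ch(char c) {
        Frag f = new Frag(newState(), newState());
        edge(f.start, c, f.end);
        return f;
    }

    /** Concatenation: link the end of a to the start of b. */
    Frag concat(Frag a, Frag b) {
        edge(a.end, EPS, b.start);
        return new Frag(a.start, b.end);
    }

    /** Union: new start branches to both children, both children join a new end. */
    Frag union(Frag a, Frag b) {
        Frag f = new Frag(newState(), newState());
        edge(f.start, EPS, a.start);
        edge(f.start, EPS, b.start);
        edge(a.end, EPS, f.end);
        edge(b.end, EPS, f.end);
        return f;
    }

    /** (RE)+ : one or more repetitions. */
    Frag plus(Frag a) {
        Frag f = new Frag(newState(), newState());
        edge(f.start, EPS, a.start);
        edge(a.end, EPS, a.start); // loop back for another repetition
        edge(a.end, EPS, f.end);
        return f;
    }

    /** (RE)* : like plus, with an extra edge that skips the body. */
    Frag star(Frag a) {
        Frag f = plus(a);
        edge(f.start, EPS, f.end);
        return f;
    }

    public static void main(String[] args) {
        ThompsonBuilder b = new ThompsonBuilder();
        Frag re = b.concat(b.star(b.ch('a')), b.ch('b')); // (a)*b
        for (Edge e : b.edges) {
            String label = e.label == EPS ? "eps" : String.valueOf(e.label);
            System.out.println(e.from + " --" + label + "--> " + e.to);
        }
        System.out.println("start=" + re.start + " end=" + re.end);
    }
}
```

Because the edge list is only ever appended to, the visitor could just as well emit one line of generated code per edge instead of storing it.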

The representation of the state table is not critical for building. Thompson's construction will never require you to modify a state or transition that you built before, so you just need to be able to add new ones. You will also never need more than one transition from a state on the same character, or even more than one non-epsilon transition. In fact, if all your operators are binary, you will never need more than 2 transitions on a state. Usually the representation is designed to make the next steps easy, like DFA generation or direct execution of the NFA against strings.

For example, a class like this can completely represent a state:

class State
{
    public char matchChar;
    public State matchState; //where to go if you match matchChar, or null
    public State epsilon1; //or null
    public State epsilon2; //or null
}

This would actually be a pretty reasonable representation for directly executing an NFA. But if you already have code for directly executing an NFA, then you should probably just build whatever it uses so you don't have to do another transformation.
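
To illustrate direct execution with such a State class, here is a minimal sketch. The wrapper class StateNfaDemo, the helper names, and the hand-wired NFA for (a)*b are assumptions made for the example; a state with a null matchState is treated as having no character transition, and the '?' wildcard would need an extra flag that is not shown.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical demo of running an NFA made of State objects. */
final class StateNfaDemo {
    /** Same shape as the State class above. */
    static final class State {
        char matchChar;
        State matchState; // where to go if you match matchChar, or null
        State epsilon1;   // or null
        State epsilon2;   // or null
    }

    /** Adds s and everything reachable from it through epsilon edges to `set`. */
    static void addWithClosure(State s, Set<State> set) {
        Deque<State> work = new ArrayDeque<>();
        if (s != null && set.add(s)) work.push(s);
        while (!work.isEmpty()) {
            State cur = work.pop();
            if (cur.epsilon1 != null && set.add(cur.epsilon1)) work.push(cur.epsilon1);
            if (cur.epsilon2 != null && set.add(cur.epsilon2)) work.push(cur.epsilon2);
        }
    }

    /** Tracks the set of live states while consuming the word one character at a time. */
    static boolean accepts(State start, State accept, String word) {
        Set<State> current = new HashSet<>();
        addWithClosure(start, current);
        for (char c : word.toCharArray()) {
            Set<State> next = new HashSet<>();
            for (State s : current) {
                if (s.matchState != null && s.matchChar == c) {
                    addWithClosure(s.matchState, next);
                }
            }
            current = next;
        }
        return current.contains(accept);
    }

    /** Hand-wired NFA for (a)*b; returns {start, accept}. */
    static State[] starAThenB() {
        State[] s = new State[6];
        for (int i = 0; i < s.length; i++) s[i] = new State();
        s[0].epsilon1 = s[1]; s[0].epsilon2 = s[3]; // enter the loop or skip it
        s[1].matchChar = 'a'; s[1].matchState = s[2];
        s[2].epsilon1 = s[1]; s[2].epsilon2 = s[3]; // repeat or leave the loop
        s[3].epsilon1 = s[4];                       // concatenation link
        s[4].matchChar = 'b'; s[4].matchState = s[5];
        return new State[] { s[0], s[5] };
    }

    public static void main(String[] args) {
        State[] nfa = starAThenB();
        for (String w : new String[] {"aab", "b", "aaaab", "abb"}) {
            System.out.println(w + " -> " + (accepts(nfa[0], nfa[1], w) ? "OK" : "KO"));
        }
    }
}
```

States are compared by identity here, which is enough because each State object is created exactly once during construction.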
