Modelling a regular expression parser with polymorphism

So, I'm doing a regular expression parser for school that creates a hierarchy of objects in charge of the matching. I decided to do it object oriented because it's easier for me to imagine an implementation of the grammar that way. So, these are my classes making up the regular expressions. It's all in Java, but I think you can follow along if you're proficient in any object oriented language.

The only operators we're required to implement are Union (+), Kleene-Star (*), Concatenation of expressions (ab or maybe (a+b)c) and of course the Parenthesis, as illustrated in the example of Concatenation. This is what I've implemented right now and I've got it to work like a charm with a bit of overhead in the main.

The parent class, Regexp.java:

public abstract class Regexp {

    //Print out the regular expression it's holding
    //Used for debugging purposes
    abstract public void print();

    //Checks if the string matches the expression it's holding
    abstract public Boolean match(String text);

    //Adds a regular expression to be operated upon by the operators
    abstract public void add(Regexp regexp);

    /*
    *To help the main with the overhead to help it decide which regexp will
    *hold the other
    */
    abstract public Boolean isEmpty();

}

Then there's the simplest regexp, Base.java, which holds a char and returns true if the string matches the char.

public class Base extends Regexp{
    //Boxed type, since a primitive char cannot be null
    Character c;

    public Base(char c){
        this.c = c;
    }

    public Base(){
        c = null;
    }

    @Override
    public void print() {
        System.out.println(c);
    }

    //If the string is the char, return true
    @Override
    public Boolean match(String text) {
        if(text.length() > 1) return false;
        return text.startsWith(""+c);
    }

    //Not utilized, since base is only contained and cannot contain
    @Override
    public void add(Regexp regexp) {

    }

    @Override
    public Boolean isEmpty() {
        return c == null;
    }

}

A parenthesis, Paren.java, to hold a regexp inside it. Nothing really fancy here, but it illustrates how matching works.

public class Paren extends Regexp{
    //Member variables: What it's holding and if it's holding something
    private Regexp regexp;
    Boolean empty;

    //Parenthesis starts out empty
    public Paren(){
        empty = true;
    }

    //Unless you create it with something to hold
    public Paren(Regexp regexp){
        this.regexp = regexp;
        empty = false;
    }

    //Print out what it's holding
    @Override
    public void print() {
        regexp.print();
    }

    //Real simple; either what you're holding matches the string or it doesn't
    @Override
    public Boolean match(String text) {
        return regexp.match(text);
    }

    //Pass something for it to hold, then it's not empty
    @Override
    public void add(Regexp regexp) {
        this.regexp = regexp;
        empty = false;
    }

    //Return if it's holding something
    @Override
    public Boolean isEmpty() {
        return empty;
    }

}

A Union.java, which is two regexps that can be matched. If one of them is matched, the whole Union is a match.

public class Union extends Regexp{
    //Members
    Regexp lhs;
    Regexp rhs;

    //Indicating if there's room to push more stuff in
    private Boolean lhsEmpty;
    private Boolean rhsEmpty;

    public Union(){
        lhsEmpty = true;
        rhsEmpty = true;
    }

    //Can start out with something on the left side
    public Union(Regexp lhs){
        this.lhs = lhs;

        lhsEmpty = false;
        rhsEmpty = true;
    }

    //Or with both members set
    public Union(Regexp lhs, Regexp rhs) {
        this.lhs = lhs;
        this.rhs = rhs;

        lhsEmpty = false;
        rhsEmpty = false;
    }

    //Some stuff to help me see the unions format when I'm debugging
    @Override
    public void print() {
        System.out.println("(");
        lhs.print();
        System.out.println("union");
        rhs.print();
        System.out.println(")");

    }

    //If the string matches the left side or right side, it's a match
    @Override
    public Boolean match(String text) {
        if(lhs.match(text) || rhs.match(text)) return true;
        return false;
    }

    /*
    *If the left side is not set, add the member there first
    *If not, and right side is empty, add the member there
    *If they're both full, merge it with the right side
    *(This is a consequence of left-to-right parsing)
    */
    @Override
    public void add(Regexp regexp) {
        if(lhsEmpty){
            lhs = regexp;

            lhsEmpty = false;
        }else if(rhsEmpty){
            rhs = regexp;

            rhsEmpty = false;
        }else{
            rhs.add(regexp);
        }
    }

    //If it's not full, it's empty
    @Override
    public Boolean isEmpty() {
        return (lhsEmpty || rhsEmpty);
    }
}

A concatenation, Concat.java, which is basically a list of regexps chained together. This one is complicated.

import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;

public class Concat extends Regexp{
    /*
    *The list of regexps is called product and the 
    *regexps inside called factors
    */
    List<Regexp> product;

    public Concat(){
        product = new ArrayList<Regexp>();
    }

    public Concat(Regexp regexp){
        product = new ArrayList<Regexp>();
        pushRegexp(regexp);
    }

    public Concat(List<Regexp> product) {
        this.product = product;
    }

    //Adding a new regexp pushes it into the list
    public void pushRegexp(Regexp regexp){
        product.add(regexp);
    }
    //Loops over and prints them
    @Override
    public void print() {
        for(Regexp factor: product){
            factor.print();
        }
    }

    /*
    *Builds up a substring approaching the input string.
    *When it matches, it builds another substring from where it 
    *stopped. If the entire string has been pushed, it checks if
    *there's an equal amount of matches and factors.
    */
    @Override
    public Boolean match(String text) {
        ArrayList<Boolean> bools = new ArrayList<Boolean>();

        int start = 0;
        ListIterator<Regexp> itr = product.listIterator();

        Regexp factor = itr.next();

        for(int i = 0; i <= text.length(); i++){
            String test = text.substring(start, i);

            if(factor.match(test)){
                    start = i;
                    bools.add(true);
                    if(itr.hasNext())
                        factor = itr.next();
            }
        }

        return (allTrue(bools) && (start == text.length()));
    }

    private Boolean allTrue(List<Boolean> bools){
        return product.size() == bools.size();
    }

    @Override
    public void add(Regexp regexp) {
        pushRegexp(regexp);
    }

    @Override
    public Boolean isEmpty() {
        return product.isEmpty();
    }
}

Again, I've gotten these to work to my satisfaction with my overhead, tokenization and all that good stuff. Now I want to introduce the Kleene-star operation. It matches any number of occurrences in the text, even 0. So, ba* would match b, ba, baa, baaa and so on, while (ba)* would match the empty string, ba, baba, bababa and so on. Does it even look possible to extend my Regexp to this, or do you see another way of solving this?

PS: There are getters, setters and all kinds of other support functions that I didn't write out, but this is mainly for you to quickly get the point of how these classes work.

You seem to be trying to use a fallback algorithm to do the parsing. That can work, although it is easier to do with higher-order functions, but it is far from the best way to parse regular expressions (by which I mean the things which are mathematically regular expressions, as opposed to the panoply of parsing languages implemented by "regular expression" libraries in various languages).

It's not the best way because the parsing time is not linear in the size of the string to be matched; in fact, it can be exponential. But to understand that, it's important to understand why your current implementation has a problem.

Consider the fairly simple regular expression (ab+a)(bb+a). That can match exactly four strings: abbb, aba, abb, aa. All of those strings start with a, so your concatenation algorithm will match the first concatenand ((ab+a)) at position 1, and proceed to try the second concatenand ((bb+a)). That will successfully match abb and aa, but it will fail on aba and abbb.

Now, suppose you modified the concatenation function to select the longest matching substring rather than the shortest one. In that case, the first subexpression would match ab in three of the possible strings (all but aa), and the match would fail in the case of abb.

In short, when you are matching a concatenation R·S, you need to do something like this:

  • Find some initial string which matches R
  • See if S matches the rest of the text
  • If not, repeat with another initial string which matches R
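
The three steps above can be sketched as a loop over every possible split point. This is only a minimal illustration, not the asker's Concat class: `ConcatSketch`, `union` and `matchConcat` are hypothetical names, and each subexpression is modelled as a `Predicate<String>` that answers whether it matches a whole string.

```java
import java.util.function.Predicate;

// Hypothetical helper for illustration; not part of the classes in the question.
final class ConcatSketch {

    // Whole-string matcher for the union x+y of two literal strings.
    static Predicate<String> union(String x, String y) {
        return text -> text.equals(x) || text.equals(y);
    }

    // Matches R·S: try each prefix as a candidate for R and accept as soon
    // as S matches the remainder. If S fails, move on to the next prefix.
    static boolean matchConcat(Predicate<String> r, Predicate<String> s, String text) {
        for (int split = 0; split <= text.length(); split++) {
            if (r.test(text.substring(0, split)) && s.test(text.substring(split))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Predicate<String> r = union("ab", "a");   // (ab+a)
        Predicate<String> s = union("bb", "a");   // (bb+a)
        for (String text : new String[] {"abbb", "aba", "abb", "aa", "ab"}) {
            System.out.println(text + " -> " + matchConcat(r, s, text));
        }
    }
}
```

With this loop all four strings abbb, aba, abb and aa are accepted, because a failure of the second part sends the search back to try a different prefix for the first part. The string ab is rejected, since no split leaves something that (bb+a) matches.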

In the case of full regular expression matches, it doesn't matter in which order we list matches for R, but usually we're trying to find the longest substring which matches a regular expression, so it is convenient to enumerate the possible matches from longest to shortest.

Doing that means that we need to be able to restart a match after a downstream failure, to find the "next match". That's not terribly complicated, but it definitely complicates the interface, because all of the compound regular expression operators need to "pass through" the failure to their children in order to find the next alternative. That is, the operator R+S might first find something which matches R. If asked for the next possibility, it first has to ask R if there is another string which it could match, before moving on to S. (And that's passing over the question of how to get + to list the matches in order by length.)
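
One way to sidestep a restartable "next match" call, at the cost of computing all alternatives up front, is to let each node report every position at which a match could end. This is a different technique from the restartable interface described above, and everything in the sketch (`Node`, `ends`, `Nodes`) is a hypothetical name rather than the Regexp interface from the question.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical interface for illustration; it differs from Regexp in the question.
interface Node {
    // Every index at which a match beginning at `start` can end.
    Set<Integer> ends(String text, int start);

    // A full match exists if some alternative ends exactly at the end of the text.
    default boolean matches(String text) {
        return ends(text, 0).contains(text.length());
    }
}

final class Nodes {
    // A single literal character.
    static Node lit(char c) {
        return (text, start) -> {
            Set<Integer> out = new TreeSet<>();
            if (start < text.length() && text.charAt(start) == c) {
                out.add(start + 1);
            }
            return out;
        };
    }

    // R+S: the alternatives of R together with the alternatives of S.
    static Node union(Node r, Node s) {
        return (text, start) -> {
            Set<Integer> out = new TreeSet<>(r.ends(text, start));
            out.addAll(s.ends(text, start));
            return out;
        };
    }

    // R·S: continue with S from every position where R could have stopped.
    static Node concat(Node r, Node s) {
        return (text, start) -> {
            Set<Integer> out = new TreeSet<>();
            for (int mid : r.ends(text, start)) {
                out.addAll(s.ends(text, mid));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Node a = lit('a');
        Node b = lit('b');
        // (ab+a)(bb+a)
        Node re = concat(union(concat(a, b), a), union(concat(b, b), a));
        for (String text : new String[] {"abbb", "aba", "abb", "aa", "ab"}) {
            System.out.println(text + " -> " + re.matches(text));
        }
    }
}
```

Because a union hands back the end positions of both children, the concatenation never has to ask a child to resume; it simply tries S from each position in the set.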

With such an implementation, it's easy to see how to implement the Kleene star (R*), and it is also easy to see why it can take exponential time. One possible implementation:

  • First, match as many R as possible.
  • If asked for another match: ask the last R for another match
  • If there are no more possibilities, drop the last R from the list, and ask what is now the last R for another match
  • If none of that worked, propose the empty string as a match
  • Fail

(This can be simplified with recursion: Match an R, then match an R*. For the next match, first try the next R*; failing that, try the next R and the first following R*; when all else fails, try the empty string.)
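
The recursive formulation can be sketched as follows for whole-string matching. It is an illustration under assumptions, not a drop-in class for the hierarchy in the question: `StarSketch` and `matchStar` are hypothetical names, and R is modelled as a `Predicate<String>` that decides whether it matches an entire string.

```java
import java.util.function.Predicate;

// Hypothetical helper for illustration; not part of the classes in the question.
final class StarSketch {

    // Matches R* against the whole text.
    static boolean matchStar(Predicate<String> r, String text) {
        // Try one R (longest candidate first), then R* on whatever is left.
        // Candidates must be non-empty: an R that accepts "" would otherwise
        // make the recursion loop forever on the same text.
        for (int end = text.length(); end >= 1; end--) {
            if (r.test(text.substring(0, end)) && matchStar(r, text.substring(end))) {
                return true;
            }
        }
        // When all else fails, zero repetitions of R match only the empty string.
        return text.isEmpty();
    }

    public static void main(String[] args) {
        Predicate<String> ba = "ba"::equals;   // R = ba, so this tests (ba)*
        for (String text : new String[] {"", "ba", "baba", "bab"}) {
            System.out.println("\"" + text + "\" -> " + matchStar(ba, text));
        }
    }
}
```

Each failed continuation sends the loop back to try a shorter first R. When R can match the same text in many different ways, the number of combinations explored can grow very quickly, which is where the exponential behaviour comes from.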

Implementing that is an interesting programming exercise, so I encourage you to continue. But be aware that there are better algorithms. You might want to read Russ Cox's interesting essays on regular expression matching.
