
Modelling a regular expression parser with polymorphism

So, I'm doing a regular expression parser for school that creates a hierarchy of objects in charge of the matching. I decided to do it object oriented because it's easier for me to imagine an implementation of the grammar that way. So, these are my classes making up the regular expressions. It's all in Java, but I think you can follow along if you're proficient in any object oriented language.

The only operators we're required to implement are Union (+), Kleene-Star (*), Concatenation of expressions (ab or maybe (a+b)c) and of course the Parentheses, as illustrated in the Concatenation example. This is what I've implemented right now, and I've got it to work like a charm with a bit of overhead in the main.

The parent class, Regexp.java

public abstract class Regexp {

    //Print out the regular expression it's holding
    //Used for debugging purposes
    abstract public void print();

    //Checks if the string matches the expression it's holding
    abstract public Boolean match(String text);

    //Adds a regular expression to be operated upon by the operators
    abstract public void add(Regexp regexp);

    /*
    *To help the main with the overhead to help it decide which regexp will
    *hold the other
    */
    abstract public Boolean isEmpty();

}

There's the most simple regexp, Base.java, which holds a char and returns true if the string matches the char.

public class Base extends Regexp{
    //Boxed Character rather than char, so that null can mean "empty"
    Character c;

    public Base(char c){
        this.c = c;
    }

    public Base(){
        c = null;
    }

    @Override
    public void print() {
        System.out.println(c);
    }

    //If the string is the char, return true
    @Override
    public Boolean match(String text) {
        if(text.length() > 1) return false;
        return text.startsWith(""+c);
    }

    //Not utilized, since base is only contained and cannot contain
    @Override
    public void add(Regexp regexp) {

    }

    @Override
    public Boolean isEmpty() {
        return c == null;
    }

}

A parenthesis, Paren.java, to hold a regexp inside it. Nothing really fancy here, but illustrates how matching works.

public class Paren extends Regexp{
    //Member variables: What it's holding and if it's holding something
    private Regexp regexp;
    Boolean empty;

    //Parenthesis starts out empty
    public Paren(){
        empty = true;
    }

    //Unless you create it with something to hold
    public Paren(Regexp regexp){
        this.regexp = regexp;
        empty = false;
    }

    //Print out what it's holding
    @Override
    public void print() {
        regexp.print();
    }

    //Real simple; either what you're holding matches the string or it doesn't
    @Override
    public Boolean match(String text) {
        return regexp.match(text);
    }

    //Pass something for it to hold, then it's not empty
    @Override
    public void add(Regexp regexp) {
        this.regexp = regexp;
        empty = false;
    }

    //Return if it's holding something
    @Override
    public Boolean isEmpty() {
        return empty;
    }

}

A Union.java, which is two regexps that can be matched. If one of them is matched, the whole Union is a match.

public class Union extends Regexp{
    //Members
    Regexp lhs;
    Regexp rhs;

    //Indicating if there's room to push more stuff in
    private Boolean lhsEmpty;
    private Boolean rhsEmpty;

    public Union(){
        lhsEmpty = true;
        rhsEmpty = true;
    }

    //Can start out with something on the left side
    public Union(Regexp lhs){
        this.lhs = lhs;

        lhsEmpty = false;
        rhsEmpty = true;
    }

    //Or with both members set
    public Union(Regexp lhs, Regexp rhs) {
        this.lhs = lhs;
        this.rhs = rhs;

        lhsEmpty = false;
        rhsEmpty = false;
    }

    //Some stuff to help me see the unions format when I'm debugging
    @Override
    public void print() {
        System.out.println("(");
        lhs.print();
        System.out.println("union");
        rhs.print();
        System.out.println(")");

    }

    //If the string matches the left side or right side, it's a match
    @Override
    public Boolean match(String text) {
        return lhs.match(text) || rhs.match(text);
    }

    /*
    *If the left side is not set, add the member there first
    *If not, and right side is empty, add the member there
    *If they're both full, merge it with the right side
    *(This is a consequence of left-to-right parsing)
    */
    @Override
    public void add(Regexp regexp) {
        if(lhsEmpty){
            lhs = regexp;

            lhsEmpty = false;
        }else if(rhsEmpty){
            rhs = regexp;

            rhsEmpty = false;
        }else{
            rhs.add(regexp);
        }
    }

    //If it's not full, it's empty
    @Override
    public Boolean isEmpty() {
        return (lhsEmpty || rhsEmpty);
    }
}

A concatenation, Concat.java, which is basically a list of regexps chained together. This one is complicated.

import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;

public class Concat extends Regexp{
    /*
    *The list of regexps is called product and the 
    *regexps inside called factors
    */
    List<Regexp> product;

    public Concat(){
        product = new ArrayList<Regexp>();
    }

    public Concat(Regexp regexp){
        product = new ArrayList<Regexp>();
        pushRegexp(regexp);
    }

    public Concat(List<Regexp> product) {
        this.product = product;
    }

    //Adding a new regexp pushes it into the list
    public void pushRegexp(Regexp regexp){
        product.add(regexp);
    }
    //Loops over and prints them
    @Override
    public void print() {
        for(Regexp factor: product){
            factor.print();
        }
    }

    /*
    *Builds up a substring approaching the input string.
    *When it matches, it builds another substring from where it 
    *stopped. If the entire string has been pushed, it checks if
    *there's an equal amount of matches and factors.
    */
    @Override
    public Boolean match(String text) {
        ArrayList<Boolean> bools = new ArrayList<Boolean>();

        int start = 0;
        ListIterator<Regexp> itr = product.listIterator();

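        //An empty concatenation matches only the empty string
        //(itr.next() would throw on an empty list)
        if(product.isEmpty()) return text.isEmpty();

        Regexp factor = itr.next();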

        for(int i = 0; i <= text.length(); i++){
            String test = text.substring(start, i);

            if(factor.match(test)){
                    start = i;
                    bools.add(true);
                    if(itr.hasNext())
                        factor = itr.next();
            }
        }

        return (allTrue(bools) && (start == text.length()));
    }

    private Boolean allTrue(List<Boolean> bools){
        return product.size() == bools.size();
    }

    @Override
    public void add(Regexp regexp) {
        pushRegexp(regexp);
    }

    @Override
    public Boolean isEmpty() {
        return product.isEmpty();
    }
}

Again, I've gotten these to work to my satisfaction with my overhead, tokenization and all that good stuff. Now I want to introduce the Kleene-star operation. It matches any number of occurrences in the text, even 0. So, ba* would match b, ba, baa, baaa and so on, while (ba)* would match the empty string, ba, baba, bababa and so on. Does it even look possible to extend my Regexp to this, or do you see another way of solving this?

PS: There are getters, setters and all kinds of other support functions that I didn't write out, but this is mainly for you to quickly get the point of how these classes work.

You seem to be trying to use a backtracking algorithm to do the matching. That can work, although it is easier to do with higher-order functions, but it is far from the best way to match regular expressions (by which I mean the things which are mathematically regular expressions, as opposed to the panoply of parsing languages implemented by "regular expression" libraries in various languages).

It's not the best way because the matching time is not linear in the size of the string to be matched; in fact, it can be exponential. But to understand that, it's important to understand why your current implementation has a problem.

Consider the fairly simple regular expression (ab+a)(bb+a). That can match exactly four strings: abbb, aba, abb, aa. All of those strings start with a, so your concatenation algorithm will match the first concatenand, (ab+a), against just the leading a, and proceed to try the second concatenand, (bb+a), on what follows. That will successfully match abb and aa, but it will fail on aba and abbb, because the second concatenand matches neither ba nor bbb.

Now, suppose you modified the concatenation function to select the longest matching substring rather than the shortest one. In that case, the first subexpression would match ab in three of the possible strings (all but aa), and the match would fail in the case of abb.
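To make the two failure cases concrete, here is a minimal sketch of a concatenation that commits to the first prefix it finds. It is a simplified model of the question's Concat loop for two factors, not the original code: the sub-expressions are hypothetical Predicate<String> stand-ins (R and S), and greedyConcat and its longestFirst flag are names invented for this illustration.

```java
import java.util.function.Predicate;

class GreedyConcatSketch {
    // Hypothetical stand-ins for the two halves of (ab+a)(bb+a).
    static final Predicate<String> R = t -> t.equals("ab") || t.equals("a");
    static final Predicate<String> S = t -> t.equals("bb") || t.equals("a");

    // Commits to the first prefix that r accepts and never reconsiders it.
    static boolean greedyConcat(Predicate<String> r, Predicate<String> s,
                                String text, boolean longestFirst) {
        int n = text.length();
        for (int k = 0; k <= n; k++) {
            int split = longestFirst ? n - k : k;
            if (r.test(text.substring(0, split))) {
                // No second chance: the answer depends on this one split.
                return s.test(text.substring(split));
            }
        }
        return false; // r accepted no prefix at all
    }

    public static void main(String[] args) {
        for (String text : new String[] {"abbb", "aba", "abb", "aa"}) {
            System.out.println(text
                + " shortest-first=" + greedyConcat(R, S, text, false)
                + " longest-first=" + greedyConcat(R, S, text, true));
        }
        // abbb shortest-first=false longest-first=true
        // aba shortest-first=false longest-first=true
        // abb shortest-first=true longest-first=false
        // aa shortest-first=true longest-first=true
    }
}
```

Neither ordering accepts all four strings, which is why committing to a single split is not enough.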

In short, when you are matching a concatenation R·S, you need to do something like this:

  • Find some initial string which matches R
  • See if S matches the rest of the text
  • If not, repeat with another initial string which matches R

In the case of full regular expression matches, it doesn't matter which order we list matches for R, but usually we're trying to find the longest substring which matches a regular expression, so it is convenient to enumerate the possible matches from longest to shortest.
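For a full-string match, the three steps above can be sketched as a loop over every split point, longest prefix first. This is only an illustration under stated assumptions: R and S are hypothetical Predicate<String> stand-ins for the two sub-expressions, and matchConcat is a name made up for this sketch rather than part of the question's classes.

```java
import java.util.function.Predicate;

class BacktrackingConcatSketch {
    // Hypothetical stand-ins for the two halves of (ab+a)(bb+a).
    static final Predicate<String> R = t -> t.equals("ab") || t.equals("a");
    static final Predicate<String> S = t -> t.equals("bb") || t.equals("a");

    // Full-string match of R·S: try each prefix for r, longest first,
    // and move on to the next candidate whenever s rejects the rest.
    static boolean matchConcat(Predicate<String> r, Predicate<String> s, String text) {
        for (int split = text.length(); split >= 0; split--) {
            if (r.test(text.substring(0, split)) && s.test(text.substring(split))) {
                return true;
            }
        }
        return false; // every split point was tried and rejected
    }

    public static void main(String[] args) {
        for (String text : new String[] {"abbb", "aba", "abb", "aa", "ab"}) {
            System.out.println(text + " -> " + matchConcat(R, S, text));
        }
        // abbb -> true
        // aba -> true
        // abb -> true
        // aa -> true
        // ab -> false
    }
}
```

The difference from a greedy version is only that a failure of S sends the loop back to try the next candidate for R.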

Doing that means that we need to be able to restart a match after a downstream failure, to find the "next match". That's not terribly complicated, but it definitely complicates the interface, because all of the compound regular expression operators need to "pass through" the failure to their children in order to find the next alternative. That is, the operator R+S might first find something which matches R. If asked for the next possibility, it first has to ask R if there is another string which it could match, before moving on to S. (And that's passing over the question of how to get + to list the matches in order by length.)
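One way to picture such an interface is sketched below. Note that it swaps in a simpler, eager variant: instead of a restartable "next match" call, each node returns all positions where a match starting at a given index could end, already sorted longest first, and the parent iterates over them. The names Node, Lit, Alt, Seq, ends and fullMatch are assumptions invented for this sketch; they are not the question's classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class NextMatchSketch {
    interface Node {
        // Every index where a match beginning at `start` can end, longest first.
        List<Integer> ends(String text, int start);
    }

    static final class Lit implements Node {
        private final char c;
        Lit(char c) { this.c = c; }
        public List<Integer> ends(String text, int start) {
            List<Integer> out = new ArrayList<>();
            if (start < text.length() && text.charAt(start) == c) out.add(start + 1);
            return out;
        }
    }

    static final class Alt implements Node { // R+S
        private final Node lhs, rhs;
        Alt(Node lhs, Node rhs) { this.lhs = lhs; this.rhs = rhs; }
        public List<Integer> ends(String text, int start) {
            // Merging both sides into a descending set orders candidates by length.
            TreeSet<Integer> merged = new TreeSet<>(Comparator.reverseOrder());
            merged.addAll(lhs.ends(text, start));
            merged.addAll(rhs.ends(text, start));
            return new ArrayList<>(merged);
        }
    }

    static final class Seq implements Node { // R·S
        private final Node first, second;
        Seq(Node first, Node second) { this.first = first; this.second = second; }
        public List<Integer> ends(String text, int start) {
            TreeSet<Integer> merged = new TreeSet<>(Comparator.reverseOrder());
            // Each candidate end of `first` is a place to restart `second`.
            for (int mid : first.ends(text, start)) merged.addAll(second.ends(text, mid));
            return new ArrayList<>(merged);
        }
    }

    static boolean fullMatch(Node n, String text) {
        return n.ends(text, 0).contains(text.length());
    }

    // Builds (ab+a)(bb+a) for the demonstration.
    static Node example() {
        Node left = new Alt(new Seq(new Lit('a'), new Lit('b')), new Lit('a'));
        Node right = new Alt(new Seq(new Lit('b'), new Lit('b')), new Lit('a'));
        return new Seq(left, right);
    }

    public static void main(String[] args) {
        for (String text : new String[] {"abbb", "aba", "abb", "aa", "ab"}) {
            System.out.println(text + " -> " + fullMatch(example(), text));
        }
    }
}
```

Returning every candidate at once trades the lazy "ask again" protocol for a simpler interface; a lazy version would hand back an iterator over the same candidates.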

With such an implementation, it's easy to see how to implement the Kleene star (R*), and it is also easy to see why it can take exponential time. One possible implementation:

  • First, match as many R as possible.
  • If asked for another match: ask the last R for another match
  • If there are no more possibilities, drop the last R from the list, and ask what is now the last R for another match
  • If none of that worked, propose the empty string as a match
  • If asked again after that, fail

(This can be simplified with recursion: Match an R, then match an R*. For the next match, first try the next R*; failing that try the next R and the first following R*; when all else fails, try the empty string.)
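The recursive formulation can be sketched with higher-order functions for full-string matches. This is a rough illustration, not a definitive implementation: lit, concat and star are hypothetical helpers built on Predicate<String>, and star only tries non-empty prefixes for R so that each recursive step consumes input and the recursion terminates.

```java
import java.util.function.Predicate;

class StarSketch {
    // Matches exactly the given literal string.
    static Predicate<String> lit(String s) {
        return t -> t.equals(s);
    }

    // R·S: try every split point, longest prefix for r first.
    static Predicate<String> concat(Predicate<String> r, Predicate<String> s) {
        return t -> {
            for (int split = t.length(); split >= 0; split--) {
                if (r.test(t.substring(0, split)) && s.test(t.substring(split))) return true;
            }
            return false;
        };
    }

    // R*: either nothing is left, or some non-empty prefix matches R
    // and the remainder matches R* again.
    static Predicate<String> star(Predicate<String> r) {
        return new Predicate<String>() {
            @Override
            public boolean test(String t) {
                if (t.isEmpty()) return true; // zero repetitions
                for (int split = t.length(); split >= 1; split--) {
                    if (r.test(t.substring(0, split)) && this.test(t.substring(split))) return true;
                }
                return false;
            }
        };
    }

    // ba* and (ba)* from the question.
    static final Predicate<String> BA_STAR = concat(lit("b"), star(lit("a")));
    static final Predicate<String> GROUP_STAR = star(concat(lit("b"), lit("a")));

    public static void main(String[] args) {
        for (String text : new String[] {"", "b", "ba", "baaa", "baba", "bab"}) {
            System.out.println("\"" + text + "\" ba*=" + BA_STAR.test(text)
                + " (ba)*=" + GROUP_STAR.test(text));
        }
    }
}
```

Because every failed remainder sends the loop back to try a shorter prefix, the running time of this sketch can grow exponentially on nested or ambiguous patterns.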

Implementing that is an interesting programming exercise, so I encourage you to continue. But be aware that there are better algorithms. You might want to read Russ Cox's interesting essays on regular expression matching.
