
How to know if a string could match a regular expression by adding more characters

This is a tricky question, and maybe in the end it has no solution (or not a reasonable one, at least). I'd like a Java-specific example, but if it can be done at all, I think I could adapt an example from any language.

My goal is to find a way of knowing whether a string being read from an input stream could still match a given regular expression pattern. In other words, I want to read the stream until I have a string that definitely will not match the pattern, no matter how many characters are added to it.

A minimal method declaration for this could be something like:

boolean couldMatch(CharSequence charsSoFar, Pattern pattern);

Such a method would return true if charsSoFar could still match pattern once new characters are added, or false if it has no chance of matching even with new characters added.

To give a more concrete example, say we have a pattern for floating-point numbers like "^([+-]?\\d*\\.?\\d*)$" .

With such a pattern, couldMatch would return true for the following example values of charsSoFar:

"+"  
"-"  
"123"  
".24"  
"-1.04" 

And so on, because you can keep adding digits to all of these, and the first three can also still take one dot.

On the other hand, all of these examples, derived from the previous ones, should return false:

"+A"  
"-B"  
"123z"  
".24."  
"-1.04+" 

It's clear at first sight that these will never comply with the aforementioned pattern, no matter how many characters you add to them.

EDIT:

I'm adding my current non-regex approach here to make things clearer.

First, I declare the following functional interface:

public interface Matcher {
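    /**
     * Returns the matching part of "source", if any.
     *
     * @param source the characters read so far
     * @return the leading part of source that matches, or an empty sequence if nothing matches
     */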
    CharSequence match(CharSequence source);
}

Then, the previous function would be redefined as:

boolean couldMatch(CharSequence charsSoFar, Matcher matcher);

And a draft matcher for floats could look like this (note that it supports only a leading -, not a leading +):

public class FloatMatcher implements Matcher {
    @Override
    public CharSequence match(CharSequence source) {
        StringBuilder rtn = new StringBuilder();

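        if (source.length() == 0)
            return "";

        // The first character may be a digit, a minus sign, or the decimal point.
        // If it is anything else, nothing matches at all.
        if ("0123456789-.".indexOf(source.charAt(0)) == -1)
            return "";
        rtn.append(source.charAt(0));

        // A leading '.' already uses up the single allowed decimal point.
        boolean gotDot = source.charAt(0) == '.';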
        for (int i = 1; i < source.length(); i++) {
            if (gotDot) {
                if ("0123456789".indexOf(source.charAt(i)) != -1) {
                    rtn.append(source.charAt(i));
                } else
                    return rtn.toString();
            } else if (".0123456789".indexOf(source.charAt(i)) != -1) {
                rtn.append(source.charAt(i));
                if (source.charAt(i) == '.')
                    gotDot = true;
            } else {
                return rtn.toString();
            }
        }
        return rtn.toString();
    }
}

The omitted body of couldMatch just calls matcher.match() each time a new character is appended to the source parameter. It returns true as long as the returned CharSequence equals the source parameter, and false as soon as it differs (meaning that the last character added broke the match).
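
As a rough illustration of that description, the omitted body could be sketched as below. This is only a sketch under assumptions: the wrapper class CouldMatchSketch, the nested copy of the Matcher interface, the digits-only matcher DIGITS, and the helper longestViablePrefix are hypothetical names introduced to keep the example self-contained.

```java
class CouldMatchSketch {
    /** Same shape as the Matcher interface above; nested here only for self-containment. */
    interface Matcher {
        CharSequence match(CharSequence source);
    }

    /** Hypothetical stand-in matcher: returns the leading run of ASCII digits. */
    static final Matcher DIGITS = source -> {
        int i = 0;
        while (i < source.length() && source.charAt(i) >= '0' && source.charAt(i) <= '9') {
            i++;
        }
        return source.subSequence(0, i);
    };

    /** True while the matcher still accepts every character read so far. */
    static boolean couldMatch(CharSequence charsSoFar, Matcher matcher) {
        return matcher.match(charsSoFar).toString().contentEquals(charsSoFar);
    }

    /** Feeds the input one character at a time and returns the longest prefix that could still match. */
    static String longestViablePrefix(CharSequence input, Matcher matcher) {
        StringBuilder soFar = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            soFar.append(input.charAt(i));
            if (!couldMatch(soFar, matcher)) {
                soFar.setLength(soFar.length() - 1); // the last character broke the match
                break;
            }
        }
        return soFar.toString();
    }
}
```

With the digits-only matcher, longestViablePrefix("123z4", DIGITS) stops at the 'z' and returns "123".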

You can do it as easily as this (Matcher here is java.util.regex.Matcher, not the interface from the question):

boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
    Matcher m = pattern.matcher(charsSoFar);
    return m.matches() || m.hitEnd();
}

If the sequence does not match and the engine did not reach the end of the input, it implies that there is a contradicting character before the end, which won't go away when adding more characters at the end.

Or, as the documentation says:

Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher.

When this method returns true, then it is possible that more input would have changed the result of the last search.

This is also used by the Scanner class internally, to determine whether it should load more data from the source stream for a matching operation.
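
To connect this with the stream-reading goal in the question, here is a minimal sketch that reads from a Reader one character at a time and stops at the first character that rules out a match. The class name StreamPrefixReader and the method readWhileCouldMatch are hypothetical; only Pattern, Matcher.matches(), Matcher.hitEnd(), and Reader.read() are real API.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreamPrefixReader {
    static boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
        Matcher m = pattern.matcher(charsSoFar);
        return m.matches() || m.hitEnd();
    }

    /**
     * Reads characters until one makes a match impossible (or the stream ends)
     * and returns the characters accepted up to that point.
     */
    static String readWhileCouldMatch(Reader in, Pattern pattern) throws IOException {
        StringBuilder soFar = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            soFar.append((char) c);
            if (!couldMatch(soFar, pattern)) {
                soFar.setLength(soFar.length() - 1); // drop the offending character
                break;
            }
        }
        return soFar.toString();
    }
}
```

Note that this sketch re-runs the pattern over the whole buffer for every character, which is fine for short tokens but quadratic for long input, and that the offending character has already been consumed from the Reader when the method returns. Called with a StringReader over "-1.04+2" and the floating-point pattern, it returns "-1.04".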

Using the method above with your sample data yields

Pattern fpNumber = Pattern.compile("[+-]?\\d*\\.?\\d*");
String[] positive = {"+", "-", "123", ".24", "-1.04" };
String[] negative = { "+A", "-B", "123z", ".24.", "-1.04+" };
for(String p: positive) {
    System.out.println("should accept more input: "+p
                      +", couldMatch: "+couldMatch(p, fpNumber));
}
for(String n: negative) {
    System.out.println("can never match at all: "+n
                      +", couldMatch: "+couldMatch(n, fpNumber));
}

This prints:

should accept more input: +, couldMatch: true
should accept more input: -, couldMatch: true
should accept more input: 123, couldMatch: true
should accept more input: .24, couldMatch: true
should accept more input: -1.04, couldMatch: true
can never match at all: +A, couldMatch: false
can never match at all: -B, couldMatch: false
can never match at all: 123z, couldMatch: false
can never match at all: .24., couldMatch: false
can never match at all: -1.04+, couldMatch: false

Of course, this doesn't say anything about the chances of turning nonmatching content into a match: hitEnd() returning true only means that more input might change the result, not that some continuation actually matches. You could still construct patterns for which no additional characters could ever produce a match. However, for ordinary use cases like the floating-point number format, it's reasonable.
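
A small sketch of that caveat, using a deliberately contrived pattern (the class name HitEndCaveat and the constant IMPOSSIBLE are made up for this illustration). The pattern a\bb can never match, because a word boundary cannot sit between two word characters, yet the engine still hits the end of the input "a" while looking for the b:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HitEndCaveat {
    // Contrived pattern: \b between 'a' and 'b' can never succeed,
    // so no input whatsoever matches it.
    static final Pattern IMPOSSIBLE = Pattern.compile("a\\bb");

    static boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
        Matcher m = pattern.matcher(charsSoFar);
        return m.matches() || m.hitEnd();
    }
}
```

Here couldMatch("a", IMPOSSIBLE) returns true although appending "b" (or anything else) never yields a match, so a true result should be read as "not ruled out yet" rather than "can be completed".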

I have no specific solution, but you might be able to do this with negations.

If you set up a blacklist of regex patterns that definitely rule out a match with your pattern (e.g. + followed by a letter), you could check against these. If a blacklisted regex matches, you can abort.
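
As a rough sketch of the blacklist idea for the floating-point format from the question: the three blacklist patterns below are my own guesses at what "definitely cannot match" looks like for that format, and the class name BlacklistCheck is hypothetical. A different target pattern would need its own hand-written blacklist.

```java
import java.util.regex.Pattern;

class BlacklistCheck {
    // Hypothetical blacklist for the float format [+-]?\d*\.?\d*
    static final Pattern[] BLACKLIST = {
        Pattern.compile("[^0-9+.\\-]"), // a character that is never allowed
        Pattern.compile(".[+-]"),       // a sign anywhere but the first position
        Pattern.compile("\\..*\\.")     // a second decimal point
    };

    /** Returns false as soon as any blacklisted pattern is found in the input. */
    static boolean couldMatch(CharSequence charsSoFar) {
        for (Pattern bad : BLACKLIST) {
            if (bad.matcher(charsSoFar).find()) {
                return false;
            }
        }
        return true;
    }
}
```

With these three patterns, the positive samples from the question ("+", "-", "123", ".24", "-1.04") pass and the negative ones ("+A", "-B", "123z", ".24.", "-1.04+") are rejected.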

Another idea is to use negative lookaheads ( https://www.regular-expressions.info/lookaround.html )
