
How to know if a string could match a regular expression by adding more characters

This is a tricky question, and maybe in the end it has no solution (or at least no reasonable one). I'd like a Java-specific example, but if it can be done at all, I think I could adapt any example.

My goal is to find out whether a string being read from an input stream could still match a given regular expression pattern. In other words, I want to read the stream until the string definitely cannot match the pattern, no matter how many characters are added to it.

A declaration for a minimal method to achieve this could be something like:

boolean couldMatch(CharSequence charsSoFar, Pattern pattern);

Such a method would return true if charsSoFar could still match pattern once new characters are added, or false if it has no chance of matching even with new characters.

To give a more concrete example, say we have a pattern for float numbers like "^([+-]?\\d*\\.?\\d*)$".

With such a pattern, couldMatch would return true for the following example charsSoFar values:

"+"  
"-"  
"123"  
".24"  
"-1.04" 

And so on, because you can keep adding digits to all of these, and also one dot to the first three.

On the other hand, all these examples derived from the previous ones should return false:

"+A"  
"-B"  
"123z"  
".24."  
"-1.04+" 

It's clear at first sight that these will never comply with the aforementioned pattern, no matter how many characters you add to them.

EDIT:

I'm adding my current non-regex approach to make things clearer.

First, I declare the following functional interface:

public interface Matcher {
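    /**
     * Returns the matching part of {@code source}, if any.
     *
     * @param source the characters read so far
     * @return the part of {@code source} that matches, or an empty sequence if none does
     */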
    CharSequence match(CharSequence source);
}

Then, the previous function would be redefined as:

boolean couldMatch(CharSequence charsSoFar, Matcher matcher);

And a (draft) matcher for floats could look like this (note that it does not support the + sign at the start, just the -):

public class FloatMatcher implements Matcher {
    @Override
    public CharSequence match(CharSequence source) {
        StringBuilder rtn = new StringBuilder();

        if (source.length() == 0)
            return "";

        if ("0123456789-.".indexOf(source.charAt(0)) != -1 ) {
            rtn.append(source.charAt(0));
        }

        boolean gotDot = false;
        for (int i = 1; i < source.length(); i++) {
            if (gotDot) {
                if ("0123456789".indexOf(source.charAt(i)) != -1) {
                    rtn.append(source.charAt(i));
                } else
                    return rtn.toString();
            } else if (".0123456789".indexOf(source.charAt(i)) != -1) {
                rtn.append(source.charAt(i));
                if (source.charAt(i) == '.')
                    gotDot = true;
            } else {
                return rtn.toString();
            }
        }
        return rtn.toString();
    }
}

The omitted body of the couldMatch method just calls matcher.match() each time a new character is appended to the source parameter. It returns true while the returned CharSequence is equal to the source parameter, and false as soon as it differs (meaning that the last character added broke the match).
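The comparison described above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the interface is renamed PrefixMatcher here (a hypothetical name) to avoid clashing with java.util.regex.Matcher, and the digits-only matcher in main is a toy stand-in for FloatMatcher.

```java
final class CouldMatchSketch {
    // Hypothetical stand-in for the question's Matcher interface.
    interface PrefixMatcher {
        CharSequence match(CharSequence source);
    }

    /** Returns true while the matcher still accepts everything read so far. */
    static boolean couldMatch(CharSequence charsSoFar, PrefixMatcher matcher) {
        CharSequence matched = matcher.match(charsSoFar);
        // CharSequence does not define equals(), so compare the contents explicitly.
        return matched.toString().contentEquals(charsSoFar);
    }

    public static void main(String[] args) {
        // Toy matcher (assumption): returns the longest leading run of digits.
        PrefixMatcher digits = source -> {
            int i = 0;
            while (i < source.length() && Character.isDigit(source.charAt(i))) {
                i++;
            }
            return source.subSequence(0, i);
        };

        // Feed characters one at a time and stop at the first one that breaks the match.
        StringBuilder soFar = new StringBuilder();
        for (char c : "123z4".toCharArray()) {
            soFar.append(c);
            if (!couldMatch(soFar, digits)) {
                System.out.println("rejected after reading: " + soFar);
                break;
            }
        }
    }
}
```

Run as written, the loop stops once it has read "123z", because the matcher then returns only "123".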

You can do it as easily as

boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
    Matcher m = pattern.matcher(charsSoFar);
    return m.matches() || m.hitEnd();
}

If the sequence does not match and the engine did not reach the end of the input, it implies that there is a contradicting character before the end, which won't go away when more characters are added at the end.

Or, as the documentation says:

Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher.

When this method returns true, then it is possible that more input would have changed the result of the last search.

This is also used internally by the Scanner class to determine whether it should load more data from the source stream for a matching operation.
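To connect this with the original goal of reading from a stream, here is a minimal sketch that keeps appending characters from a Reader while couldMatch stays true. The class name StreamPrefixSketch and the helper readWhileViable are hypothetical, introduced only for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamPrefixSketch {
    static boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
        Matcher m = pattern.matcher(charsSoFar);
        return m.matches() || m.hitEnd();
    }

    /**
     * Hypothetical helper: reads until the accumulated text can no longer match
     * and returns the longest prefix that was still viable. That prefix is not
     * necessarily a complete match itself.
     */
    static String readWhileViable(Reader in, Pattern pattern) throws IOException {
        StringBuilder soFar = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            soFar.append((char) c);
            if (!couldMatch(soFar, pattern)) {
                // Drop the character that made a match impossible.
                soFar.setLength(soFar.length() - 1);
                break;
            }
        }
        return soFar.toString();
    }

    public static void main(String[] args) throws IOException {
        Pattern fpNumber = Pattern.compile("[+-]?\\d*\\.?\\d*");
        // Prints -1.04, since the '+' that follows can never be part of a match.
        System.out.println(readWhileViable(new StringReader("-1.04+rest"), fpNumber));
    }
}
```

Note that the rejected character has already been consumed from the Reader; if it must stay available to later code, wrapping the source in a java.io.PushbackReader is one option. The sketch also re-runs the match on the whole buffer for every character, which is fine for short tokens but quadratic for long input.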

Using the method above with your sample data yields

Pattern fpNumber = Pattern.compile("[+-]?\\d*\\.?\\d*");
String[] positive = {"+", "-", "123", ".24", "-1.04" };
String[] negative = { "+A", "-B", "123z", ".24.", "-1.04+" };
for(String p: positive) {
    System.out.println("should accept more input: "+p
                      +", couldMatch: "+couldMatch(p, fpNumber));
}
for(String n: negative) {
    System.out.println("can never match at all: "+n
                      +", couldMatch: "+couldMatch(n, fpNumber));
}

should accept more input: +, couldMatch: true
should accept more input: -, couldMatch: true
should accept more input: 123, couldMatch: true
should accept more input: .24, couldMatch: true
should accept more input: -1.04, couldMatch: true
can never match at all: +A, couldMatch: false
can never match at all: -B, couldMatch: false
can never match at all: 123z, couldMatch: false
can never match at all: .24., couldMatch: false
can never match at all: -1.04+, couldMatch: false

Of course, this doesn't say anything about the chances of turning nonmatching content into a match. You could still construct patterns for which no additional characters could ever produce a match. However, for ordinary use cases like the floating-point number format, it's reasonable.
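As an illustration of that caveat, consider a deliberately unsatisfiable pattern (an assumption made up for this example, not something from the question). The character class [^\s\S] accepts no character at all, so nothing can ever match, yet hitEnd() still reports true when the engine merely runs out of input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HitEndCaveat {
    static boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
        Matcher m = pattern.matcher(charsSoFar);
        return m.matches() || m.hitEnd();
    }

    public static void main(String[] args) {
        // 'a' followed by a character that is neither whitespace nor non-whitespace:
        // no string can satisfy this pattern.
        Pattern impossible = Pattern.compile("a[^\\s\\S]");

        // true: the engine ran out of input while looking for the second character.
        System.out.println(couldMatch("a", impossible));
        // false: the engine examined 'b' and rejected it before the end of input.
        System.out.println(couldMatch("ab", impossible));
    }
}
```

So a true result means "not ruled out yet", not "some continuation is guaranteed to match".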

I have no specific solution, but you might be able to do this with negations.

If you set up a blacklist of regex patterns that definitely rule out a match with your pattern (e.g. + followed by a letter), you could check against these. If a blacklisted regex matches, you can abort.
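A rough sketch of what such a blacklist might look like for the float pattern from the question follows. The three blacklist patterns are assumptions worked out by hand for this one pattern; a different target pattern would need its own list.

```java
import java.util.regex.Pattern;

final class BlacklistSketch {
    // Hypothetical blacklist for the float pattern [+-]?\d*\.?\d*
    private static final Pattern[] BLACKLIST = {
        Pattern.compile("[^0-9+.-]"), // a character outside the float alphabet
        Pattern.compile("\\..*\\."),  // a second decimal point
        Pattern.compile(".[+-]")      // a sign anywhere except the first position
    };

    /** Returns true if any blacklisted pattern occurs somewhere in the input. */
    static boolean isRuledOut(CharSequence charsSoFar) {
        for (Pattern p : BLACKLIST) {
            if (p.matcher(charsSoFar).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = { "+", "-1.04", "+A", ".24.", "-1.04+" };
        for (String s : samples) {
            System.out.println(s + " ruled out: " + isRuledOut(s));
        }
    }
}
```

The drawback is that the blacklist has to be kept in sync with the target pattern by hand, and a missing entry silently lets hopeless input through.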

Another idea is to use negative lookaheads ( https://www.regular-expressions.info/lookaround.html ).
